
Sunday, February 9, 2014

Pentaho for Big Data Analytics book review

When I first heard about this book, I got quite excited. I looked up the info on the Packt website, and when I saw the page count, 118 pages, a big question mark came up. Then I had a look at the table of contents, and suddenly all my excitement was gone.

The book feels and reads like a marketing booklet that Pentaho themselves could have published under a title like 'Getting started with Pentaho Big Data within 6 hours'. Certainly, for somebody completely new to this topic, such a high-level overview is a great introduction. But if you are already a bit familiar with Pentaho and know a little about Big Data, I can't quite see what you would gain from reading this book. Don't get me wrong: the book is well written and easy to understand, but most of the chapters just scratch the surface, in the sense that they help you get started but don't go into any further detail. The only chapter that provides a bit more depth is the one on CDE. The chapter on Pentaho Report Designer only shows you how to open an existing report (from the biserver) and walks you through the structure of a report.

What I was really expecting to find in this book were some detailed examples of using Pentaho Kettle with Hadoop. The only thing covered is copying a file to HDFS, loading it into Hive and exporting a dataset from Hive again, which is fairly easy to accomplish. At the very minimum, creating a simple MapReduce job in Kettle (like the famous wordcount example, sketched below) could have been covered. And even then, there could have been so much more written about this topic.
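For readers who haven't seen it, the wordcount example boils down to a mapper that emits a count of 1 for every word and a reducer that sums those counts per word. Below is a minimal sketch of that logic in Python for Hadoop Streaming; the file names (mapper.py, reducer.py) are just illustrative, and in Kettle you would of course build the equivalent logic visually as a transformation wired into a Pentaho MapReduce job rather than writing code.

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))

#!/usr/bin/env python
# reducer.py: sum the counts per word; Hadoop sorts the mapper output by key,
# so all lines belonging to the same word arrive consecutively
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

You would then submit the two scripts with the hadoop-streaming jar that ships with your Hadoop distribution, passing them via -files, -mapper and -reducer together with -input and -output paths on HDFS.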

Another point: who actually uses Hive as the data source of choice for powering a dashboard? If the data source has to be something related to Big Data, why not use Impala (or a similar project), where latency wouldn't be such an issue? Or follow the common approach and export the prepared data to a columnar database like MonetDB.

So to sum it up: if you are new to Pentaho and new to Big Data, this book is well worth a read as a brief introduction. It will help you configure most Pentaho components correctly within a short amount of time and give you some ideas of what can be achieved. Take it as a starting point; more detailed questions will have to be answered by other sources.

Thursday, March 1, 2012

Talend open sources Big Data features

Just a month ago I reported here that Pentaho open sourced the Big Data features in their data integration tool (Kettle). And yesterday Talend revealed on their blog that they are about to release a new Talend Open Studio for Big Data. This version will natively support the Hadoop Distributed File System (HDFS), Pig, HBase, Sqoop and Hive. Moreover, Talend Open Studio for Big Data will be bundled with Hortonworks' Apache Hadoop distribution.
These are exciting times for data integration experts and companies alike. First of all, this means more choice in terms of open source data integration tools. Secondly, competition is always good and vital to a product's future development.