Sunday, February 9, 2014

Pentaho for Big Data Analytics book review

When I first heard about this book, I got quite excited. I looked up the info on the Packt website, and when I saw the page count, 118 pages, a big question mark came up. Then I had a look at the table of contents, and suddenly all my excitement was gone.

The book feels and reads like a marketing booklet that Pentaho themselves could have published with a title like 'Getting started with Pentaho Big Data within 6 hours'. Certainly, for somebody completely new to this topic, such a high-level overview is a great introduction. But if you are already a bit familiar with Pentaho and know a little bit about Big Data, I can't quite see what you would gain from reading this book. Don't get me wrong: the book is well written and easy to understand, but most of the chapters just scratch the surface, in the sense that they help you get started but then don't go into any further detail. The only chapter that probably provides a bit more detail is the one on CDE. The chapter on Pentaho Report Designer only shows you how to open an existing report (from the BI Server) and talks you through the structure of a report.

One thing that I was really expecting to find in this book was a set of detailed examples of using Pentaho Kettle with Hadoop. The only thing covered is copying a file to HDFS, loading it into Hive and exporting a dataset from Hive, which is fairly easy to accomplish. At the very minimum, creating a simple MapReduce job in Kettle (like the famous wordcount example) could have been covered. And even then, so much more could have been written about this topic.
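For readers who haven't come across the wordcount example: it is the "hello world" of MapReduce. The snippet below is only a minimal sketch of the idea in plain Python, not how it would be built in Kettle or run on Hadoop; the function names (`map_phase`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(pairs):
    """Reducer: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over all lines, then reduce the combined pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_phase(line))
    return reduce_phase(pairs)


if __name__ == "__main__":
    print(word_count(["big data", "Big deal"]))
```

In a real Hadoop job the mapper and reducer run distributed across the cluster, with a shuffle step grouping the pairs by word in between; here everything simply happens in one process.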

Another point: who actually uses Hive as the data source of choice for powering a dashboard? If the data source has to be something related to Big Data, why not use Impala (or a similar project), where latency wouldn't be such an issue? Or follow the common approach and export the prepared data to a columnar DB like MonetDB.

So to sum it up: if you are new to Pentaho and new to Big Data, this book is well worth a read as a brief introduction. It will help you configure most Pentaho components correctly within a short amount of time and give you some ideas of what can be achieved. Take it as a starting point; more detailed questions will then have to be answered by other sources.
