Saturday, July 30, 2011

Pentaho Data Integration 4 Cookbook - Book Review

Book Review: Pentaho Data Integration 4 Cookbook

Pentaho Data Integration 4 Cookbook by María Carina Roldan and Adrián Sergio Pulvirenti illustrates with plenty examples how data can be imported from various sources, transformed and exported to various sources with the open source tool Kettle (Pentaho Data Integration). It's all a hands-down approach: Follow the step by step guide of the examples and you will have the right knowledge at the end of the book to create your own data integration solutions. The book is easy to read even for novices and offers a great introduction to the world of Pentaho Kettle.

In times of economic downturn and an extremely competitive market companies are trying to save money where ever possible. In the last few years we have seen a trend towards open source business intelligence solutions. The quality and functionality of open source business intelligence tools has increased over time as well and nowadays they are a very convincing alternative to very costly commercial business intelligence solutions.

For a lot of open source business intelligence tools, documentation has been a weak point in the past. Some products have a really strong community following and maybe a Wiki, which were often you main source of information. In the last 2 years this situation changed: There are now some publications available that cover some of these open source business intelligence packages, one of the very recent one being Pentaho Data Integration 4 Cookbook by María Carina Roldan and Adrián Sergio Pulvirenti.

Data Integration (ETL, short for extract, transform, load data) is one of the most important building blocks of a business intelligence solution. You can have the fanciest dashboards and reports, but if you don't have the correct data for them, they are pretty useless. ETL involves sourcing that data from a database, text file, Excel spreadsheet etc, transforming the data and outputting it to any convenient file format or database. Most ETL tools offer a graphical user interface which allows you to create data flows. The beauty of this is that there is hardly any coding involved, it is easy to understand (as you are looking at a flow diagram) and you are working quicker as well (you are not coding everything). Nowadays, the probably two most popular open source ETL tools are Pentaho Data Integration and Talend.  

Pentaho Data Integration 4 Cookbook is an ideal book for somebody who looks for a hands-down tutorial-style book. It is filled with an abundance of examples and you can download the accompanying files from the dedicated website as well. Now what does this book cover:

The first chapter is all about how to get data from databases, parameterizing SQL queries and updating/deleting records from database tables. Especially interesting is the section about using parameters with your SQL query to dynamically source your data. The authors also explain how you can generate primary key values for your dataset in Kettle in case your database table requires them.
The second chapter focuses on getting data from text files. Rich examples are provided to handle structured as well as instructed text files. Especially the section about working with various forms of unstructured text files is really informative. The last section of the chapter is all about retrieving data from Excel spreadsheets. Chapter 3 focuses on working with XML files: Kettle offers features to read, validate and generate XML documents. 
Kettle is not only an ETL tool, but also offers workflow management as well: This book gives a good overview on how to create jobs, which allow to run transformations and other tasks (e.g. send email email notifications, create folders, check if a file exists, etc) in sequential order. File management is covered next, which looks at copying, moving and deleting files and folders. Once you imported the data, you also might have to look up some additional data from other sources, so chapter 5 gives you a good overview on this topic. There is really one nifty example: It is about transforming data about museums and looking about the opening hours by using a web service. 
Up next: Learn how to split, merge and compare data streams. Chapter 7 shows you how to execute and reuse jobs and transformations. Then it's all about how you can integrate your Kettle ETL processes with the other Pentaho products (Business Intelligence Server, Report Designer, Dashboards etc). The final chapter learns you the tricks on how get the most out of Kettle.

All in all, this book is in excellent read for somebody who wants to have a quick start in open source ETL process design. The book is fully based on examples, which explained in an easy to understand language. There are many illustrations as well and at the end of each example you will find a summary and sometimes hints on where to find more info. This book is ideal for somebody who wants to get up and running with this popular open source ETL tool in a short amount of time. If you interested in this book, you can get it directly from here.

No comments:

Post a Comment