Saturday, July 30, 2011

Pentaho Data Integration 4 Cookbook - Book Review

Book Review: Pentaho Data Integration 4 Cookbook

Pentaho Data Integration 4 Cookbook by María Carina Roldan and Adrián Sergio Pulvirenti illustrates with plenty examples how data can be imported from various sources, transformed and exported to various sources with the open source tool Kettle (Pentaho Data Integration). It's all a hands-down approach: Follow the step by step guide of the examples and you will have the right knowledge at the end of the book to create your own data integration solutions. The book is easy to read even for novices and offers a great introduction to the world of Pentaho Kettle.

In times of economic downturn and an extremely competitive market companies are trying to save money where ever possible. In the last few years we have seen a trend towards open source business intelligence solutions. The quality and functionality of open source business intelligence tools has increased over time as well and nowadays they are a very convincing alternative to very costly commercial business intelligence solutions.

For a lot of open source business intelligence tools, documentation has been a weak point in the past. Some products have a really strong community following and maybe a Wiki, which were often you main source of information. In the last 2 years this situation changed: There are now some publications available that cover some of these open source business intelligence packages, one of the very recent one being Pentaho Data Integration 4 Cookbook by María Carina Roldan and Adrián Sergio Pulvirenti.

Data Integration (ETL, short for extract, transform, load data) is one of the most important building blocks of a business intelligence solution. You can have the fanciest dashboards and reports, but if you don't have the correct data for them, they are pretty useless. ETL involves sourcing that data from a database, text file, Excel spreadsheet etc, transforming the data and outputting it to any convenient file format or database. Most ETL tools offer a graphical user interface which allows you to create data flows. The beauty of this is that there is hardly any coding involved, it is easy to understand (as you are looking at a flow diagram) and you are working quicker as well (you are not coding everything). Nowadays, the probably two most popular open source ETL tools are Pentaho Data Integration and Talend.  

Pentaho Data Integration 4 Cookbook is an ideal book for somebody who looks for a hands-down tutorial-style book. It is filled with an abundance of examples and you can download the accompanying files from the dedicated website as well. Now what does this book cover:

The first chapter is all about how to get data from databases, parameterizing SQL queries and updating/deleting records from database tables. Especially interesting is the section about using parameters with your SQL query to dynamically source your data. The authors also explain how you can generate primary key values for your dataset in Kettle in case your database table requires them.
The second chapter focuses on getting data from text files. Rich examples are provided to handle structured as well as instructed text files. Especially the section about working with various forms of unstructured text files is really informative. The last section of the chapter is all about retrieving data from Excel spreadsheets. Chapter 3 focuses on working with XML files: Kettle offers features to read, validate and generate XML documents. 
Kettle is not only an ETL tool, but also offers workflow management as well: This book gives a good overview on how to create jobs, which allow to run transformations and other tasks (e.g. send email email notifications, create folders, check if a file exists, etc) in sequential order. File management is covered next, which looks at copying, moving and deleting files and folders. Once you imported the data, you also might have to look up some additional data from other sources, so chapter 5 gives you a good overview on this topic. There is really one nifty example: It is about transforming data about museums and looking about the opening hours by using a web service. 
Up next: Learn how to split, merge and compare data streams. Chapter 7 shows you how to execute and reuse jobs and transformations. Then it's all about how you can integrate your Kettle ETL processes with the other Pentaho products (Business Intelligence Server, Report Designer, Dashboards etc). The final chapter learns you the tricks on how get the most out of Kettle.

All in all, this book is in excellent read for somebody who wants to have a quick start in open source ETL process design. The book is fully based on examples, which explained in an easy to understand language. There are many illustrations as well and at the end of each example you will find a summary and sometimes hints on where to find more info. This book is ideal for somebody who wants to get up and running with this popular open source ETL tool in a short amount of time. If you interested in this book, you can get it directly from here.

Tuesday, July 12, 2011

Metadata Driven ETL and Reporting

Metadata Driven ETL and Reporting with Pentaho Data Integration and Report Designer

Tutorial Details 

  • Software: If PDI Kettle 4.2 GA and PRD 3.8.1 GA or later are available, download them from Sourceforge, otherwise here: PDI/Kettle 4.2 latest stable version and PRD 3.8.1 latest stable version (Note: Do not use these last stable versions in a production environment)
  • Knowledge: Intermediate (To follow this tutorial you should have good knowledge of the software and hence not every single step will be described)
  • OS: Windows, Linux or Mac OS X
  • Tutorial files can be downloaded here


I would like to thank both Matt Casters (Chief Data Integration, Pentaho) and Thomas Morgner (Chief Reporting, Pentaho) for providing me answers on all the questions that I had on this interesting topic. 


This tutorial will demonstrate how you can create extremely flexible data processing and reporting pipelines by using the metadata approach.

So what does Metadata Driven mean? Imagine you source various data fields from text files or a database, transform the data and then output a final pixel perfect report with summarized data. What happens if a new field shows up in the source data and it has to be added to the data processing and reporting pipeline? In the past this meant that you had to manually add this new field to nearly every step of your pipeline (change the ETL transformation and report template directly, or better said, hard code everything). Metadata driven means that you do not directly change ETL and reporting files, but only supply the name of the fields that have to be processed (and any other metadata).

This article focuses on three very interesting features in the Pentaho BI Suite:

  • Metadata driven ETL processes with Pentaho Data Integration (Kettle)
  • Index column reports in Pentaho Report Designer (PDR) which allow extremely flexible report definitions and can utilize the metadata provided by the data query
  • Pentaho Report Designer can source data from a Kettle transformation … now this is an extremely powerful combination and opens up the door for a lot of new solutions

Now why do I think this is quite important:

  • Some projects ask for a lot of flexibility in regards to ETL and reporting. New data points should be picked up ideally automatically and pushed to the report output. Now, by using Kettle and PDR together, you can even do this for real-time-reporting.
  • Another example is template driven ETL. As a developer you can create ETL processes that you can easily reuse for a lot of projects without inventing the wheel over and over again.
  • Using MDX output in PRD was a bit difficult at times. Imagine a scenario where you had to present the daily performance of the last 10 days. In MDX you can easily use a function to display the last 10 days. The result set might have the dates as columns, which is inconvenient, as they will always change. Hence so far you had to give each date column an alias (i.e.: “date_10_days_ago”) which would not change. Using an index column in PRD is extremely convenient, because you don’t have to worry about this complexity any more
  • … and many more …

Certainly, this approach has its limitations. Not everything can just happen automatically, hence you have to consider carefully in which scenarios this approach will work well.

Creating a Metadata Driven Report

Defining the query
Open Pentaho Report Designer (PRD) and create a new report. Let’s set up a data source:

  1. Click on the Data tab
  2. Right click on Data sets and add a query (follow all the necessary steps)
  3. Click Ok 
  4. Copy your query somewhere in a text editor or so to have it handy, then delete the data source in PRD
  5. So now there shouldn’t be anything under Data Sets. Right click on Data Sets and choose Advanced > JDBC (Custom). In the settings dialog choose your database connetion details and then click OK.
  6. In the data tab, create a new parameter called Query (type: string)
  7. Click on the Structure tab and mark the Master Report
  8. Click on the Attributes tab and click on the query name value field. Insert your query here.
  9. Click on the query name formula icon (+) and insert following open formula: =[Query]

Do not drag and drop any fields onto the details band. Do not put any fields at all on the details band or the details header or footer! If you want to auto-generate the reports on the fly you have to use a scripted pre-processor to configure the fields based on the structure of the result set returned by the query and then the report design wizard generates the fields and formats for the report.

Thomas Morgner explains: 
A report-pre-processor is a specialized function that is executed after the report processing queried the data from the data-factory, but before the report's functions get initialized and the actual report-processing starts. Each master- and sub-report can have its own set of report-processors. The pre-processor is executed only once for each report-definition. At this point in the processing chain, a pre-processor has full control over the layout and data-computation aspects of the report-definition. The pre-processors are defined on the "advanced::pre-processor" attribute of the "master-report" object.
More info can be found on the Pentaho Wiki.

Defining the report pre-processor

  1. Click on the Structure tab and mark the Master Report
  2. Click on the Attributes tab and scroll down to the wizard section
  3. Set wizard-enabled to true
  4. Scroll down to the advanced section
  5. Click on the value field of pre-processor
  6. Add a Relational Report Auto-Generator and Wizard Processor

Note: You can have a look at the code of the Relational Report Auto-Generator by clicking twice on the name in the above shown dialog, then on the right hand side a "..." button will show up. Click this one, another dialog will come up, click on the value field of the script row and you will be shown the whole script. If you want, you can edit it here.

PRD includes some examples of report-pre-processors:

  • Flat output (RelationalAutoGeneratorPreProcessor): Places the fields in the detail band and the labes in the detail header
  • WizardProcessor: Extended version of the one before, can be used as well to only refresh the metadata of predefined fields
  • Most features (BSFReportPreProcessor): Use a scripting language to manipulate various settings

Again, these are only examples, you can build your own report-pre-processor.

The code for the report-pre-processor is stored as a expression-property within the report definition. The code uses the wizard-API to define the fields and then the wizard-process generates the fields. With this approach you can do whatever you want with the (GUI)-wizard itself.

In case your are aiming for really fancy reports, you can add another report-pre-precessor after the wizard processor to post-process the fields generated by the wizard.
Designing the report

  1. Expand the Structure tree and mark Details Header
  2. Click on the Attributes tab and set hide-on-canvas to false
  3. Do the same for the Details Footer
  4. Apply any format that you wish to the Details Header, Details and Details Footer (do not place any elements in these areas). I would advice to set at least font-size, text-color, background-color and height.

Note: If you want to use bands, make sure you click on the Attributes tab, scroll down to the wizard section and set generate-content-marker to true. Normally, if you define fields within the band, the wizard wouldn’t be used as the fields are explicitly set. By enabling generate-content-marker this behaviour is overwritten (the wizard removes your fields and brings its own fields).
Sample design:

Now click on the preview button and write any SQL query in the parameter field, then click Update. Your report will be generated on the fly and will look something like this:

Note: This report setup is for illustration purposes only. I strongly advice to be very careful with allowing users to directly enter queries as this is a good target for SQL injections. Somebody could just enter a DELETE statement and execute it. So make sure you have the right security in place in case you really have to use this setup.

Someone unfamiliar with Pentaho Report Designer might wonder what is metadata driven in this example? Well, normally we would define all the fields in the details header, details and details footer bands. Here we didn't define them at all, but relied on Pentaho Report Designer to make use of the query metadata and display the fields accordingly. This could be achieved by using the report pre-processer followed by the wizard functionality.

Together we are stronger: Pentaho Data Integration and Pentaho Report Designer for total flexibility

In this section I will try to highlight future potential of a metadata driven approach. Currently the tools don't have all the necessary features, hence I will point out at the end what new features should be ideally added.

Metadata driven ETL

Now that we know how to set up a metadata driven report, we can have a look at creating a metadata driven ETL process to supply data to our metadata driven report template.

PDI (Kettle) only recently received a new transformation step called ETL Metadata Injection. The idea is to have one transformation that acts as a metadata injector to the main transformation. This main transformation is a skeleton of an ETL process: All the steps are defined, but not all or most of the metadata. 

Currently not all transformation steps support metadata injection. Find below a list of steps that currently work with the ETL Metadata Injection step (status: July 2011):

  • CSV Input
  • Excel Input
  • Select Values
  • Row De-normalizer 
  • Row normalizer 

So right now, the possibilites are quite limited, but this list will be growing over time as people start submitting requests on the Pentaho JIRA website. So if you already have an idea as to which transformation step should have metadata injection support, visit the aforementioned website and submit a JIRA feature request (ideally check first if there is not already one set up).

If you use the Metadata Injector step and point it to the main transformation, it will show which steps are supported. 

As Matt Casters put it: "The golden rule is: if the step doesn't show up, it's not supporting metadata injection yet".

We will be looking at an extremely simple example (tr_metadata_injection_main.ktr):

Image in we have a dataset which looks like this one:

Our ETL process should pivot this data in this way:

Now if our original dataset contains only a static number of product names (in our case: electronics, consulting), then we can easily set up a predefined ETL process. But what if the number of product names might change in future?

So far we only had the option to amend our ETL process manually, which means we had to edit our Kettle transformation and add the new name(s) to the Row denormalizer step. This was rather inconvenient as ideally this should have been picked up automatically. The great news is, that now this is possible with the ETL Metadata Injection step.

Create following table on your database of choice (amend if necessary):


city VARCHAR(20),
continent VARCHAR(20),
product VARCHAR(20),
sales INT(255)

("city4","South America","electronics",342),
("city5","North America","consulting",1243),

Now let's create our transformation based on the screenshot below:

This transformation will query data from the table we just created. The Select values step allows use to keep only the required fields (It's not really useful in this example, but mainly here for demonstration purposes). Next the data goes into the Row denormalizer step, which basically pivots our data.

To test this transformation I suggest adding a Text file output step and filling out all the required settings in all the steps. Once this works, change the settings of the relevant steps and hops so that they look like the ones shown in the screenshot above. As you can see, we leave the Select values step completely unconfigured and for the Row denormalizer step we only define the Key and Group by fields. Everything else we will configure by using the ETL Metadata Injection step.

We create a second transformation (tr_metadata_injection_config.ktr) to supply the metadata to our main transformation:

First off we have to get the unique product names in order to pivot the data correctly. We use a Table input step therefore with following query:

DISTINCT(product) AS product

For this stream we also add some additional fields to have all the metadata complete for injection to the Row denormalizer step in our original transformation.

We also use a Data Grid step to define the fields that we want to keep for the Select values step in our original transformation. 

Now we can map our metadata with the steps of our original transformation:

  1. Double click on the ETL Metadata Injection step and the settings dialog will pop up. 
  2. Click the Browse button to choose which transformation you want to inject metadata to. 
  3. Click Ok, then double click again on the ETL Metadata Injection step and you will see a tree-like representation of your main transformation. For each step you see all the settings that can be used by the metadata injection. You do not have to provide all of them though, but definitely the essential ones. If you are unsure as to what the essential once are, then just create a separate testing transformation and test which settings are necessary for a certain step to work properly.
  4. For the settings that you want to provide metadata for, click on the Source step field in the respective row and a dialog will guide you through the mapping:
  5. As we want to use the output of our original transformation in this current transformation, we also define the Row denormalizer step as Source step to read from (optional).
The last step Results (Add constant values step) is a hack: Currently there is no way to pass on the dynamic results of the ETL metadata injection to PRD.
By using the Add constant values step, we can define the fields and field types that we expect to be return from the metadata injected transformation. In our case it will look like this:

At runtime the ETL metadata injection step will output fields of the same name and pass them to the Add constant values step. As this last step adds field that already exist in the input stream, these field names will be automatically to _1 since they are duplicates.

Note: You execute the original transformation by running the metadata transformation. So in our example, you would run tr_metadata_injection_config.ktr.
So let's test our setup: Mark the Result (Add constant values) step in tr_metadata_injection_config.ktr and click the Preview button. The output will look like this:


PDI Data Source in Pentaho Report Designer

Now that our metadata driven ETL process is running successfully, the next step is to use this data directly for the report output. In Pentaho Report Designer we can define a PDI transformation as a data source:

  1. Open Pentaho Report Designer
  2. Create a new report
  3. Click on the Data tab
  4. Right click on Data Source and choose Pentaho Data Integration
  5. In the dialog click on the (+) button to specify a new query
  6. Click on Browse to find the metadata configuration transformation we just created
  7. Mark the Result step and click Preview
  8. Click Close followed by OK
  9. Define the report-preprocessor and wizard as described in the Defining the report pre-processor
  10. Design the report as described in Designing the report
  11. Preview the report. It should now look something like this:

Room for improvement
As you can see, the only problem currently is that there is no way to pass the dynamic Kettle dataset to PRD - that is why we had to use a work-around. If a new product name shows up, we still have to change the Result (Add constant values) step in our example. Now imagine the possibilites if a future version of PRD would accept dynamic Kettle datasets: In our example we would not have to change anything. The new product name would automatically be passed on to the report output.

Let's add a new product name to our table (use your favourite SQL client):


In Kettle (Spoon), mark the ETL Metadata Injection step and click on Preview. You will see that our new product name was automatically added to the result set:

The problem now is that we have to manually at this new field name to the Result (Add constant values) step in order for PRD to be able to read it.

The main improvement should happen in PRD to allow dynamic PDI input. PRD already allows dynamic input for most other data source, so adding this feature for PDI data sources should not be a lot of work.

Working with fixed width MDX output in Pentaho Report Designer

We will now have a look at how to easily work with a fixed width MDX result set in Pentaho Report Designer. As mentioned before, sometimes you might have a result set with changing column names. Imagine we have a daily summary for the last 10 days. Each column name has the actual date. Now if I run the same query tomorrow, the column names will have changed. PRD has a feature called index column: So instead of referring to the column names, it allows you to refer to the column position.

For this example here I will reuse an existing PRD example query so that everyone can follow easily:

Changing the Report Designer settings

Before we can get started, we have to enable a special setting. Open up the Pentaho Report Designer and go to Edit > Settings > General and enable Display the index columns in the Report-Designer’s field selectors:

Defining a query

  1. Create a new report
  2. Click on the Data tab
  3. Right click on Data Source and choose OLAP > Pentaho Analysis
  4. Define the OLAP schema file: steelwheels.mondrian.xml. This file should be in the same location as this report.
  5. Mark following database on the left hand side: SampleData. This database comes with PRD, so you don't have to worry about anything here.
  6. In the dialog click on the (+) button to specify a new query (copy the query shown after this numbered list).
  7. Click Preview
  8. Click Close followed by OK
  9. Notice that apart from the standard fields the index fields are shown as well in the Data Sets overview:

    Now we can just drag and drop these index fields on the canvas as we would do with standard fields.
  10. Drag and drop the index fields in the details band.
  11. Do exactly the same for the details header. Mark all of the details header fields and choose Format > Morph > label.
  12. Mark all the index fields (in the details band and in the details header) in the Structure tree using CTRL, go to the Attributes tab and enable following attributes for query-metadata:
    data-format: true
    style-format: true
  13. For each index field in the details header, set the labels-detail-header attribute respectively (i.e.: ::column::0)
  14. If you wish, apply additional formats to the report
  15. Preview the report. It should now look something like this:

MDX to copy (this query actually comes with one of the PRD examples):
  set [TopSelection] as
  'TopCount(FILTER([Customers].[All Customers].Children,[Measures].[Sales]>0), Parameter("TopCount", NUMERIC, 10, "Number of Customers to show"), [Measures].[Sales])'
  Member [Customers].[All Customers].[Total] as 'Sum([TopSelection])'
  Member [Customers].[All Customers].[Other Customers] as '[Customers].[All Customers] - [Customers].[Total]'
select NON EMPTY {[Measures].[Sales],[Measures].[Quantity] } ON COLUMNS,
  { [TopSelection], [Customers].[All Customers].[Other Customers]} ON ROWS
from [SteelWheelsSales]
strToMember(Parameter("sLine", STRING, "[Product].[All Products].[Classic Cars]")), 
strToMember(Parameter("sMarket", STRING, "[Markets].[All Markets].[Japan]")), 
strToMember(Parameter("sYear", STRING, "[Time].[All Years].[2003]"))

As we don't know the field values upfront, we also have to make sure that extremely long values can be handled gracefully. This is out of scope of this article.

Working with dynamic MDX output in Pentaho Report Designer
Now we want to actually go a step further and use dynamic MDX results. For this scenario we will not make use of index fields, but follow the same approach as before with the PDI data source. We use exactly the same query again, but theoretically you could now use any MDX query: 

  1. Open Pentaho Report Designer
  2. Create a new report
  3. Click on the Data tab
  4. Right click on Data Source and choose OLAP > Pentaho Analysis
  5. Define the OLAP schema file: steelwheels.mondrian.xml. This file should be in the same location as this report.
  6. Mark following database on the left hand side: SampleData. This database comes with PRD, so you don't have to worry about anything here.
  7. In the dialog click on the (+) button to specify a new query (copy the query shown after this numbered list).
  8. Click Preview
  9. Click Close followed by OK
  10. Define the report-preprocessor and wizard as described in the Defining the report pre-processor
  11. Design the report as described in Designing the report
  12. Preview the report. It should now look something like this:

MDX to copy (this query actually comes with one of the PRD examples):
  set [TopSelection] as
  'TopCount(FILTER([Customers].[All Customers].Children,[Measures].[Sales]>0), Parameter("TopCount", NUMERIC, 10, "Number of Customers to show"), [Measures].[Sales])'
  Member [Customers].[All Customers].[Total] as 'Sum([TopSelection])'
  Member [Customers].[All Customers].[Other Customers] as '[Customers].[All Customers] - [Customers].[Total]'
select NON EMPTY {[Measures].[Sales],[Measures].[Quantity] } ON COLUMNS,
  { [TopSelection], [Customers].[All Customers].[Other Customers]} ON ROWS
from [SteelWheelsSales]
strToMember(Parameter("sLine", STRING, "[Product].[All Products].[Classic Cars]")), 
strToMember(Parameter("sMarket", STRING, "[Markets].[All Markets].[Japan]")), 
strToMember(Parameter("sYear", STRING, "[Time].[All Years].[2003]"))

The disadvantage of this approach is that the column headers are not particular nice to read ... but there might be a way to control this.


Exploring these advanced features of Pentaho Data Integration and Report Designer has been an extremely interesting journey. There are many more features available which couldn't be covered in this tutorial. I suggest that you take this information here as a starting point and explore all the possibilities if you have time.
Overall, if the next version of PRD brings dynamic PDI data source support, there will a huge potential available. Over the time, PDI will also add additional ETL steps for metadata injection and then our journey will even get more exciting!