Monday, January 21, 2013

Creating a federated data service with Pentaho Kettle

Creating a federated data service with Pentaho Kettle


  • Kettle (PDI) 5: download here [Not for production use]
  • You are familiar with Pentaho Kettle (PDI)
  • You are familiar with the Linux command line

What is the goal?

We have data sitting around in various disparate databases, files, etc. By creating a simple Kettle transformation which joins all these data together, we can provide a data service to various applications via a JDBC connection. This way, the application does not have to implement any logic on how to deal with all these disparate data sources, but instead only connect to the one Kettle data source. These applications can send standard SQL statements to our data service (with some restrictions), which in turn will retrieve the data from all the various disconnected data sources, join them together and return a result set.
This Kettle feature is fairly new and still in development, but it holds a lot of potential.

Configure the Kettle transformation

I created a very simple transformation which gets some stock data about lenses with prices in GBP (For simplicity sake I use a Data Grid step. In real world scenarios this would be a Database Input step). We get the current conversion rate from a web service and use this rate to convert our GBP prices to EUR. The transformation looks like this:

You can download the transformation from here.

Note the yellow database icon on the top right hand corner of the Output (Select Values) step. This indicates that this step is used as Service step. This can be configured in the Transformation Properties by pressing CTRL+T:

You also have the option to catch the service data in the local memory.

Perform a preview on the last step (named Output):

This is basically the dataset which we want to be able to query from other applications.

Configure Carte

If you don’t already have a configuration file in the PDI root directory, create one:

vi carte-config.xml

And paste this xml in there (please adjust the path to the ktr file):


Save and close.
Let’s start the server now passing the config file as the only argument:
sh carte-config.xml

Query service data from an application

Once the server has started successfully, you can access the service by any client of your choice as long as they support JDBC. Examples of clients are Mondrian, Squirrel, Pentaho Report Designer, Jaspersoft iReport, BIRT, and many many more.

For simplicity sake, we will just query the data service directly from Kettle:
  1. Click on the View tab.
  2. Right click on Database Connections and choose New Connection Wizard.
  3. Enter the following details:
DriverKettle Thin JDBC Driver (org.pentaho.di.core.jdbc.ThinDriver)

Then click the Test button. Kettle should be able to successfully connect to our data service.

Finally, click OK.

Next we just want to execute a simple SQL query. In the View tab, in Database connections, right click on the connection name you just created and choose SQL Editor and insert the following query and click execute:

SELECT * FROM lensStock WHERE price_gbp > 100

Note that the table name is the service name that we configured earlier on in the carte-config.xml.
The returned dataset will look like this:

Some other applications ask for a JDBC connection string, which looks like this:

Further Reading

You can find a lot more detailed info on the Pentaho Wiki.


  1. I've tried to find a version of kettle on and even on that has this feature built in - but no success...
    After checking out about 10 git branches just to try to find this "data service" tab I now give up with this :-(

    1. It seems that this feature is now available in the enterprise edition only.

  2. Actually this git commit has data service enabled: pentaho-kettle-72cad8596129ac407ad32b3b0adb8db936e696f7