Thursday, November 18, 2010

Pentaho Kettle Data Input: Pivoted Data

Pentaho Kettle Data Input: Pivoted Data

I found quite an interesting example about importing a pivoted data set with an unknown amount of columns on the Pentaho forum. Matt Casters demonstrated a neat solution on how to import this data set. As this example is quite hidden away in the forum, I thought it would be a good idea to publish a short tutorial based on it.

Our data set looks like this (download here) (Screenshot shows only part of the whole data set):
The data was originally retrieved from Euribor.

The "problem" with this data set is that we don't really know how many columns there are in advance (as the date is in the columns). We have to find some way to unpivot the data set. 

Matt's solution (published 2008) looks like this (download here):
Let us have a look at the steps:

Excel Input: In the content tab make sure you disable header. In the fields tab, name the first field "type" and all the other ones something like "col1"-"col366".

A tip by Matt to speed up this process:
"One missing tip I'd like to add to your nice post is that it's really tedious to enter hundreds of nearly identical lines in the data grids in Kettle. To make this easier you can use a spreadsheet to help you. Simply copy one or more lines from a Kettle table (select the lines, click right and select copy lines) into the spreadsheet (click right in the spreadsheet and select paste).
Then you can select a line in the spreadsheet and drag the area down to allow col1, col2, col3 ... col366 to be generated automatically. Finally, copy/paste the rows back into Kettle including the header row."

That should make entering large amounts of columns a matter or a few minutes, not hours
As we don't know in advance how many columns there are, the best method is to be prepared for the maximum amount of columns. Make sure that you set all columns to the format String. Also, set the decimal and grouping signs accordingly. Once you have filled out all the necessary settings, hit "Preview rows" to make sure everything is interpreted as it should be.

Filter rows: Here we check if col1 is not null, which just basically makes sure that there are any figures in the data set

rownr: We add a sequence so that we know the row number of each data row

Row Normaliser: Now we unpivot the data. We create a new grouping field called colnr to achieve this. Our data looks now like this:
No empty data: In this step we want to discard any row that has no data

Grab a date: This short JavaScript basically makes sure that each record gets the correct date assigned:

var store;

var coldate;

if (rownr == 1) {
  if (colnr==1) {
var store = new Array() // only init once!
  }
  // Store the date for the column number
  // 
  store[colnr] = str2date(data, "yyyy/MM/dd HH:mm:ss.SSS");
}

coldate = store[colnr];

Drop first data raw: Now that we have the date available for each record, we can drop the records that we retrieved from the first row of the original data set.

The last two steps basically choose which fields we want to keep and do a proper conversion to the respective data types.

2 comments:

  1. Dieter, thanks for picking this one up. I had a lot of fun coming up with this example.

    One missing tip I'd like to add to your nice post is that it's really tedious to enter hundreds of nearly identical lines in the data grids in Kettle. To make this easier you can use a spreadsheet to help you. Simply copy one or more lines from a Kettle table (select the lines, click right and select copy lines) into the spreadsheet (click right in the spreadsheet and select paste).
    Then you can select a line in the spreadsheet and drag the area down to allow col1, col2, col3 ... col366 to be generated automatically. Finally, copy/paste the rows back into Kettle including the header row.
    That should make entering large amounts of columns a matter or a few minutes, not hours.

    Cheers,
    Matt

    ReplyDelete
  2. Thanks a lot Matt for this hint! I'll add it to the tutorial.
    Best regards,
    Diethard

    ReplyDelete