Pentaho Kettle Data Input: Pivoted Data
I found quite an interesting example on the Pentaho forum about importing a pivoted data set with an unknown number of columns. Matt Casters demonstrated a neat solution for importing this data set. As this example is quite hidden away in the forum, I thought it would be a good idea to publish a short tutorial based on it.
The "problem" with this data set is that we don't know in advance how many columns there are (as the dates are in the columns). We have to find some way to unpivot the data set.
Let us have a look at the steps:
Excel Input: In the Content tab, make sure you disable the header. In the Fields tab, name the first field "type" and all the other ones something like "col1" to "col366".
A tip by Matt to speed up this process:
"One missing tip I'd like to add to your nice post is that it's really tedious to enter hundreds of nearly identical lines in the data grids in Kettle. To make this easier you can use a spreadsheet to help you. Simply copy one or more lines from a Kettle table (select the lines, click right and select copy lines) into the spreadsheet (click right in the spreadsheet and select paste).
Then you can select a line in the spreadsheet and drag the area down to allow col1, col2, col3 ... col366 to be generated automatically. Finally, copy/paste the rows back into Kettle including the header row."
That should make entering large amounts of columns a matter of a few minutes, not hours.
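If you prefer a script to the spreadsheet trick, the field list can also be generated programmatically and pasted into Kettle's grid. This is a minimal sketch; the tab-separated "name, type" layout is an assumption here, so adjust it to whatever columns your Fields tab actually shows:

```javascript
// Generate the repetitive field rows ("col1" .. "colN") as
// tab-separated lines that can be pasted into a grid.
// The "name\ttype" layout is an assumption for illustration.
function fieldRows(n) {
  var rows = [];
  for (var i = 1; i <= n; i++) {
    rows.push("col" + i + "\tString");
  }
  return rows;
}
```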
As we don't know in advance how many columns there are, the best method is to be prepared for the maximum amount of columns. Make sure that you set all columns to the format String. Also, set the decimal and grouping signs accordingly. Once you have filled out all the necessary settings, hit "Preview rows" to make sure everything is interpreted as it should be.
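To see why the decimal and grouping signs matter, here is a rough sketch in plain JavaScript of the kind of String-to-number conversion Kettle performs with those settings. The European-style signs ("." grouping, "," decimal) are an assumption for the example; use whatever signs your data actually contains:

```javascript
// Sketch of converting a localized number string to a number,
// given explicit grouping and decimal signs (assumed here to be
// "." and "," -- adjust to your data).
function parseLocalizedNumber(s, grouping, decimal) {
  // Drop every grouping sign, then swap the decimal sign for ".".
  var normalized = s.split(grouping).join("").replace(decimal, ".");
  return parseFloat(normalized);
}
```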
Filter rows: Here we check that col1 is not null, which makes sure that the row actually contains figures.
rownr: We add a sequence so that we know the row number of each data row.
Row Normaliser: Now we unpivot the data. We create a new grouping field called colnr to achieve this. Our data now looks like this:
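To illustrate what the Row Normaliser does here, the following is a minimal sketch in plain JavaScript (outside Kettle): one wide row with col1..colN becomes one narrow row per column, keyed by the new colnr field. The field names mirror the ones used in the tutorial:

```javascript
// Unpivot one wide row {type, rownr, col1..colN} into narrow rows
// {type, rownr, colnr, data} -- a plain-JS sketch of the
// Row Normaliser step's effect.
function normalise(row) {
  var out = [];
  for (var key in row) {
    var match = /^col(\d+)$/.exec(key);
    if (match) {
      out.push({
        type: row.type,
        rownr: row.rownr,
        colnr: parseInt(match[1], 10),
        data: row[key]
      });
    }
  }
  return out;
}
```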
No empty data: In this step we discard any row that has no data.
Grab a date: This short JavaScript step makes sure that each record gets the correct date assigned:
var store;
var coldate;

if (rownr == 1) {
    if (colnr == 1) {
        store = new Array(); // only initialize once, on the very first field
    }
    // Store the date for this column number
    store[colnr] = str2date(data, "yyyy/MM/dd HH:mm:ss.SSS");
}
coldate = store[colnr];
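The logic above can be sketched as a small, testable function outside Kettle. Since str2date() is a Kettle helper, this sketch simply keeps the date strings as-is; the point is how the first data row fills the store and how every later row reads its column's date back out:

```javascript
// Plain-JS sketch of the "Grab a date" logic. The store persists
// across rows; row 1 holds the dates, later rows hold the figures.
var store = [];

function grabDate(rownr, colnr, data) {
  if (rownr === 1) {
    // First data row: remember the date for this column number.
    store[colnr] = data;
  }
  // Every row gets the date stored for its column number.
  return store[colnr];
}
```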
Drop first data row: Now that we have the date available for each record, we can drop the records that came from the first row of the original data set.
The last two steps choose which fields we want to keep and convert them to their proper data types.
Dieter, thanks for picking this one up. I had a lot of fun coming up with this example.
Cheers,
Matt
Thanks a lot Matt for this hint! I'll add it to the tutorial.
Best regards,
Diethard