Sunday, June 13, 2010

PDI: Clone your data in Kettle! (Video Tutorial)

In this tutorial we look at the cloning feature of Pentaho Data Integration/Kettle and use the power of the JavaScript step. Please watch the video here:

Part 1:

Part 2:

Part 3:

The files for this tutorial: Download the budget data here and the PDI/Kettle file here.

You can find the video's also on YouTube:

Video Part 1: click here
Video Part 2: click here
Video Part 3: click here

Please find below some more information:

Code for 1st JavaScript step:

function daysInMonth(iMonth, iYear)
return 32 - new Date(iYear, iMonth, 32).getDate();

var days_in_month=daysInMonth(month,year);
var days_loop_nr=days_in_month-1;

var year=2010;

var bdgt_unsubs=bdgt_sub_base_start+bdgt_new_subs-bdgt_sub_base_end;

I found a nifty function (”daysInMonth”) on the internet that calculates the days of a particular month (we are neither JavaScript heros nor do we have time). This function provides us with the required number of days in a month, which we assign to the days_in_month variable. Important thing to remember is that we already have one row of data for each month, hence we have to deduct 1 from the days_in_month. We store this value in the “days_loop_nr” variable which we will use for the clone step.

We also add a default year and calculate how many users unsubscribed from our service (”bdgt_unsubs”). We need this figure as we want to calculate later on a daily figure for the subscription base (users subscribe and unsubscribe all the time).

Code for the 2nd JavaScript step:

var day_of_month;
// used to hold value of previous row:
var year_month_old;
var bdgt_sub_base_start_old;
var bdgt_new_subs_daily_old;
var bdgt_unsubs_daily_old;

//define day of month
var year_month=(year+"")+(month+"");

else {

//calculate the daily sub base at the start of the day
if(month==1 && day_of_month==1){
var bdgt_sub_base_start_daily=bdgt_sub_base_start;
else {
var bdgt_sub_base_start_daily=bdgt_sub_base_start_daily_old+bdgt_new_subs_daily_old-bdgt_unsubs_daily_old;

//keep value of current fields for next iteration
//assign them to a new variable so that they are available in the next iteration
//it is important that you only save the new value of year_month
//here as you want to work with the old value in the if statement

We declare some variables first. Then we create the day number for each row: We use an if condition therefore. We have to start with 1 when a new month begins and then increase the number by 1.

It is important to understand the condition:

For the very first row of our dataset, the variable year_month_old won’t have a value. Hence, if(year_month!=year_month_old) will be true.

At the very end of the script we create a new variable call year_month_old which stores the current value of year_month for the next iteration. We only create this variable at the end of the script, so that we can use it in the next iteration for the condition before we overwrite it again.

A variable will be only overwritten if a new value is available.

So imagine we are in the second row of our dataset, the if(year_month!=year_month_old) condition will look like this: if(201001!=201001), hence the script will increase the value of the day by 1.  Once the month changes (i.e. if(201002!=201001)) then we have to start with 1 again.

That’s quite a nice solution. Again the javascript code might not be the best one, but it works. Please feel free to improve it.

We also have figures that we cannot just divide by the number of days in a month. The subscription base has to be calculated in a more complex way (well, not that complex). The calculation uses the same concept as above: saving the field value of the previous row in a new variable so that you can use it again in the current row.

No comments:

Post a Comment