Sunday, December 12, 2010

Be careful with running multiple step copies in Pentaho Kettle

Be careful with running multiple step copies in Pentaho Kettle

In this review we have a look at running multiple copies of one step in Kettle (Pentaho Data Integration). If your computer has more than one core, Kettle can use the power by running multiple copies of a given step in parallel. Each copy of one step uses one core. This is an extremely powerful feature, but it should be used with care.

It is important to consider that you cannot just thoughtlessly apply x amounts of copies to any given step. You always have to keep in mind what this step actually does.

I prepared here an example which will demonstrate you how you can run into troubles if you don't pay attention. Note: This is not a Kettle error, but a user error.

You can download the sample transformation here.

Scenario 1: This simple example uses a data grid input step. The data is denormalised and then joined to an additional data set and finally we create a summary.

Scenario 2: We use exactly the same process again, only now we increase the amount of copies for the denormaliser step to 3.You can change the amount of copies to run in parallel by right clicking on the step and choosing "Change number of copies to start ...". All this does is use the definition of your step and run multiple copies of this step in parallel (highlighted by x3 on the top left corner of the step).

Screenshot of our transformation:

Output of the denormaliser step scenario 1:

Output of the denormaliser step scenario 2 (running 3 copies of the denormaliser step):

Output of the Group by step scenario 1:

Output of the Group by step scenario 2:

As you can see, there is a huge difference in the total amount of additional revenue in scenario 2:

The aggregation of our original sales records works fine, but as you can see in the preview the additional sales figures are 3 times as much as they should be, which is due to the 3 copies of the "Row denormaliser 2". This is down to the fact that we forgot to aggregate the output of this step.

So why did this happen? How does it work when you run multiple copies of a step?

Basically Kettle distributes rows in a round-robin fashion to each copy of the step, so in our example (running 3 copies of one step) the first row will go to the first step, the 2nd row to the 2nd step, the 3rd row to the 3rd step, the 4th row to the 1st step and so on.

What is the correct approach to create this transformation using multiple copies?

After the denormalise step, we add a group by step to summarize the data by date:

Now the output of the Group by step looks fine:

Note: If you placed the Join step directly after the denormaliser step (set to multiple copies), Kettle would show a warning message, indicating that you have to summarize your data before the join. Sometimes you will have some additional steps between the step that your run in multiple copies and a join, hence no warning message is displayed.

After applying multiple copies to one step I strongly suggest that you make use of the preview function to analyze how your data set looks.

You can download the transformation here.

Saturday, November 27, 2010

How to Set Up Pentaho Community Build Framework

How to Set Up Pentaho Community Build Framework

Introduction

Please note that I am still writing on this article. I will update some parts later on. For some parts I have open questions that still have to be clarified. If you have any ideas/suggestions, please let me know and I happily will add it to this article.

Some sections were also copied from other sources, some of which are mentioned in the "Sources" section.

In the following chapters we will have a look at how to set up an easily maintainable environment. This article is not an introduction to Pentaho. I assume that you are familiar with the Pentaho products.

We will look at how to set up the Pentaho BI Server with the Community Build Framework (CBF).

Why would you use CBF?

Easily upgrade to new version of the BI server without going through a long setup process
Easily create environment specific versions of your BI server without going through a long setup process

Pentaho BI Server Setup with CBF

Sources:

(OS)BI Blog

Pentaho Wiki

Introducing Community Build Framework

We are in the lucky position that Pedro Alves from Webdetails offers the freely available CBF (Community Build Framework), which makes setting up and upgrading the BI Server for various environments a relative easy task. I try to talk you through the setup for the most popular operating sytems.

The main folder/file structure of CBF looks like this (simplified):

CBF

Pentaho BI Server source code folder(s): Each version of the source code will be stored in one folder. These folders are left completely untouched. Changes to the original source can be made via patches.
Project folder(s): For each project you can create a separate folder (i.e. project-nike, project-puma, ...) which holds the project specific configurations and files.

config

build.properties file, which tells ant what and where we want to build. A minimal config contains the path to your Java compiler, solutions and web server (among others).
build-env.properties (optional) where env is a placeholder for any environment. So you could name it build-prod.properties, build-test.properties etc. This is very neat as you can have global settings in the build.properties and environment specific ones in the specific build-env.properties. When we are using Ant to build our BI server, ant will load the build.properties and then use the specific build-env.properties.

patches: all files that need some kind of modification are copied here from the source code (maintaining nearly the same folder structure) and replacing constant values with tokens in form @tokenname@. The tokens are specified in the build.properties.
solution: This folder holds your reports, xactions etc. (same as Pentaho solutions folder on standard BI Server)

Target-Build: All source code files will be copied to this folder to build your BI server. All patches will be applied to this BI server.
Target-Dist: The BI Server will be copied to this folder after everything has been built
build.xml: Note that any build.xml doesn't work with all versions of the BI server. When downloading make sure you choose the version of build.xml that accepts the version of your BI server.

Note that the Pentaho BI Server source code folder(s), target-build and target-dist are shared among all projects. target-build and traget-dist are handled by ant, so you don't have to do anything about them.

Note: Currently the main folder structure has to be created mostly manually.

Create following folder structure to start with. This is common for all operating systems. We will discuss soon OS specific steps.

Pentaho

bi-server-source
project-tutorial

config
patches
solution

target-build
target-dist

The folder for the source code will be created automatically once we download the code with SVN.

In general, the below is how a folder structure could look like after completion of the setup. Ours will look slightly different.

CBF

|-- build.xml

|-- pentaho

| |-- bi-platform-api

| |-- bi-platform-appserver

| |-- bi-platform-assembly

| |-- bi-platform-assembly-open

| |-- bi-platform-build

| |-- bi-platform-engine-core

| |-- bi-platform-engine-security

| |-- bi-platform-engine-services

| |-- bi-platform-legacy

| |-- bi-platform-plugin-actions

| |-- bi-platform-plugin-services

| |-- bi-platform-repository

| |-- bi-platform-sample-data

| |-- bi-platform-sample-solution

| |-- bi-platform-scheduler

| |-- bi-platform-test-foundation

| |-- bi-platform-ui-foundation

| |-- bi-platform-util

| |-- bi-platform-web

| |-- bi-platform-web-portlet

| |-- bi-platform-web-servlet

| |-- dummy-jre

| |-- mantle

| `-- test-solutions

|-- project-client

| |-- patches

| | |-- pentaho

| | `-- target-preconfiguredinstall

| `-- solution

|-- target-build

`-- target-dist

The pentaho directory is where we will copy the a recent build of Pentaho BI Server. The idea is to leave this directory untouched and do all changes in the project-client directory. The below is an example structure, which depends on your specific project needs:

project-client/

|-- config

| |-- build.properties

| `-- build-pedro.properties

|-- patches

| |-- target-build

| | `--

| `-- target-dist

| `-- server

| |-- conf

| | |--

| | `-- jboss-service.xml

| |-- webapps

| | |-- pentaho

| | | `-- WEB-INF

| | | `-- web.xml

| `-- lib

| `-- postgresql-8.2-505.jdbc3.jar

`-- solution

|-- Portal.properties

|-- Portal.url

|-- Portal_pt.properties

|-- admin

.... etc

Pedro Alves explains:

"The idea is very simple:

All changes that would normally go to pentaho/* are placed under "patches" directory (project-client/patches/).
The CBF ant script will pick up the files in the project-client/patches/ directory, scan for tokens and replace the tokens with the variables defined inside the project-client/config/build.properties files, and copy the files to the top level directory of the entire project (In this example MyProjectDir [remark: here we change it to CBF]).
It's not recommended to patch anything under pentaho/*; sources changes are patched in to target-build/* and all other changes are made by patching the final directory, target-dist."

Windows setup

Please note that I set up everything in the directory D:\Pentaho. You might want to choose a more convenient folder for your own project.

Java, Tomcat and Ant

If you don't have Java, Tomcat and Ant running already, install them:

Java: JDK [Java Standard Edition; short Java SE] (JRE is not enough as Ant needs JDK to work properly). Read installation instructions. Install it in D:\Pentaho
Tomcat: At the time of writing, v6 was the most recent one. The location of the tomcat directory is not important; it can be placed almost any where. The location of the tomcat will be set inside of the CBF build.properties file. I installed Tomcat in D:\Pentaho
Ant: Download Apache Ant 1.8.1 and place it in D:\Pentaho. Installation notes: here.

Set environment variables

Using the GUI

Windows 7: Control Panel > System > Advanced System Settings > Click on the Environment Variables button

Set the JAVA_HOME environment variable to the directory where you installed JDK:

Click on New under the System variables section.
Type JAVA_HOME in the variable name field.
Type D:\Program Files\Java\jdk1.6.0_21 in the variable value field

Set the ANT_HOME environment variable to the directory where you installed Ant:

Click on New under the System variables section.
Type ANT_HOME in the variable name field.
Type C:\ant in the variable value field.

Set the PATH environment variable to include the directory where you installed the Ant bin directory:

Find the PATH environment variable in the list. If PATH is not listed, click on New under the System variables section.
Type %ANT_HOME%\bin;%JAVA_HOME%\bin;

Important: If there are other variables listed, create a new variable separated by a semicolon. Ensure there are no spaces before or after the semicolon.

Windows Note:

The ant.bat script makes use of three environment variables - ANT_HOME, CLASSPATH and JAVA_HOME. Ensure that ANT_HOME and JAVA_HOME variables are set, and that they do not have quotes (either ' or ") and they do not end with \ or with /. CLASSPATH should be unset or empty.

Using Command Line

Assume Ant is installed in c:\ant\. The following sets up the environment:

set ANT_HOME=c:\ant

set JAVA_HOME=c:\jdk-1.5.0.05

set PATH=%PATH%;%ANT_HOME%\bin;%JAVA_HOME%\bin;

Check Ant and Java are working

You can check the basic installation by opening a new shell and typing ant. You should get a message like this

Buildfile: build.xml does not exist!

Build failed

or something like this:

D:\Pentaho\CBF>ant

Buildfile: D:\Pentaho\CBF\build.xml

BUILD FAILED

D:\Pentaho\CBF\build.xml:7: FATAL: 'project' property not set. Please provide it

the ant command, eg: ant -Dproject=myproject -Denv=dev

So Ant works. This message is there because you need to write an individual buildfile for your project. With a ant -version you should get an output like

D:\Pentaho\CBF>ant -version

Apache Ant version 1.8.1 compiled on April 30 2010

Type in Java, hit Enter and you should get same info back:

D:\Pentaho\CBF>java

Usage: java [-options] class [args...]

(to execute a class)

or java [-options] -jar jarfile [args...]

(to execute a jar file)

where options include:

-client to select the "client" VM

-server to select the "server" VM

-hotspot is a synonym for the "client" VM [deprecate

The default VM is client.

-cp <class search path of directories and zip/jar files>

If you have problems with either Ant or Java, check that your environment variables are properly set.

Creating the .ant directory

For our work we will need an optional ant task that is not available with the default set up. In your ant directory you find a file called build.xml.

This is an Ant script that you can run to install almost all the dependencies the optional Ant tasks need.

To do so, change to the ANT_HOME directory and execute the command:

ant -f fetch.xml -Ddest=user

You should see then something like this (extract):

D:\Pentaho\apache-ant-1.8.1>ant -f fetch.xml -Ddest=user

Buildfile: D:\Pentaho\apache-ant-1.8.1\fetch.xml

pick-dest:

[echo] Downloading to C:\Documents and Settings\diethard.steiner\.ant\lib

probe-m2:

download-m2:

[echo] Downloading to C:\Documents and Settings\diethard.steiner\.ant\lib

[mkdir] Created dir: C:\Documents and Settings\diethard.steiner\.ant\lib

[get] Getting: http://ibiblio.org/maven2//org/apache/maven/maven-artifact-

ant/2.0.4/maven-artifact-ant-2.0.4-dep.jar

[get] To: C:\Documents and Settings\diethard.steiner\.ant\lib\maven-artifa

ct-ant-2.0.4-dep.jar

In general, options for Ddest are:

system - store in Ant's lib directory
user - store in the user's home directory
optional - store in Ant's source code lib/optional directory, used if building Ant source code

If you go to your user directory now, you should see the .ant folder.

XMLTask.jar

Download XMLTask.jar from here and move the jar file to the .ant\lib\ folder.

Subversion

Install on of the distributions. I provide here a list of options, choose one that suits you:

CollabNet Subversion Edge 1.2.2
http://www.sliksvn.com/en/download and a client like http://tortoisesvn.net/downloads or subclipse (an Eclipse plugin). For some additional info have a look here: http://www.codinghorror.com/blog/2008/04/setting-up-subversion-on-windows.html
Hudson is also an option as it includes subversion
Or download http://www.cygwin.com/ and issue the Linux command.

Database

Let's get started with Hypersonic DB, let on you can change to your own choice of DB.

Download HSQL 2.0 http://sourceforge.net/projects/hsqldb/files/ and save it in a convenient folder
Start the DB Server by double clicking on runServer.bat (in D:\Pentaho\hsqldb-2.0.0\hsqldb\bin)

CBF Build XML

Create the directory D:\Pentaho\CBF. Download CBF's build.xml from the Wiki Page and rename it to build.xml. Save it in Create the directory D:\Pentaho\CBF

Download a BI Server Build

Open the command prompt (DOS Client) and go to this folder. Change to D:\Pentaho\CBF\bi-server-source\ and issue following command to download (check out) BI Server 4:

svn co svn://source.pentaho.org/svnroot/bi-platform-v2/branches/4.0

If you are under a proxy you can use http:// instead of svn://

svn co http://source.pentaho.org/svnroot/bi-platform-v2/branches/4.0

You will see if the command is working once you don't get an error message and it starts adding a lot of files (A at the beginning of each line).

If you are not sure which version is the latest BI Server, just type this URL in your web browser:

http://source.pentaho.org/svnroot/bi-platform-v2/branches/

Create build.properties

An example of a build.properties file you can find here. Make sure that for Windows you use double backslashes in your paths!

#####################################

## GENERIC PROPERTIES

######################################

javac.path = D:\\Program Files\\Java\\jdk1.6.0_21\\bin\\

# Solution path. Required for compile time

solution.path = D:\\Pentaho\\CBF\\project-tutorial\\solution\\

# Solution path required for runtime. Defaults to the above value but can be different if we're

# deploying to another place. Should be used in WEB-INF/web.xml in the solution-path parameter

solution.deploy.path = D:\\Pentaho\\CBF\\project-tutorial\\solution\\

#accepted values: tomcat or jboss

server.name = tomcat

# Tomcat 5.5 path:

tomcat.path = D:\\Pentaho\\apache-tomcat-6.0.29\\

# Copy the sample solutions to our project directory? true | false

copy.pentaho.samples = true

BASE_URL = put your URL here

# Java options for the run command

java.opts = -Xmx512m -XX:MaxPermSize=512m -Xrunjdwp:transport=dt_socket,address=8765,server=y,suspend=n

#####################################

## PROJECT DIRECTORIES - The defaults usually work fine

######################################

pentaho.dir = bi-server-source\\3.5\\

pentaho.build.dir = target-build\\

#####################################

## DEPLOY OPTIONS

######################################

deploy.mode = rsync

deploy.args = -avz --exclude '.svn/' --exclude '*.log' --exclude 'work/' --exclude 'temp_user/' --exclude 'temp/' --exclude 'tmp/'

deploy.dest.server = user@host:path

deploy.dest.solution = user@host:path

Create patches

We will only go through one example here. Please prepare all other files that need changes in a similar fashion.

Example:

Copy web.xml from pentaho/bi-platform-appserver/webapps/pentaho/WEB-INF/web.xml to project-client/patches/target-dist/server/webapps/pentaho/WEB-INF/web.xml

Inside of this file, tokens can be placed that will be set by the CBF ant script. See the example snippet of web.xml bellow.

<context-param>

<param-name>solution-path</param-name>

<param-value>@solution.deploy.path@</param-value>

</context-param>

<context-param>

<param-name>base-url</param-name>

<param-value>@BASE_URL@</param-value>

</context-param>

So basically, the files stays exactly the same, you only replace some values by tokens (highlighted in red).

The tokens @solution.deploy.path@ and @BASE_URL@ are defined in the project-client/config/build.properties or project-client/config/build-client.properties files and will be replaced by the CBF ant script and the new revised web.xml with the replaced tokens will be copied to the top level directory (In this example CBF).

Start Hypersonic DB

Go to D:\Pentaho\hsqldb-2.0.0\hsqldb\bin and start runServer.bat.

Create the build

In your command line tool go to the CBF folder and issue ant -Dproject=tutorial -p, which will show you all the parameters that you can pass to the build.xml. Find an extract below:

D:\Pentaho\CBF>ant -Dproject=tutorial -p

Buildfile: D:\Pentaho\CBF\build.xml

[echo] --------------------------------------------------------------

[echo] ------ CBF - Community Build Framework -------

[echo] ------ Version: 3.5.0 -------

[echo] ------ Author: Pedro Alves (pedro.alves@webdetails.pt) -------

[echo] --------------------------------------------------------------

[echo]

[echo] SETTINGS:

[echo]

[echo] Project: ukint

[echo] Environment: build.properties

[echo] Config File: project-tutorial/config/build.properties

[echo] Server: tomcat

[echo] --------------------------------------------------------------

Pentaho BI Platform build helper..

Main targets:

all Compiles and builds the entire project

clean Cleans the solution

copy-finish Copy target files

copy-init Copy project files

Now issue the following:

ant -Dproject=tutorial -Denv=dev dist-clean all run

The command will first load build.properties and then because of the -Denv=dev, the command will load build-dev.properties. As mentioned, you don't have to have a build-env.properties file, it's optional. If you don't you don't have this file, you just run the following:

ant -Dproject=tutorial dist-clean all run

dist-clean will delete any previous buids.

Ubuntu (Linux) Setup

Open terminal (Applications > Accessories > Terminal).
Issue javac -version to see if you have a recent JDK installed.

The next section describes the setup for users who want to avoid working with command line as much as possible. Users familiar with the Terminal, please jump to "Using Terminal".

Using Applications > Ubuntu Software Center

For users that are not familiar with the command line interface, this is the way to go ... (although we still have to use the Terminal a bit)

In Ubuntu Software Center search for "openjdk-6-jdk" and click to install it [OPEN]
search for Ant, "Java based built tool like Make, Ant" will show up. Click the install button.
Search for subversion. "Advanced version control system, subversion" will shop up. Click the install button. This package includes the subversion client (svn), tools to create a Subversion repository (svnadmin) and to make a repository available over a network (svnserve). The fastest way now to progress is to use the Terminal. Follow these instructions, skip the first step as we already installed subversion. Your svn repository should then reside in /usr/local/svn/repos.
Search for subversion and choose one of the clients, like Subcommander
Search for Tomcat. At the time of this writing, Tomcat 6 was the current version. Install it.
Download a hypersonic database from here. Go to Place > Home Folder, click File > Create Folder and name it "Pentaho". Unzip the HSQLDB file in the Downloads folder and move the unzipped folder to the recently created "Pentaho" folder. [/home/diethardsteiner/Pentaho]
You can also download MySQL if you want (some install info you can find here). We will not cover setting up the environment with MySQL here, but you can later on progress to include MySQL in your environment.
Still being in the "Pentaho" folder, go to File > Create Folder and name it "CBF". Download CBF's build.xml form the CBF Wiki page, extract it and move it to the recently created "CBF" folder. Mark the file, hit F2 and rename it to build.xml. [/home/diethardsteiner/Pentaho/CBF]

Using Terminal

Note: If you followed the instructions of "Using Applications > Ubuntu Software Center", then you can ignore this section.

Install Java: sudo apt-get install openjdk-6-jdk

Install ant: sudo apt-get install ant
Install subversion: sudo apt-get install subversion. A good documentation on how to set up subversion on Ubuntu can be found here. Follow these instructions. Your svn repository should then reside in /usr/local/svn/repos. Additional info can be found here.
Install Tomcat: sudo apt-get install tomcat6
Download a hypersonic database from here. Create a folder: mkdir $Home/diethardsteiner/Pentaho. Move the folder in this directory.

Create the .ant Directory

Ant is located in usr/share/ant. Follow the instructions mentioned in the Windows section: click here.

XMLTask.jar

Download XMLTask.jar from here and move the jar file to /home/<username>/.ant/lib/. ".ant" is a hidden folder, hence you normally wont see it if you go to Places > Home Folder. Press "CTRL+H" to temporarily see the hidden folders or go to Edit > Preferences and in the "View" tab tag "Show hidden and backup files" to permanently see the hidden files.

CBF Build XML

See here

Download a BI Server Build

See here

Create build.properties

See here

Create patches

See here

Mac OS X Setup

I won't go through the procedure here again, just some special notes:

Subversion is included in Mac OS X Leopard and Snow Leopard. Have a look at this tutorial on how to get it running.

Mac specific SVN Clients: Cornerstone and SCPluging

Kettle: Handling Dates with Regular Expression

Kettle: Handling Dates with Regular Expression

This is the second tutorial that focuses on using regular expressions with Pentaho Kettle (PDI). This is again a demonstration on how powerful regular expression acutally are.

In this example we are dealing with a simplified data set containing date values. The problem is, that the date doesn't have a standard format. We have following values:

date

2010/1/2

2010/01/2

2010/1/02

2010/01/02

20100102

We assume for now, that all follow at least this basic standard: year, month, day. Now, we need to somehow generate a standard date in the format yyyy-MM-dd using the value from the date field as input.

Regular expressions are of much help here, as we can use the concept of capturing groups. Capturing groups will allow us to retrieve the year, month and day parts of the date easily, then we build together our final date (we want it in the yyyy-MM-dd format). Now let's see how this is done:

For our convenience, we use the data grid step to store our dummy dates for testing:

Now we are ready to go ahead and work on our regular expression. Insert the Regex Evaluation step from the Scripting folder and configure it as shown in the screenshot below:

On the content tab make sure that you select "Permit whitespace and comments in pattern". Now let's have a look at the regular expression:

The string must start with 4 numbers. We enclose the definition by brackets to create the first capturing group. Note: I added #1. This is a comment and helps to mark the capturing groups for easy reference.

Next we say that a dash can follow or not. This is our 2nd capturing group.

I guess you get the idea for the remaining capturing groups. In the end we make sure that nothing else follows, hence we use the dollar sign.

Once we have created our capturing groups, we can reference them in the "Capture Group Fields". First make sure that "Create fields from capture" is activated. Then fill out the fields below as shown.

This step will basically get the year, month and day parts.

Now we have to build the date together. We can use the formula step therefore:

First we check if a value exists, if not, we set the field to 0s. In case a value exists, we check if the month and day part have a leading zero, if not, we add it. Finally, we build the whole date string together.

Now, it is not a very good approach to save non standard 0000-00-00 dates in the database (i.e.), hence we use the "Null if ..." step to set records with 0000-00-00 date to null:

The final step left to do is to convert the string to a proper date. You can do this with the "Select values" step. Go to the Meta-data tab and fill it out as shown below:

Save the transformation and run it. Now let's have a look at the output:

You can see that it is quite simple to handle non standardized data with regular expressions. You can download the transformation here.

Friday, November 26, 2010

Review "Pentaho Kettle Solutions"

A short review of the "Pentaho Kettle Solutions" book

Matt Casters, Roland Bouman and Jos van Dongen's Kettle bible was released about 3 months ago and I finally managed to finish reading it (600+ pages!!!).

This is the second book about Kettle: At the beginning of this year María Carina Roldán released "Pentaho 3.2 Data Integration: Beginner's Guide" (my review here), which is written in quite a hands-on/step by step approach. "Pentaho Kettle Solutions" is more a theoretical piece of work, mixed with a lot of examples (jobs and transformation that you can download from the accompanying website). It takes the bigger context and tries to map Kettle features to industry standards (namely Ralph Kimball's subsystem). So instead of saying: Kettle can do this, the book speaks about the data warehousing requirements and identifies Kettle steps that meet these requirements. Important topics like creating star schemas and loading a data warehouse are explained. The next chapters talk you through working with OLAP data, the ETL development lifecycle, versioning, auditing, linage, performance tuning, parallelization, clustering and partitioning, .... it explains how to access all the various data sources (also including examples of SOAP, RSS, etc), how to create your own Kettle plugin and so much more that I won't be able to fit it all in a short review. It definitely is the ultimate resource for anything to do with Kettle.

Now, which book about Kettle should you get?

Answer 1: Both. These books are very different in what they are trying to bring across. They are not really overlapping, so it makes sense reading both.
Answer 2: If you want to have a quick start in a practical step by step fashion, get "Pentaho 3.2 Data Integration: Beginner's Guide"
Answer 3: If you want to understand the bigger picture, then go for "Pentaho Kettle Solutions".

Overall, I have to say "THANKS" to the authors for providing such an extensive book about Kettle! It's been a pleasure to read and it's extremely informative! There is only one thing left to say: Go out and get it ;)

P.S.: Roland Bouman and Jos van Dongen are also the authors of "Pentaho Solutions", another excellent book that you want to have in your library.

Thursday, November 18, 2010

Pentaho Kettle Data Input: Pivoted Data

Pentaho Kettle Data Input: Pivoted Data

I found quite an interesting example about importing a pivoted data set with an unknown amount of columns on the Pentaho forum. Matt Casters demonstrated a neat solution on how to import this data set. As this example is quite hidden away in the forum, I thought it would be a good idea to publish a short tutorial based on it.

Our data set looks like this (download here) (Screenshot shows only part of the whole data set):

The data was originally retrieved from Euribor.

The "problem" with this data set is that we don't really know how many columns there are in advance (as the date is in the columns). We have to find some way to unpivot the data set.

Matt's solution (published 2008) looks like this (download here):

Let us have a look at the steps:

Excel Input: In the content tab make sure you disable header. In the fields tab, name the first field "type" and all the other ones something like "col1"-"col366".

A tip by Matt to speed up this process:

"One missing tip I'd like to add to your nice post is that it's really tedious to enter hundreds of nearly identical lines in the data grids in Kettle. To make this easier you can use a spreadsheet to help you. Simply copy one or more lines from a Kettle table (select the lines, click right and select copy lines) into the spreadsheet (click right in the spreadsheet and select paste).

Then you can select a line in the spreadsheet and drag the area down to allow col1, col2, col3 ... col366 to be generated automatically. Finally, copy/paste the rows back into Kettle including the header row."

That should make entering large amounts of columns a matter or a few minutes, not hours

As we don't know in advance how many columns there are, the best method is to be prepared for the maximum amount of columns. Make sure that you set all columns to the format String. Also, set the decimal and grouping signs accordingly. Once you have filled out all the necessary settings, hit "Preview rows" to make sure everything is interpreted as it should be.

Filter rows: Here we check if col1 is not null, which just basically makes sure that there are any figures in the data set

rownr: We add a sequence so that we know the row number of each data row

Row Normaliser: Now we unpivot the data. We create a new grouping field called colnr to achieve this. Our data looks now like this:

No empty data: In this step we want to discard any row that has no data

Grab a date: This short JavaScript basically makes sure that each record gets the correct date assigned:

var store;

var coldate;

if (rownr == 1) {

  if (colnr==1) {

var store = new Array() // only init once!

  }

  // Store the date for the column number

  //

  store[colnr] = str2date(data, "yyyy/MM/dd HH:mm:ss.SSS");

}

coldate = store[colnr];

Drop first data raw: Now that we have the date available for each record, we can drop the records that we retrieved from the first row of the original data set.

The last two steps basically choose which fields we want to keep and do a proper conversion to the respective data types.

Friday, November 12, 2010

Using regular expressions with Pentah...

Using regular expressions with Pentaho Data Integration (Kettle)

There are quite some transformations steps that allow you to work with regular expressions: Replace in string, Regex Evaluation, Modified Java Script Value to name a few.

Today we will look at one (pretty unrealistic) example using the "Replace in string" step. Basically, we will cut the first word off a text (Note: This can be achieved by using other steps as well, but as we want to focus on regular expressions, this approach was chosen):

Our first step is of the type "Data Grid", which easily allows us to define a data set to work with. In our case, we create a data point called "line_text" and add values to it: In the Meta tab you can define your data points and in the Data tab the actual values. Hit Preview to get an idea how your data set looks like (no surprises here).

To cut the first word off our line_text values, we use a "Replace in string" step. Choose line_text as "In stream field". You can leave "Out stream field" empty in case you want the output to go to line_text (so the original value will be overwritten with the new one). If you want to keep the old value, define a new data point name here.

For this example we will use a regular expression, so make sure you set "use RegEx" to Y. We will use following expression:

^([a-zA-z]+)(\s{1})(.*)$

This expression makes use of capturing groups (defined by brackets), which will allow us to specify the part that we want to keep. What this regular expression says is basically the following: the value should start with (indicated by ^) any combination of upper or smaller letters (one or more of them indicated by +). This is followed by one space character (indicated by \s{1}) and then basically any character or none can follow. Probably we could even improve this expression, but for now it does what it is supposed to do.

In the "Replace with" field we enter $3, which basically tells Kettle that we want to keep the part that is retrieved by the 3rd capturing group.

Finally, we output the results to a text file. The output looks like this:

Some info about using regular expressions you can find here and here and here and ...

You can find the XML code for the transformation here.

If you are creating extremely long regular expressions, named capturing groups are quite useful. As the name indicates, you can assign a name to the capturing group and use this name later on in the back reference. Currently, Kettle doesn't support named capturing groups, but I am sure a future version will cover this.

Monday, November 1, 2010

PDI Kettle Plugins

Pentaho Data Integration Plugins

Agile BI

This extremly useful plugin can be downloaded from the Pentaho website (plugin for PDI 4, plugin for PDI 4.1 RC1). [Due to the fact that this plugin is not open source, PDI doesn't have it installed by default. Pentaho was so kind to make it available for the community version for free.]

Unzip the file into the data-integration/plugins/spoon directory. It will create a folder in there named agile-bi. Start spoon and the new capabilities will automatically be available.
Once you have done this, fire up Spoon, create a transformation which outputs the data to a completely denormalized table.
Once there is data in this table, right click on the table output step, choose Model.
In the model view, you can click the "Auto populates model with default dimensions and measures" icon. If this doesn't do a decent job generating your model, you can always change it manually.
Once you have properly prepared your model, save it and return to the data integration perspective.
Right click on the table output step again and choose Visualize > Analyzer. In this perspective you can fully dig into your data and discover any problems. For example, my data set has a country data point. The values are supposed to be full country names, but I realize that somehow in my data "AR" shows up instead of "Argentina". So I can go back to the data integration perspective, do the necessary changes to the transformation, save it, run it again, go back to the Analyzer, refresh the data and I can see that now all my country values are valid. This is an absolute time saver and very efficient approach to quality checking your data.

There are a couple of outer things you can do with this plugin as well, i.e. create a report with the wizard known from the Report Designer.

Kettle Franchising Factory

This is a very interesting project to give you a framework for ETL development. The project description reads as follows:

The Kettle Franchising Factory (KFF) adds on top of the existing kettle platform the necessary tools to open multiple data integration restaurants in a rapid, flexible and organised way. KFF allows you to deploy a large series of data integration solutions (multi-customer, multi-solution) in a fully standardized way.

KFF is composed of:
Kettle plugins
re-usable transformations/jobs
logging/scheduling framework
standards
naming conventions
best practices for set-up
directory structures

I hope that we see further development on it. You can find a presentation about it here and download it here.

Kettle Cookbook

Another very promising project initiated by Roland Bouman (the co-author of the excellent Pentahos Solution books): This is not really a plugin, but a job that auto generates a documentation based on the description you added to your steps, jobs, etc. So there are no excuses any more not to create a documentation! Have look here for more info.

Pentaho Report Output Step

Matt Casters made this step available: It allows you to pass data points to a PRPT (Pentaho Report). You can specify where the report template is located, to which directory and in which format the report should be outputted and also specify report parameters. You can use this in simple scenarios where you just want to output a single report and an more complex fashion for report bursting i.e..

Please find more information about it here.

Excel 2007 XLSX Output Step

Slawo was so kind to provide this step which will be of much use if you are mainly working with newer versions of MS Office. You can find more info here.

As you see, plugins can add some interesting features to Kettle which facilitate our work enormously!

Tuesday, September 28, 2010

Mondrian MDX and Schema Validation Difference PDR and Schema Workbench

PDR 3.6.1 doesn't seem to behave the same way in processing MDX queries and XML Schemas than other tools (Schema Workbench, JPivot, etc).

In example, CurrentDateMember([Date.Weekly Calendar], '[yyyy]\.[ww]').Lag(3.0) works perfectly in Schema Workbench, but not in PDR. PDR requires a more accurate writing style: CurrentDateMember([Date.Weekly Calendar], '["Date.Weekly Calendar"]\.[yyyy]\.[ww]').Lag(3.0). So whereas Schema Workbench and JPivot can figure out where to find year and week, PDR needs the exact path.

Another example: The formula of a calculated member looks like this in the Schema: [Subscription Base Revenue Share Deals (Actual)]/[Subscription Base (Actual)]. If you use JPivot or Schema Workbench, everything will work perfectly. But PDR requires the formula to be like this: [Measures].[Subscription Base Revenue Share Deals (Actual)]/[Measures].[Subscription Base (Actual)].

For both example, the syntax that PDR requires is the really accurate one. I would really appreciate if Schema Workbench (version 3.2.0.13661) would have the same strict validation as well. This would avoid a lot of confusion as to why we get so many more error messages in PDR.

When discussing this topic on the Mondrian developer mailing list, Julian Hyde commented the following:
"It looks like PRD is using mondrian to validate formulas. I suspect that it is an earlier version of Mondrian, which had weaker validation rules. I don't recall why we made the change, but people will log bugs that MDX succeeds in SSAS and fails in mondrian, and we will (rightly) change mondrian.

Qualifying members with their dimension & hierarchy name is recommended. Mondrian can resolve members faster if you do.

We can't give an error if people don't qualify member names. But should we emit a warning if someone writes [Store Sales] / [Store Cost] in a formula? I don't have a strong opinion either way."

As Thomas pointed out in the comment below, have a look at the mondrian.properties file located in the PDR folder report-designer\resources. You can find various settings there like this one:

mondrian.olap.elements.NeedDimensionPrefix=true

This seems to be the one that stopped my "not so accurate" MDX queries to run. I do not recommend changing this setting though, but advise to write precise MDX queries and make sure that the calculated members in your Schema have to complete reference as well.

UPDATE 2010/10/04:

Make sure the PDR mondrian.properties file is set up the same way as the BI Server and Schema Workbench ones.

Open prd-ce-3.6.1-stable\report-designer\resources\mondrian.properties and look for the settings shown below (these ones, especially the first one, will have a major impact):

###############################################################################

# Property determines if elements of dimension (levels, hierarchies, members)

# need to be prefixed with dimension name in MDX query.

# For example when the property is true, the following queries

# will error out. The same queries will work when this property

# is set to false.

# * select {[M]} on 0 from sales

# * select {[USA]} on 0 from sales

# * select {[USA].[CA].[Santa Monica]} on 0 from sales

# When the property is set to true, any query where elements are

# prefixed with dimension name as below will work

# * select {[Gender].[F]} on 0 from sales

# * select {[Customers].[Santa Monica]} on 0 from sales

# Please note that this property does not govern the behaviour where in

# * [Gender].[M]

# is resolved into a fully qualified

# * [Gender].[All Gender].[M]

# In a scenario where the schema is very large and dimensions have large

# number of members a MDX query that has a invalid member in it will cause

# mondrian to to go through all the dimensions, levels, hierarchies, members

# and properties trying to resolve the element name. This behaviour consumes

# considerable time and resources on the server. Setting this property to

# true will make it fail fast in a scenario where it is desirable

mondrian.olap.elements.NeedDimensionPrefix=true

Don't change this one: It's important that your Schema and MDX has properly defined syntax.

###############################################################################

# Boolean property indicating whether errors related to non-existent members

# should be ignored during schema load. If so, the non-existent member is

# treated as a null member.

mondrian.rolap.ignoreInvalidMembers=true

Change to: mondrian.rolap.ignoreInvalidMembers=false

###############################################################################

# Boolean property indicating whether errors related to non-existent members

# should be ignored during query validation. If so, the non-existent member is

# treated as a null member.

mondrian.rolap.ignoreInvalidMembersDuringQuery=true

Change to: mondrian.rolap.ignoreInvalidMembersDuringQuery=false

Now open psw-ce-3.2.0.13661\schema-workbench\mondrian\properties and add the above highlighted properties from the PRD properties file (if these properties already exist, amend them so that they are exactly set the same way).

In my example mondrian.rolap.ignoreInvalidMembers was already in the Schema Workbench mondrian.properties file and properly set to "false" (so no changes necessary). I also added the two other properties.