Diethard Steiner on Business Intelligence: Using regular expressions with Pentah...

Friday, November 12, 2010

Using regular expressions with Pentah...

Using regular expressions with Pentaho Data Integration (Kettle)

There are quite some transformations steps that allow you to work with regular expressions: Replace in string, Regex Evaluation, Modified Java Script Value to name a few.

Today we will look at one (pretty unrealistic) example using the "Replace in string" step. Basically, we will cut the first word off a text (Note: This can be achieved by using other steps as well, but as we want to focus on regular expressions, this approach was chosen):

Our first step is of the type "Data Grid", which easily allows us to define a data set to work with. In our case, we create a data point called "line_text" and add values to it: In the Meta tab you can define your data points and in the Data tab the actual values. Hit Preview to get an idea how your data set looks like (no surprises here).

To cut the first word off our line_text values, we use a "Replace in string" step. Choose line_text as "In stream field". You can leave "Out stream field" empty in case you want the output to go to line_text (so the original value will be overwritten with the new one). If you want to keep the old value, define a new data point name here.

For this example we will use a regular expression, so make sure you set "use RegEx" to Y. We will use following expression:

^([a-zA-z]+)(\s{1})(.*)$

This expression makes use of capturing groups (defined by brackets), which will allow us to specify the part that we want to keep. What this regular expression says is basically the following: the value should start with (indicated by ^) any combination of upper or smaller letters (one or more of them indicated by +). This is followed by one space character (indicated by \s{1}) and then basically any character or none can follow. Probably we could even improve this expression, but for now it does what it is supposed to do.

In the "Replace with" field we enter $3, which basically tells Kettle that we want to keep the part that is retrieved by the 3rd capturing group.

Finally, we output the results to a text file. The output looks like this:

Some info about using regular expressions you can find here and here and here and ...

You can find the XML code for the transformation here.

If you are creating extremely long regular expressions, named capturing groups are quite useful. As the name indicates, you can assign a name to the capturing group and use this name later on in the back reference. Currently, Kettle doesn't support named capturing groups, but I am sure a future version will cover this.

11 comments:

KirillMay 16, 2012 at 3:16 AM
It was very helpful! Thank You
ReplyDelete
Replies
-Nash-January 11, 2013 at 10:02 PM
Nice 1 brother . It helped me :)
ReplyDelete
Replies
CarlMarch 29, 2013 at 2:28 PM
This seems to be exactly what I'm looking for, but without the images... it isn't super helpful.
ReplyDelete
Replies
UnknownJuly 18, 2013 at 6:48 PM
Hi,

I want to replace control characters in my data. I have tried using \\000, \\x00, [[:cntrl:]], \p{Cntrl} as the Search string in Replace in String step and Javascript step; however it does not work.

Can you please help and let me know which pattern to use? THe specific control character I am getting is NUL (\x00).

Thanks,
Vinod
ReplyDelete
Replies
italoscalvinoJuly 7, 2014 at 4:16 AM
I capture dates from a date string, but I want to replace non matching strings with empty. Can you help me with that?
Here is what I did so far. search ((19|20)\d\d)\d{0,1}[- \/.,](0{0,1}[1-9]|1[012])[- \/.,](0[1-9]|[12][0-9]|3[01]) and replace with $1-$2-$3.

The problem is that non matching strings (eg. year only) and stops the stream when converting to date.

Thank you

ita
ReplyDelete
Replies