Friday, November 12, 2010

Using regular expressions with Pentah...

Using regular expressions with Pentaho Data Integration (Kettle)

There are quite some transformations steps that allow you to work with regular expressions: Replace in string, Regex Evaluation, Modified Java Script Value to name a few. 

Today we will look at one (pretty unrealistic) example using the "Replace in string" step. Basically, we will cut the first word off a text (Note: This can be achieved by using other steps as well, but as we want to focus on regular expressions, this approach was chosen):

Our first step is of the type "Data Grid", which easily allows us to define a data set to work with. In our case, we create a data point called "line_text" and add values to it: In the Meta tab you can define your data points and in the Data tab the actual values. Hit Preview to get an idea how your data set looks like (no surprises here).
To cut the first word off our line_text values, we use a "Replace in string" step. Choose line_text as "In stream field". You can leave "Out stream field" empty in case you want the output to go to line_text (so the original value will be overwritten with the new one). If you want to keep the old value, define a new data point name here.
For this example we will use a regular expression, so make sure you set "use RegEx" to Y. We will use following expression: 

^([a-zA-z]+)(\s{1})(.*)$

This expression makes use of capturing groups (defined by brackets), which will allow us to specify the part that we want to keep. What this regular expression says is basically the following: the value should start with (indicated by ^) any combination of upper or smaller letters (one or more of them indicated by +). This is followed by one space character (indicated by \s{1}) and then basically any character or none can follow. Probably we could even improve this expression, but for now it does what it is supposed to do.

In the "Replace with" field we enter $3, which basically tells Kettle that we want to keep the part that is retrieved by the 3rd capturing group.

Finally, we output the results to a text file. The output looks like this:


Some info about using regular expressions you can find here and here and here and ... 
You can find the XML code for the transformation here.

If you are creating extremely long regular expressions, named capturing groups are quite useful. As the name indicates, you can assign a name to the capturing group and use this name later on in the back reference. Currently, Kettle doesn't support named capturing groups, but I am sure a future version will cover this.

11 comments:

  1. It was very helpful! Thank You

    ReplyDelete
    Replies
    1. Thanks a lot for your feedback! Much appreciated!

      Delete
    2. Hey steiner,
      Great job. This was the article i was looking for. But theres a small issue. I don see any image after "Finally, we output the results to a text file. The output looks like this:" Please look into it. Without it i cant decide whether the third control group is first from left or first from right. Thanks again.

      Delete
  2. Nice 1 brother . It helped me :)

    ReplyDelete
  3. This seems to be exactly what I'm looking for, but without the images... it isn't super helpful.

    ReplyDelete
  4. Hi,

    I want to replace control characters in my data. I have tried using \\000, \\x00, [[:cntrl:]], \p{Cntrl} as the Search string in Replace in String step and Javascript step; however it does not work.

    Can you please help and let me know which pattern to use? THe specific control character I am getting is NUL (\x00).

    Thanks,
    Vinod

    ReplyDelete
    Replies
    1. I usually build and test my regular expressions in a dedicated regex app ... there are also some quite nice online regex builder/checker sites available. Try to do this first before using them in PDI. This way you at least know that the problem is not with you regex.

      Delete
  5. I capture dates from a date string, but I want to replace non matching strings with empty. Can you help me with that?
    Here is what I did so far. search ((19|20)\d\d)\d{0,1}[- \/.,](0{0,1}[1-9]|1[012])[- \/.,](0[1-9]|[12][0-9]|3[01]) and replace with $1-$2-$3.

    The problem is that non matching strings (eg. year only) and stops the stream when converting to date.

    Thank you

    ita

    ReplyDelete
    Replies
    1. I think you should try to make a regular expression capture group for the whole date. So it either find the whole date or not. Currently you check for parts of the date if I understood correctly. This should solve your problem. Also, try to use something like http://regexr.com for testing ... there are also examples there in the Community section.

      Delete
    2. Thanks, but I don't think it would solve it. My goal is not only to capture but to impose a uniform date format that kettle can parse easily. Or am I missing the point?

      Delete
    3. Well, make sure you output the capture to a new field using the "out stream field" option and output the full date, not just part of it - you cannot ask Kettle to parse a year to a date, so you have to output the whole date (null values are ok in case there is no match). Then use a "Select Values" Step and its "Metadata" tab to convert to type date using the correct date mask. Something along these lines ... I don't have a chance right now to try it but from memory it should work like this.

      Delete