Comments on Diethard Steiner on Business Intelligence: "Kettle Transformation Logging and Change Data Capture (New)"

Anonymous (2016-04-26):
A great tutorial, just what I wanted. However, my transformation fails to run. I am getting the error "Data truncation: Incorrect datetime value: '1900-01-01 04:30:00' for column 'STARTDATE' at row 1". My Kettle job is set up exactly the way it is described here.

I have two variables, "start_date_range" and "end_date_range", and I am following the second CDC approach, where I want my max date to come from my SQL table.

Any suggestions? What am I doing wrong?

Sadakar Pochampalli (2015-11-19):
This comment has been removed by the author.
Fabio (2014-01-29):
Thanks Diethard. I'm actually considering the two options and going for a hybrid approach ;)

Diethard (2014-01-29):
Thanks for your feedback. Yes, I see this is quite cumbersome. In such a case I would usually create an additional custom logging table. If you want to stick with the PDI log only, one option would be to write the site name into the log. This is not ideal when retrieving the info with a SQL query, because parsing the log field can be quite performance intensive, but it works. You could just use a regex in your SQL query to extract the value.
Alternatively, you could create a fairly empty job whose job name includes the site name. This job is called by the main job and has no logic other than calling the generic transformations. This makes it easier to query the log info. There might be some other options as well ...

Fabio (2014-01-28):
Firstly, apologies - I know this is an old post, but it's the best fit for what I'm trying to achieve.

Secondly, I'm relatively new to PDI, having started development with it only a couple of months ago. It's such a great tool, though, that I managed to deliver some 3 or 4 integration projects during this time.

Now straight to business: I have a job that needs to loop through the same set of transformations for every site (same DB schema across the country) from which I'm bringing data into a data warehouse. Everything is done dynamically at job level to make sure that connections and other pieces of config are switched to the next site every time one is finished, until I have run through all of them.

The only thing I cannot change is the transformation names, because they won't accept variables. The problem here is that I absolutely love and depend on the out-of-the-box logs for the built-in CDC feature to work, and it uses the transformation names to keep track of start/end dates. I have to record this at transformation + site level, because they will sometimes run at different times of the day (in case comms have been down, for example).

The only way I could find around this was physically creating copies of the whole set of transformations for each site, manually renaming them (adding the site name) and running one job per site. But this makes maintenance/enhancements a pain in the butt, because it means applying every tiny change 10 times. Deployment to QA and Prod is also very painful.

Has anyone ever run into this same problem? Suggestions?

This AWESOME blog has been favourited :) Keep up the good tips!

Thanks a lot.

Diethard (2014-01-21):
Only fields that have a $ icon next to them accept variables.

Mahesh (2014-01-21):
Is there a way to provide variables for the 'Maxdate table' and 'Maxdate field' fields?

Diethard (2013-03-26):
You are welcome!
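The log-parsing idea suggested a few comments up (writing the site name into the log and pulling it back out with a regex) could be sketched roughly like this. The table layout, the `SITE=` marker convention, and all names here are assumptions for illustration, not PDI's actual schema:

```python
import re
import sqlite3

# Hypothetical PDI transformation log table; in this sketch each
# transformation writes a marker line like "SITE=<name>" into its log.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TRANS_LOG (TRANSNAME TEXT, ENDDATE TEXT, LOG_FIELD TEXT)")
conn.executemany(
    "INSERT INTO TRANS_LOG VALUES (?, ?, ?)",
    [
        ("tr_load_sales", "2014-01-28 06:00:00", "... SITE=london ... 120 rows written ..."),
        ("tr_load_sales", "2014-01-28 07:00:00", "... SITE=leeds ... 98 rows written ..."),
    ],
)

def site_of(log_text):
    """Pull the site name back out of the free-text log field."""
    m = re.search(r"SITE=(\w+)", log_text)
    return m.group(1) if m else None

# Latest end date per (transformation, site), i.e. CDC bookkeeping at
# transformation + site granularity, without renaming any transformation.
latest = {}
for name, end, log in conn.execute("SELECT TRANSNAME, ENDDATE, LOG_FIELD FROM TRANS_LOG"):
    key = (name, site_of(log))
    latest[key] = max(latest.get(key, ""), end)
```

As Diethard notes, scanning a free-text log column this way is slow compared with a dedicated custom logging table, so it only suits modest log volumes.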
Ravan (2013-03-25):
Thanks a lot, Diethard :) It's working ...

Diethard (2013-03-25):
Well, the logging tables (like the one set up in this tutorial) are shared by all transformations [if you use the same name for the logging table] (and if you set them up for a job as well, they are available for all jobs). It's worth having a look at this table to understand what is stored in it ... you will see that there is a transformation name in it as well. So if you use the built-in CDC in more than one independent transformation, it should work like this. Obviously, always test your setup!
I am not too sure what you mean by CDC on a job level ... the built-in CDC functionality is normally defined at the transformation level. So if you have several independent fact transformations, you have to define the built-in CDC for each of them.
But as said before, this is just a very simple CDC logging ... I hardly ever use it, because usually my requirements are more complex.
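As a rough illustration of the mechanism described above: one shared log table keyed by transformation name, where the start of each date range is the last logged end date for that transformation and the end is "now". This uses sqlite3 as a stand-in database; the column names mirror PDI's log table defaults but are assumptions in this sketch:

```python
import sqlite3
from datetime import datetime

# One shared log table for all transformations, keyed by TRANSNAME.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TRANS_LOG (TRANSNAME TEXT, STATUS TEXT, ENDDATE TEXT)")
conn.executemany("INSERT INTO TRANS_LOG VALUES (?, ?, ?)", [
    ("tr_fact_orders",  "end", "2013-03-24 02:00:00"),
    ("tr_fact_orders",  "end", "2013-03-25 02:00:00"),
    ("tr_fact_returns", "end", "2013-03-20 02:00:00"),
])

def date_range(transname, fallback="1900-01-01 00:00:00"):
    """Start = last successful end date for THIS transformation name,
    end = now. Each transformation gets its own independent window."""
    row = conn.execute(
        "SELECT MAX(ENDDATE) FROM TRANS_LOG WHERE TRANSNAME = ? AND STATUS = 'end'",
        (transname,),
    ).fetchone()
    start = row[0] if row[0] is not None else fallback
    return start, datetime.now().strftime("%Y-%m-%d %H:%M:%S")
```

Because the lookup is per transformation name, several independent transformations can share one table without interfering, which is exactly why renamed copies (as in Fabio's case) each get their own window.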
Ravan (2013-03-25):
Thanks for the quick response, Diethard.
I didn't find the Dates tab to maintain the max date field & table name at the job level. Can we use the same log table for the whole job (multiple transformations)? I see options in Get System Info (start date range (job), end date range (job)).

FYI - I am using Kettle 4.3.

Diethard (2013-03-25):
Setting up the logging for jobs is similar to setting it up for transformations, so this should be fairly straightforward. There is no built-in option to set up CDC at the job level. If your CDC requirements are not that simple (and often they are not), you will have to create your own solution for it, i.e. using some transformations that you link together in a job.
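Since there is no built-in job-level CDC, a hand-rolled control table along these lines is one way to build your own. Everything here (table, column, and job names) is made up for the sketch; the "safe approach" is reflected by only committing the window after the whole job has finished successfully:

```python
import sqlite3

# A minimal custom CDC control table, one row per job, standing in for
# the missing built-in job-level CDC. All names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE cdc_control (
    job_name  TEXT PRIMARY KEY,
    last_date TEXT NOT NULL
)""")

def get_window(job_name, now, initial="1900-01-01 00:00:00"):
    """Return (start, end) for this run; start is the last committed end."""
    row = conn.execute(
        "SELECT last_date FROM cdc_control WHERE job_name = ?", (job_name,)
    ).fetchone()
    return (row[0] if row else initial, now)

def commit_window(job_name, end):
    """Call only after the whole job succeeded (the 'safe approach'):
    a failed run leaves last_date untouched, so the data is re-read."""
    conn.execute(
        "INSERT OR REPLACE INTO cdc_control (job_name, last_date) VALUES (?, ?)",
        (job_name, end),
    )

start, end = get_window("jb_load_dwh", "2013-03-25 04:00:00")
# ... run all the job's transformations filtered on [start, end) ...
commit_window("jb_load_dwh", end)
```

In PDI terms, `get_window` would be a small first transformation that sets `${start_date}`/`${end_date}` as variables, and `commit_window` a final transformation that only runs on the success hop.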
Ravan (2013-03-25):
Hi, I would like to implement Kettle job logging and Change Data Capture (the safe approach). Can anyone please help me with this?

Thanks in advance,
Ravan

Diethard (2013-03-08):
As I am not familiar with your setup, I can only recommend filing a JIRA case on jira.pentaho.com in case you think it's a bug.

Anonymous (2013-03-07):
I am finding that in Pentaho 4.3 the transformation seems to hang trying to get a LOCK on the transformation log table. Does anyone know if this is a known issue?

Anonymous (2013-02-18):
Andrius, maybe you've learnt this by now, but performance analysis is done using the step performance logging. After you have the data you must - pretty much like in the GUI - look for steps with a high input buffer and zero or little output buffer. This is the bottleneck signature.

Diethard (2012-05-25):
Once the lookup/join data is available, it should just release the row.

You can also set up email notifications etc., so you can have your process monitor the log and get notified by email in case something unexpected happens.

Thanks a lot for your feedback, much appreciated!

If you want to quickly learn more about Kettle, check out Maria's book: http://www.packtpub.com/pentaho-data-integration-4-cookbook/book
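A minimal sketch of the monitor-the-log-and-notify idea mentioned above. The job log table layout is assumed, and the actual email dispatch (SMTP, or PDI's Mail job entry) is deliberately left out; this only scans for failed runs and builds the alert text:

```python
import sqlite3

# Hypothetical PDI job log table with an error counter per run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE JOB_LOG (JOBNAME TEXT, STATUS TEXT, ERRORS INTEGER, LOGDATE TEXT)")
conn.executemany("INSERT INTO JOB_LOG VALUES (?, ?, ?, ?)", [
    ("jb_load_dwh", "end",  0, "2012-05-25 02:00:00"),
    ("jb_load_dwh", "stop", 3, "2012-05-26 02:00:00"),
])

def failed_runs(since):
    """Runs with at least one error since the given timestamp."""
    return conn.execute(
        "SELECT JOBNAME, LOGDATE, ERRORS FROM JOB_LOG "
        "WHERE ERRORS > 0 AND LOGDATE >= ? ORDER BY LOGDATE",
        (since,),
    ).fetchall()

def alert_body(since):
    """Assemble the email body; None means nothing to report."""
    rows = failed_runs(since)
    if not rows:
        return None
    return "\n".join(
        f"{name} failed at {when} with {errs} error(s)" for name, when, errs in rows
    )
```

A scheduler would run this periodically and hand a non-None body to whatever mailer is available.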
Andrius (2012-05-25):
You're right, it's enough for development. But when we deploy the solution, it should switch to a fully automated mode: no GUI, just email or other notifications sent by the ETL engine.

Is join a blocking transformation in Pentaho? I believe that in SSIS a one-to-one join doesn't block the flow; it releases the row as soon as the pair is found.

What I forgot to say is - thanks for the great post. It's really useful and handy. I found it very easy to set up and test by reading your blog. That's what I like about Pentaho - getting started easily. And this was my first day with Pentaho ;)
Thanks again.

Diethard (2012-05-25):
Well, I normally use a given fixed data set for testing. I try to tune the steps by observing the throughput and also looking at the total execution time (among others). It's a game, yes, but it's fairly straightforward, as you can do it all via the GUI (Spoon) and quickly understand what's going on.
All steps start in parallel (waiting for the data to arrive). And yes, some steps are special, like joins, because they need to have all the data first before they can pass it on.

Andrius (2012-05-25):
That's not a friendly way to do it. I can't tell what's wrong just by looking at the throughput. There could be a step doing 10000 r/s because of a large load, or 100 r/s because of a complex calculation that is already tuned to the max. So you need to compare against a previous execution to find out which step is causing trouble.
That comparison is only valid if the row count is the same and the day/week/hour is similar. If not, you are in deep trouble. I don't have time for these games; it needs to be automated, which is not possible with this tool at the moment.

Even though all steps are started in parallel, in reality some do nothing until the upstream steps finish their work (aggregation, for example). But the way Pentaho starts all steps in parallel spoils the statistics completely.

Well, I understand that a complex deployment requires more sophisticated tools. I just wanted to check how much of this is doable with Pentaho.

Diethard (2012-05-25):
Well, all transformation steps are executed in parallel. If you want to understand which step is slowing down your transformation, one of the main things to watch is the throughput.

Andrius (2012-05-25):
Check the step logging screen. There is just "Log date", of date type. "Step start", "Step end" and "Duration" do not exist there, so we cannot tell how much time it took to execute a step. Unless you see this in some other version? I have 4.3.0.

Diethard (2012-05-25):
Well, the out-of-the-box functionality is for simple use cases. If you require more complex logging for CDC, you will have to create it yourself. Also, not all CDC techniques are time based, so there is no out-of-the-box functionality for those cases in Kettle either - understandably, as the logic can sometimes be quite complex. It's not too difficult to create it yourself, though.
I am not too sure what you refer to with "How can we analyze the performance without knowing the duration of each atomic step inside the transformation?" This sounds like another pair of shoes to me, not really related to CDC: if you want to log the performance of each step within your transformation, you can enable this via the standard logging settings. Hope this helps.

Andrius (2012-05-25):
I really liked the idea of having CDC and logging out of the box with Pentaho.

But ... it's annoying not to have step start/end times logged in the database. How can we analyze the performance without knowing the duration of each atomic step inside the transformation? The execution time of the entire transformation can be calculated by deducting "Date range end" from "Log date". Unfortunately, that's not enough for us :( Thus we need to develop and maintain our own logging tables.

We could also run into performance issues with the CDC model. It looks nice for a simple deployment, but my case is: tons of steps to be tracked, a near-realtime DW with continuous ETL execution, and statistics are very important, so no truncation. This way we generate thousands of rows in the logging table per day, millions per month.
Sure, we can move old rows to a separate aggregated table and delete them. But even that is not enough, unless Pentaho stores the "Date range end" somewhere else? Actually, it looks like the only place is the database, because if I disable the logging to the DB, the CDC does not work. Thus Pentaho uses max() to find the last valid "Date range end" from the previous execution, and this is not very nice. So again we need to build and maintain our own CDC tables.

Regards,
Andrius

Diethard (2011-02-18):
I didn't have a chance yet to look at the latest KFF developments, so for now, unfortunately, I cannot answer your question. I'll try to get up to date on the KFF project in the next few weeks (it's quite a promising project).
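The two calculations discussed in this thread - run duration as "Log date" minus "Date range end", and the high-input/low-output "bottleneck signature" from the step performance log - can be sketched as follows. The column semantics and the buffer-ratio threshold are assumptions:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def trans_duration(logdate, date_range_end):
    """Whole-transformation run time in seconds, computed as discussed
    above: "Log date" minus "Date range end" of the same log row."""
    return (
        datetime.strptime(logdate, FMT) - datetime.strptime(date_range_end, FMT)
    ).total_seconds()

def bottlenecks(perf_rows, ratio=10.0):
    """Apply the bottleneck signature: flag steps whose input buffer
    stays much fuller than their output buffer.
    perf_rows: (stepname, input_buffer_rows, output_buffer_rows)."""
    return [
        step
        for step, inp, outp in perf_rows
        if inp > 0 and inp >= ratio * max(outp, 1)
    ]
```

The `ratio` threshold would need tuning per workload; the point is only that the comparison comes from the logged buffer sizes rather than from watching Spoon.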