Open source business intelligence tutorials: Pentaho, Talend, Jasper Reports, BIRT and more.
Topics: Data Integration, Data Warehousing, Data Modeling, BI Server Setup, OLAP, Reporting, Dashboarding, Master Data Management and many more.
Tuesday, June 28, 2011
Pentaho Data Integration 4 Cookbook
A new book on Pentaho Data Integration is out: Pentaho Data Integration 4 Cookbook! I just received a review sample and I'll be posting a full review in the next few weeks.
Jump over to PacktPub to see the full details ...
Hi Diethard,
I have also received this book and hope to finish reading it in the next two weeks.
I would like to get your feedback on using the Quartz scheduler to schedule transformations and jobs with the help of Action Sequences.
1. Is this a good approach compared to cron-based scheduling?
2. Is there a book we can refer to that covers the details of creating Action Sequences that call PDI jobs?
3. We would specifically like to know how well we can design a workflow using Xactions for scheduling different PDI jobs.
4. How can we communicate between different jobs using Xactions?
5. How can we schedule one Xaction which, upon completion, executes two other Xactions in parallel?
From the Pentaho Solutions book we got the basic idea of calling transformations and jobs from an Xaction, but in-depth details are not available there. Also, when we create Xactions for calling transformations, they show errors that are not possible to debug, even though the same transformations run from Spoon without any issues.
It would be great if you could provide some tips on this topic.
Regards,
Sunil George.
Hi Sunil,
I haven't had a chance yet to read this particular chapter. I've never used Xactions to execute PDI jobs. I guess you are referring to using Xactions with the Quartz scheduler (BI Server). In this case, you will add additional workload to your BI Server, which might impact its performance (it all depends on your setup and how heavy your PDI processes are).
I've always used crontabs to schedule PDI jobs and transformations. Usually I have the ETL processes running on dedicated server(s).
In this case you don't have a nice GUI for scheduling ... Pentaho also offers an Enterprise Edition of PDI with a dedicated server and a GUI which handles scheduling (among other additional features). This is a good approach as well, since it will not impact the performance of your normal BI Server (assuming you set it up on dedicated machines).
Best regards,
Diethard
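To illustrate the crontab approach mentioned in the reply above, here is a minimal Python sketch that composes a crontab line calling Kitchen, PDI's command-line job runner. The install path, job file, log file and schedule are all hypothetical placeholders, and the helper name is made up for this example; only the `-file` and `-level` options are Kitchen's own.

```python
import shlex


def kitchen_cron_line(schedule, kitchen, job_file, log_file, level="Basic"):
    """Return one crontab line that runs a PDI job file through Kitchen.

    schedule : the five cron time fields, e.g. "30 2 * * *"
    kitchen  : path to kitchen.sh (hypothetical install location)
    job_file : path to the .kjb job file (hypothetical)
    log_file : file that collects the job's console output (hypothetical)
    """
    command = [kitchen, f"-file={job_file}", f"-level={level}"]
    # Quote each part so that paths containing spaces survive the shell.
    quoted = " ".join(shlex.quote(part) for part in command)
    # Append stdout and stderr to the log file so failed runs can be inspected.
    return f"{schedule} {quoted} >> {shlex.quote(log_file)} 2>&1"


if __name__ == "__main__":
    # Hypothetical nightly run at 02:30.
    print(kitchen_cron_line(
        "30 2 * * *",
        "/opt/pdi/kitchen.sh",
        "/opt/etl/jobs/load_dwh.kjb",
        "/var/log/etl/load_dwh.log",
    ))
```

The printed line can be pasted into the crontab of the user that owns the ETL processes. Transformations would be run the same way with Pan (`pan.sh`) and a `.ktr` file instead of Kitchen.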
Hi Diethard,
Thanks a lot for your quick reply. I agree with you: Quartz scheduling will be an additional workload for the BI Server.
Just a quick clarification on this topic. Quartz and the Xaction are executed by the BI Server, and the underlying PDI job is executed by the DI server, which will be running on the same machine. Is that correct?
We have an Enterprise Edition here. Are you referring to the scheduler that is available from Spoon?
Also, do you know whether this Spoon-based scheduler will offer more advanced features in future releases, like a GUI for monitoring the scheduled tasks or an option for administrators to design a workflow for scheduling completely different PDI jobs? What I am referring to are the facilities that a scheduler tool like Autosys provides.
Do you have any reference ebook links for learning more about Action Sequences?
Best Regards,
Sunil George.
Hi Sunil,
If you use the open source BI Server, then everything will be executed on the same machine. The same applies to the Enterprise Edition of the BI Server. There is an Enterprise Edition of PDI as well, which comes with a dedicated data integration server and a nice GUI to schedule jobs. You can have this PDI server running on a dedicated machine. There are two ways to schedule jobs: directly from Spoon or by using the PDI web interface.
If you only want to use an open source version, there is another option as well: on a dedicated server, you can start the Carte server, which listens for requests asking for a job to be executed. But there is no GUI for this one as far as I know.
Best regards,
Diethard
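For the Carte option described in the reply above, here is a small sketch of how a client might ask a running Carte instance for its status over HTTP. The host name and port are hypothetical, `cluster`/`cluster` is assumed to be the default login still in place, and the `/kettle/status/` path is an assumption that should be checked against your PDI version.

```python
import base64
import urllib.request


def carte_status_request(host, port, user, password):
    """Build (but do not send) an HTTP request for Carte's status page.

    The '/kettle/status/' path and the 'xml=Y' parameter are assumptions
    based on how Carte commonly exposes its status; verify them locally.
    """
    url = f"http://{host}:{port}/kettle/status/?xml=Y"
    # Carte protects its pages with HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Hypothetical host and port.
    req = carte_status_request("etl-host", 8081, "cluster", "cluster")
    print(req.full_url)
    # To actually contact the server:
    # with urllib.request.urlopen(req, timeout=10) as response:
    #     print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the sketch usable without a live server. The scheduling itself would still come from cron or a similar tool, since Carte only waits for requests.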
Hi Diethard,
Thanks a lot for this information. I haven't tried out the EE of PDI. We were using PDI as part of the EE of the BI suite. I will download it, explore the features and share my comments.
Will this scheduler work with a file-based repository, or does it work only with the Enterprise Repository?
Regards,
Sunil George.
Hi Sunil,
As far as I know the scheduler for PDI EE works only with the enterprise repository. The upside is you get versioning as well!
Best regards,
Diethard
Hi Diethard,
I tried to install PDI v4.2 EE. The Enterprise Console and Data Integration are the two parts of this installation. I have connected to the PDI Enterprise Repository. I created a simple transformation and tried to schedule it using the Spoon GUI scheduler. The scheduled transformation is shown in the Scheduler perspective, but it vanishes once the scheduled time has passed, and the transformation is not executed either. I also tried the Run now option, and it does not give any results either. Am I doing something wrong here?
Also, in the Enterprise Console I have selected "Running Data Integration Server Only" since I don't have a BI Server now. In your reply you mentioned a web-based interface which we can use for scheduling. Can you help me find this option? I am not able to find it in the Enterprise Console.
Regards,
Sunil George.
Hi Sunil,
I tested PDI EE some weeks ago and as far as I remember you can only schedule jobs from the Spoon GUI.
It's fine if you run the Data Integration server only ... you can add your repository (I think using the Enterprise Console) and then you can schedule jobs from there. It's best to refer to the Enterprise documentation, which is really quite good. I followed all the steps mentioned there and got things working. You should have received a link and login details for the documentation when you signed up for the trial.
Best regards,
Diethard
Hi Diethard,
I will try the scheduling for PDI jobs. In the Enterprise Console I am able to configure the DI server. I can see only the Data Integration menu option in the left panel of the Enterprise Console. From there I can configure Carte, register jobs and run them. I am not sure about the scheduling option there. I will try to get the enterprise documentation as you suggested. By the way, if you have the link to the docs, please let me know.
Regards,
Sunil George.
Hi Sunil,
You should have received an email with the link and login details. If not, please get in contact with support.
Best regards,
Diethard