Comments on Diethard Steiner on Business Intelligence: "Pentaho Data Integration: Designing a highly available scalable solution for processing files"

Diethard (2015-01-23):
The latest version of PDI has support for YARN. This might be EE only though.

tc (2015-01-22):
I am interested to see how people have extended this framework. The adoption rate of Hadoop has me working with files at an increased pace. In one particular project I am creating additional Hadoop-optimized files (Avro, ORC, etc.) from the originals.

Diethard (2013-12-02):
I was trying to say that it is down to the data integration/ETL developer to further improve this model. Consider this a starting point; there are certainly a few more things you can improve. If something you want to do is not possible with today's Kettle version, you can submit a feature request on jira.pentaho.org.

Kanwar Asrar Ahmad (2013-12-02):
Thanks, Diethard, for your quick response. Noted your point!
It would be super cool if Matt Casters tried to optimize this model.

Best,
Kanwar Asrar Ahmad

Diethard (2013-12-02):
Thanks for your feedback and for sharing your experience! As always, there is room for improvement ... consider this an example only.

Anonymous (2013-12-01):
Hi Matt and Diethard,
I followed your above-mentioned ETL tutorial as-is and it works quite well, as you said. Below are some findings I would like to discuss with you.

(PC specs: 3.4 GHz Core i7, 8 GB RAM)

1. I ran your model with ~30,000 CSV files (1 KB to 1,000 KB, 21 columns); it returned the same exception Fatima mentioned above.

2. I modified it to read a CSV file and dump it directly into a MySQL table. Parallelism and scalability are achieved, but it is much slower than loading the CSVs into the database sequentially.

3. When the solution is run in a loop ("Process one queued file per available slave" to "Dummy" to "Slave servers status check"), it keeps assigning a slave server to the last group even if its finished status is OK. With your five-file group, it keeps assigning a slave server to the fifth file.

4. When run in the same loop, slave servers are assigned in the order Slave1, Slave2, Slave3, Slave4, irrespective of CSV file size!

mmm (2013-11-29):
Hi Diethard,
I ran this job in a loop over a bulk of 63,000 files and I am getting the following exception in the job "Process one queued file per available slave". Can you please give any suggestions?

2013/11/29 16:51:17 - Get filename and slave.0 - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
2013/11/29 16:51:17 - started_date.0 - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
2013/11/29 16:51:17 - Update FILE_QUEUE.0 - Finished processing (I=1, O=0, R=1, W=1, U=1, E=0)
2013/11/29 16:51:17 - Process a file - Starting entry [Load a file]
2013/11/29 16:51:19 - Load a file - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : Error running job entry 'job' :
2013/11/29 16:51:19 - Load a file - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : org.pentaho.di.core.exception.KettleException:
2013/11/29 16:51:19 - Load a file - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : java.lang.NullPointerException
2013/11/29 16:51:19 - Load a file - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at java.lang.Thread.run (null:-1)
2013/11/29 16:51:19 - Load a file - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at org.pentaho.di.job.entries.job.JobEntryJobRunner.run (JobEntryJobRunner.java:68)

Regards,
Fatima.

Kanwar Asrar Ahmad (2013-11-28):
Yuvakesh,
Did you find a way to process all 10 files in a loop using 4 Carte servers?

Diethard (2013-04-03):
Thanks for letting me know! Not too sure what happened ... I will try to find some time this weekend to fix this.

Dave (2013-04-03):
Hi Diethard,
The screenshots and attachment on this post seem to be missing. Are they recoverable?

Kind regards,

Dave

Diethard (2012-11-23):
OK, I had a look at this. It's a long time ago that I wrote this, so you were actually correct: the job will only process 4 files. You can certainly improve it so that it loops over the remaining files.
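Diethard's suggestion above (keep looping over the remaining files instead of stopping after one file per slave) can be sketched as follows. This is a minimal, hypothetical illustration of the assignment loop only, not the tutorial's actual Kettle job; the file and slave names are placeholders.

```python
# Hypothetical sketch: drain the whole queue instead of stopping after
# one pass of "one file per available slave". The FILE_QUEUE/slave idea
# follows the tutorial; the data structures here are illustrative.
from collections import deque

def process_queue(files, slaves):
    """Assign queued files to slaves, round-robin, until none remain.
    Returns the list of (file, slave) assignments in processing order."""
    queue = deque(files)
    assignments = []
    while queue:                      # keep looping over the remaining files
        for slave in slaves:          # one file per available slave per pass
            if not queue:
                break
            assignments.append((queue.popleft(), slave))
    return assignments

if __name__ == "__main__":
    files = [f"file_{i}.csv" for i in range(10)]
    slaves = ["slave1:8081", "slave2:8082", "slave3:8083", "slave4:8084"]
    for f, s in process_queue(files, slaves):
        print(f"{f} -> {s}")
```

With 10 files and 4 slaves this yields three passes (4 + 4 + 2 files) in one execution, which is the behaviour Yuvakesh asks about below; in the real job the inner step would hand the file to a Carte slave and update FILE_QUEUE.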
Yuvakesh (2012-11-22):
Thanks for the quick reply, Diethard!
The above process can process only 4 files, one per slave; the remaining files are processed in the next run. You even mention that in the last two lines of the tutorial above. Kindly let me know how to process more than 10 files in one execution using 4 slaves only.

Thanks!

Diethard (2012-11-22):
Hi Yuvakesh,
Thanks for your feedback! You mainly have to thank Matt Casters for this! The above job should process all your files, not just 4 of them (at least it did at the time I wrote this tutorial).

Best regards,
Diethard

Yuvakesh (2012-11-22):
Hi Diethard,
Thanks a lot for implementing a parallel process in Pentaho. For the above process, I would like to know how to execute more than 4 files in one loop using only the current 4 slaves.

Current behaviour: if there are 10 files in the queue, the job ends after archiving 4 files, while FILE_QUEUE still holds more files to process. I would like to process all 10 files in one shot.

Thanks.

Diethard (2011-12-01):
No, if you deliver one file, it will only read this one file in one go.

AndDegs (2011-11-30):
Curious question. My scenario is that I have one huge file which cannot be split into different files. How will the processing/transformation of the file happen? Will it be split on the fly and the work distributed to each slave server?

Diethard (2011-11-29):
I am happy to hear you got the process working. You do not have to define slave server entries in the job file.
All the slave details get passed in dynamically, so there shouldn't be a need for this.

Anonymous (2011-11-29):
I got it working. I wanted to submit jobs to the Carte server through kitchen.bat, so I did this: changes to the job XML are required to tell Kettle to run the job and transformations on the remote server. A slave server entry needs to be added to the job's XML/.kjb file; below is an example of it. I have also attached an image showing the slave server entries.

Diethard (2011-11-02):
Hi Vishal. Is your comment regarding "Pentaho Data Integration: Designing a highly available scalable solution for processing files", or is it a general question?

Vishal (2011-11-02):
Hi Diethard,
First of all, thanks a lot for your detailed blog. It helped a lot.
I run jobs through Kitchen, but I don't see any option to make them execute remotely. Can you tell me how I can do that?

Thanks,
Vishal

Diethard (2011-07-12):
Thanks a lot for your feedback! Much appreciated!
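On the remote-execution question: one building block is that each Carte slave exposes a small HTTP API that jobs and monitoring tools can call. The sketch below queries a slave's XML status page; the `/kettle/status/?xml=Y` servlet and the default `cluster`/`cluster` login are standard Carte features, while the host, port, and the exact shape of the returned XML should be treated as assumptions here.

```python
# Hedged sketch: check a Carte slave's status over HTTP.
# Host/port are placeholders; adjust to your carte-config setup.
import base64
import urllib.request
import xml.etree.ElementTree as ET

def status_request(host, port, user="cluster", password="cluster"):
    """Build an authenticated GET request for Carte's XML status page."""
    url = f"http://{host}:{port}/kettle/status/?xml=Y"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

def parse_status(xml_text):
    """Pull the status description out of a <serverstatus> document."""
    root = ET.fromstring(xml_text)
    node = root.find("statusdesc")
    return node.text if node is not None else None

if __name__ == "__main__":
    req = status_request("localhost", 8081)
    with urllib.request.urlopen(req) as resp:   # requires a running Carte
        print(parse_status(resp.read().decode()))
```

The same pattern is what a "Slave servers status check" step effectively does before handing the next queued file to a slave.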
Gururaj (2011-07-12):
Very nice walkthrough and samples. Thanks a lot!

Regards,
Gururaj

Diethard (2011-06-18):
Hi Sunil,
The idea is that all of them are executed in parallel. All Carte instances should be used unless there is an error. The slave_list table should tell you how many instances are available.

"I can have the Dummy JavaScript step in the Load a file job replaced with my required transformation step, correct?"

Yes, exactly.

Best regards,
Diethard

Sunil (2011-06-18):
Hi Diethard,
Sorry for the delay in my reply; I was held up with some urgent tasks these days. I got the job running in my latest PDI version. The problem was with the HTTP Client step in my old version, which didn't have the response_time field option.

I executed the job as per your example and got the first 4 files processed in the first run. I had started 4 Carte instances on four different ports, but during execution only 3 Carte servers were used: two files were executed on port 8082 and port 8084 was unused. Yet according to the file_queue table we have four Carte servers assigned to the first four files. What could be the reason for this?

Also, what I understood about the "execute job/transformation for each input row" option is that it creates a sequential loop. Or is the triggering parallel? That is, will the file-processing job be triggered in parallel? If I have four transformations where each one takes 10 minutes to execute, will these four transformations run in parallel on four different Carte instances? In practical terms, will I get the four transformations executed in, say, 12 or 13 minutes instead of 4 x 10 = 40 minutes?

I can have the Dummy JavaScript step in the Load a file job replaced with my required transformation step, correct?

Many thanks in advance for your replies.

Regards,
Sunil George.

Diethard (2011-06-05):
If you see some problems in the way Carte is behaving, it is best to open a JIRA case on jira.pentaho.com.
But first of all, I advise you to use a recent version of PDI.
Also, if you want a start, stop and restart facility, please submit a request for that on the JIRA website as well.
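As a closing illustration of the speedup Sunil asks about earlier in the thread: four tasks of equal duration spread over four workers finish in roughly the duration of one task, not four. A scaled-down simulation (sleeps stand in for transformations, a thread pool stands in for the four Carte slaves; the timings are illustrative only):

```python
# Minimal simulation of parallel vs. sequential execution time.
# Four "transformations" (sleeps) on four workers finish in ~task_time,
# not 4 * task_time, mirroring the 10-min-vs-40-min question above.
import time
from concurrent.futures import ThreadPoolExecutor

def run_tasks(n_tasks=4, n_workers=4, task_time=0.2):
    """Run n_tasks sleep-based tasks on n_workers threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(lambda _: time.sleep(task_time), range(n_tasks)))
    return time.perf_counter() - start

if __name__ == "__main__":
    parallel = run_tasks(n_workers=4)
    sequential = run_tasks(n_workers=1)
    print(f"parallel: {parallel:.2f}s, sequential: {sequential:.2f}s")
```

In practice there is per-file dispatch and network overhead, which is why the thread's estimate of "12 or 13 minutes instead of 40" for four 10-minute transformations is about right.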