Diethard Steiner on Business Intelligence: Kettle: Sourcing data from Hadoop Hive

Thursday, May 26, 2011

Pentaho Data Integration (Kettle): Sourcing data from Hadoop Hive

Tutorial Details

Software: PDI/Kettle 4.1 (download here)

Knowledge: Beginner

Has your company recently started using Hadoop to cope with enormous amounts of data? Have you been using Kettle for your ETL so far? As you are probably aware, the Kettle Enterprise Edition now lets you create MapReduce jobs for Hadoop. If you want to stay with the open source version, the good news is that it is very simple to connect to Hive, a database that can be set up on top of Hadoop.

If you have one of the latest versions of Kettle installed, you will see that it already ships with the required Hive driver. Hence, setting up a connection to Hive is straightforward.

Create a new database connection

Create a new transformation

Click the View tab on the left-hand side

Right-click on Database connections and choose New

Alternatively, activate the Design tab, drag a Table input step onto the canvas, open it, and click New to create a new database connection.

In the database settings window, choose Generic database from the list of available databases.

For the connection URL insert the following:

jdbc:hive://youramazonec2url:10000

For the driver specify:

org.apache.hadoop.hive.jdbc.HiveDriver

Depending on your setup you might have to provide other details as well.

Click Test and all should be working now.
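
If you would like to check the same connection details outside of Spoon, the following is a minimal Java sketch using plain JDBC with the URL format and driver class given above. The class name HiveConnectionSketch, the helper buildUrl, the empty username and password, and the SHOW TABLES query are illustrative assumptions and not part of the original setup; the Hive JDBC driver and its dependencies would have to be on the classpath for the connection to succeed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class HiveConnectionSketch {
    // Driver class named in this tutorial.
    static final String DRIVER = "org.apache.hadoop.hive.jdbc.HiveDriver";

    // Hypothetical helper: builds a URL in the same form as the tutorial's example.
    static String buildUrl(String host, int port) {
        return "jdbc:hive://" + host + ":" + port;
    }

    public static void main(String[] args) {
        // "youramazonec2url" is the tutorial's placeholder; pass your own host as the first argument.
        String host = args.length > 0 ? args[0] : "youramazonec2url";
        String url = buildUrl(host, 10000);
        try {
            // Throws ClassNotFoundException unless the Hive JDBC jars are on the classpath.
            Class.forName(DRIVER);
            // Empty credentials are an assumption; your setup may require real ones.
            try (Connection conn = DriverManager.getConnection(url, "", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // one table name per row
                }
            }
        } catch (Exception e) {
            System.err.println("Could not query " + url + ": " + e);
        }
    }
}
```

If this sketch fails with a ClassNotFoundException, the driver is missing from the classpath; if it fails with an SQLException, the host, port, or credentials are the more likely cause.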

If you want to use any specific settings or user-defined functions (UDFs), you can set them up as follows:

In the database settings window, click on Advanced in the left-hand pane.

Insert the statements you need in the field entitled Enter the SQL statements (separated by ;) to execute right after connecting

Click Test again to check that a connection can be created with these specific settings
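
To give an idea of what such a semicolon-separated list might look like, here is a small Java sketch that splits an example script into its individual statements. The jar path, the function name my_udf, the class com.example.MyUdf, and the SET value are all hypothetical, and the naive splitting shown here is only an illustration, not a description of how Kettle itself parses the field.

```java
import java.util.ArrayList;
import java.util.List;

class InitSqlSketch {
    // Splits a semicolon-separated script into trimmed, non-empty statements.
    // Naive on purpose: it does not handle semicolons inside quoted strings.
    static List<String> splitStatements(String script) {
        List<String> statements = new ArrayList<>();
        for (String part : script.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                statements.add(trimmed);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        // Hypothetical example of what could be pasted into the Advanced field.
        String script = "ADD JAR /path/to/my-udfs.jar; "
                + "CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'; "
                + "SET mapred.reduce.tasks=4";
        for (String statement : splitStatements(script)) {
            System.out.println(statement);
        }
    }
}
```

In the Kettle dialog you would paste only the statements themselves, separated by semicolons, not the Java code.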

Everything is set up now for querying Hive from Kettle. Enjoy!

6 comments:

Troya, August 16, 2011 at 12:22 AM
I tried to connect the BI server to Hadoop Hive, but an error occurred.

I have posted this problem on the Pentaho forum; please check it. I would appreciate any help.

http://forums.pentaho.com/showthread.php?83490-Pentaho-with-hive-datasource-FAILED

Regards,

Troya

Unknown, August 16, 2011 at 8:02 AM
I just saw that your questions were answered on the forum. Yes, I think this works with the Enterprise Edition only.

Troya, October 13, 2011 at 7:13 PM
So do you mean that connecting Hive to reporting is only possible with the Enterprise Edition? What if I use the Pentaho Community Edition?

Unknown, October 14, 2011 at 11:22 AM
You can access Hive with Kettle CE, but for the BI server, Hive access is EE only. (At least that was the case half a year ago.)

Troya, October 17, 2011 at 12:18 AM
Do you have a simple project that uses Pentaho and Hadoop?

I tried using the reference from
http://sandbox.pentaho.com/2011/04/comprehensive-how-two-pentaho-and-hadoop-tutorial/

but almost all of the projects and jobs included in the documentation failed when I imported them into Data Integration.

I use pdi-ce 4.2.

Unknown, October 17, 2011 at 10:13 AM
At the start of this tutorial you can find a download link to the example. If you have all the connection details correct, it should just work.