Thursday, May 26, 2011

Kettle: Sourcing data from Hadoop Hive

Pentaho Data Integration (Kettle): Sourcing data from Hadoop Hive

Tutorial Details


  • Software: PDI/Kettle 4.1 (download here)
  • Knowledge: Beginner


Has your company recently started using Hadoop to cope with enormous amounts of data? Have you been using Kettle so far for your ETL? As you are probably aware, the Kettle Enterprise Edition now lets you create MapReduce jobs for Hadoop. If you want to stay with the open source version, the good news is that it's very simple to connect to Hive, a data warehouse system that runs on top of Hadoop.

If you have one of the latest versions of Kettle installed, you will see that it already ships with the required Hive JDBC driver. Hence, setting up a connection to Hive is straightforward.

Create a new database connection

  1. Create a new transformation
  2. Click the View tab on the left-hand side
  3. Right click on Database connections and choose New

Alternatively, you can activate the Design tab, drag and drop a Table input step onto the canvas, open it and click New to create a new database connection.

In the database settings window, choose Generic database from the list of available connection types.

For the connection URL, insert the following (replace youramazonec2url with the host name of your Hive server):

jdbc:hive://youramazonec2url:10000

For the custom driver class name, specify:
org.apache.hadoop.hive.jdbc.HiveDriver

Depending on your setup you might have to provide other details as well.
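
If you want to verify the connection details outside of Kettle first, a minimal Java sketch along the following lines should do. It assumes the Hive JDBC driver jar and its dependencies (the Hadoop and Thrift libraries) are on your classpath; youramazonec2url is again a placeholder for your own host, and /default simply names Hive's default database:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveConnectionTest {
    public static void main(String[] args) throws Exception {
        // Load the same driver class that Kettle uses
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Same URL as in the Kettle connection dialog; 10000 is the default Hive server port
        Connection con = DriverManager.getConnection(
                "jdbc:hive://youramazonec2url:10000/default", "", "");

        // List the available tables as a simple sanity check
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}

If this prints your table names, the same URL and driver class will work in the Kettle dialog.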

Click Test to confirm that the connection works.
If you want to use any specific settings or user defined functions (UDFs), you can register them as follows:

  1. In the database settings window, click on Advanced in the left hand pane.
  2. Insert these statements in the field entitled Enter the SQL statements (separated by ;) to execute right after connecting (see the example below)
  3. Click Test again to check if a connection can be created with these specific settings

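For example, statements like the following would register a UDF and tweak a session setting every time the connection is opened. The jar path, function name and class are hypothetical placeholders; replace them with your own:

add jar /path/to/your-udfs.jar;
create temporary function my_lower as 'com.example.hive.udf.Lower';
set hive.exec.compress.output=true

Any statement Hive accepts within a session can go into this field.
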
Everything is now set up for querying Hive from Kettle.
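
As a quick test, drop a Table input step onto the canvas, point it at the new connection and run a simple HiveQL query. The table name weblogs is just a placeholder for one of your own Hive tables:

SELECT *
FROM weblogs
LIMIT 10

Enjoy!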

6 comments:

  1. I tried to connect the BI Server with Hadoop Hive, but an error happened.

    I have posted this problem to the Pentaho forum, please check it. I need some help...

    http://forums.pentaho.com/showthread.php?83490-Pentaho-with-hive-datasource-FAILED

    Regards,

    Troya

  2. I just saw that your questions were answered on the forum. Yes, I think this works with the Enterprise Edition only.

  3. So do you mean that connecting Hive for reporting only works in the Enterprise Edition? What if I use the Pentaho Community Edition?

  4. You can access Hive with Kettle CE, but for the BI Server, Hive access is EE only (at least that was the case half a year ago).

  5. Do you have a simple project which uses Pentaho and Hadoop?

    I have tried to use the reference from
    http://sandbox.pentaho.com/2011/04/comprehensive-how-two-pentaho-and-hadoop-tutorial/

    but almost all of the projects and jobs included in the documentation failed when I imported them into Data Integration.

    I use PDI CE 4.2.

  6. At the start of this tutorial you can find a download link to the example. If you have all the connection details correct, it should just work.
