Pentaho Data Integration (Kettle): Sourcing data from Hadoop Hive
- Software: PDI/Kettle 4.1
- Knowledge: Beginner
Has your company recently started using Hadoop to cope with enormous amounts of data? Have you been using Kettle for your ETL so far? As you are probably aware, the Kettle Enterprise Edition now lets you create MapReduce jobs for Hadoop. If you want to stay with the open source version, the good news is that it’s very simple to connect to Hive, a data warehouse system which can be set up on top of Hadoop and queried with SQL-like statements.
If you have one of the latest versions of Kettle installed, you will see that it already ships with the required Hive driver. Hence, setting up a connection to Hive is straightforward.
Create a new database connection
- Create a new transformation
- Click the View tab in the left-hand pane
- Right-click on Database connections and choose New
Alternatively, activate the Design tab, drag and drop a Table input step onto the canvas, open it, and click New to create a new database connection.
In the database settings window, choose Generic database from the list of available databases.
For the connection URL insert the following:
For the driver specify:
Depending on your setup, you might have to provide other details as well, such as a user name and password.
Click Test; the connection should now succeed.
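As a rough illustration of the two values mentioned above: Hive JDBC drivers of the original HiveServer generation, which is the era of PDI 4.1, typically use a URL of the form shown below together with the driver class org.apache.hadoop.hive.jdbc.HiveDriver. The host name is a hypothetical placeholder, and port 10000 and the database name default are only the usual defaults, so treat all of them as assumptions to check against your own Hive installation.

```text
# your-hive-host is a placeholder; 10000 and "default" are the usual defaults (assumed)
Connection URL: jdbc:hive://your-hive-host:10000/default
Driver class:   org.apache.hadoop.hive.jdbc.HiveDriver
```

If your cluster runs the newer HiveServer2 instead, the URL prefix is jdbc:hive2:// and the driver class is org.apache.hive.jdbc.HiveDriver, and the matching driver jar has to be available to Kettle.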
If you want to apply specific settings or register user-defined functions (UDFs), you can do so as follows:
- In the database settings window, click on Advanced in the left-hand pane.
- Insert the required statements in the field entitled Enter the SQL statements (separated by ;) to execute right after connecting
- Click Test again to check whether a connection can be created with these specific settings
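As an illustration of what such statements might look like, the sketch below changes one session setting and registers a UDF. The setting is just one example, and the jar path, function name and Java class are hypothetical placeholders; substitute whatever your Hive installation actually requires.

```sql
-- Example session setting (assumed; use the settings you actually need)
SET mapred.reduce.tasks=4;
-- Make a jar with custom functions available to the session (hypothetical path)
ADD JAR /path/to/my-hive-udfs.jar;
-- Register a function from that jar under a name usable in queries (hypothetical names)
CREATE TEMPORARY FUNCTION my_clean AS 'com.example.hive.udf.MyClean';
```

The comment lines are explanatory only and are best left out when pasting the statements into the field, since Kettle splits the text on semicolons and sends each piece to Hive as it is.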
Everything is set up now for querying Hive from Kettle. Enjoy!