Tuesday, July 23, 2013

Introducing the Kettle Test Framework Beta

Kettle Test Framework (KTF)

Subtitle: Kettle Testing for Non-Java Developers

Announcing the KTF Beta:

Precautions



Please note that KTF is still in beta and has undergone only minimal testing. Please report any bugs on the dedicated Github page so that they can be fixed for everybody's benefit. Do not use it for any production purposes.


You must not run this process in a production environment! Run it only in a dedicated test environment where it is OK to lose all the data in the database. You must run this process against a dedicated test database: it wipes out all your tables!

The small print upfront

Please note that this is a community contribution and is not associated with Pentaho. Use of this framework is at your own risk. The author makes no guarantees of any kind and should not be held responsible for any negative impact.


You should have a solid understanding of Pentaho Data Integration/Kettle. I have made only a minimal attempt to document this framework; the rest is up to you to explore and understand.

Motivation

The main idea behind this framework is to create a basis for testing best practices when working with the Kettle ETL tool. Please add any ideas that could improve this framework as an “improvement” issue on the dedicated Github page.


Code and samples can be downloaded from Github.


Testing data integration processes should be a core part of your activities. Unfortunately, especially for non-Java developers, this is not quite so straightforward (even for Java developers it is not that easy to unit test ETL processes, as highlighted here). This framework tries to fill this gap by using standard Kettle transformations and jobs to run a test suite.

When you create or change a data integration process, you want to be able to check whether the output dataset(s) match the ones you expect (the "golden" dataset(s)). Ideally, this check should be automated as well. By using KTF's standard Kettle transformations and jobs to do this comparison, every data integration architect should be in a position to perform this essential task.


Some other community members have published blog posts on testing before, from which this framework took a lot of ideas and inspiration (especially Dan Moore’s excellent blog posts [posts, github]). Some books on Agile BI methodologies were quite inspirational as well (especially Ken Collier’s “Agile Analytics”).
While Dan focused on a completely file-based setup, I have for now tried to create a framework that works with processes (jobs, transformations) that make use of Table input and Table output steps. In the next phase the aim is to support file-based input and output (CSV, TXT) as well. Other features are listed below.

Contribute!

Report bugs and improvements/ideas on Github.


Features

Let’s have a look at the main features:
  • Supports multiple input datasets
  • Supports multiple output datasets
  • Supports sorting of the output dataset so that a good comparison to the golden output dataset can be made
  • The setup is highly configurable (but a certain structure is enforced - outlined below)
  • Non-conflicting parameter/variable names (all prefixed with “VAR_KTF_”)
  • Non-intrusive: it just wraps around your existing process files (although some parameters for DB connections etc. will have to be defined … but you have probably done this already anyway)

Current shortcomings

  • Not fully tested (only tested with included samples and on PostgreSQL)
  • Currently works only with Table input / Table output steps. Text/CSV file input/output will be supported in a future version (this should not be too complicated to add).
  • Dictates quite a strict folder structure
  • Limited documentation

Project Folder Structure

Stick to this directory layout for now. In future versions I might make this more flexible.


  1. Adjust the parameter values in config/.kettle/kettle.properties and the JNDI details in config/simple-jndi/jdbc.properties (a sketch of a JNDI entry follows after this list).
  2. Your transformations have to be prefixed for now with “tr_” and your jobs with “jb_”. Do not use any special characters or spaces in your job/transformation names. Your transformations and jobs have to be saved within repository/main. No subfolder structure is currently allowed within this folder.
  3. Within the config/test-cases directory create a folder for each process you want to test. A process can be a transformation or a job. Name these folders exactly the same as the job/transformation you want to test (just without the file extension). Each process folder must have an input and an output folder, which hold the DDL, the Kettle data type definitions and, in the case of the output, the sort order definition (see the tr_category_sales sample on how to set this up). If your output dataset does not require any sorting, create an empty sort def file (see the tr_fact_sales example). Note that KTF can handle more than one input/output dataset. A sketch of this folder layout follows after this list.
  4. The process folder must also contain at least one test case folder, which has to have a descriptive name (e.g. “simpletest”). A test case folder must contain an input and an output folder, which each hold the dedicated datasets for this particular test case. The output folder holds the golden output dataset(s), i.e. the dataset(s) you want to compare your ETL output to.
  5. Users working on Windows: For all the CSV output steps in transformations under /repository/test set the Format to Windows (Content tab). KTF has not been tested at all on Windows, so you might have to make some other adjustments as well.
  6. Add environment variables defined in config/set-project-variables.sh to your .bashrc. Then run: source ~/.bashrc
  7. Start Spoon (this setup requires PDI V5) or run the process from the command line (see the command-line sketch after this list).
  8. Run the test suite
  9. Analyze the results in the tmp folder. If there is an error file for a particular test case, you can easily inspect the differences visually like this:
dsteiner@dsteiner-Aspire-5742:/tmp/kettle-test/tr_category_sales/sales_threshold$ diff fact_sales_by_category.csv fact_sales_by_category_golden.csv
2c2
< 2013-01-01;Accessories;Mrs Susi Redcliff;399;Yes
---
> 2013-01-01;Accessories;Mrs Susi Redcliff;399;No
5c5
< 2013-01-02;Groceries;Mr James Carrot;401;No
---
> 2013-01-02;Groceries;Mr James Carrot;401;Yes
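
To make step 1 a bit more concrete, here is a minimal sketch of what a JNDI entry for a PostgreSQL test database could look like. The connection name ktf_test as well as the host and credentials are placeholders, not names required by KTF; use whatever your processes and kettle.properties parameters refer to.

# append a JNDI definition to config/simple-jndi/jdbc.properties
cat >> config/simple-jndi/jdbc.properties <<'EOF'
ktf_test/type=javax.sql.DataSource
ktf_test/driver=org.postgresql.Driver
ktf_test/url=jdbc:postgresql://localhost:5432/ktf_test
ktf_test/user=ktf
ktf_test/password=ktf
EOF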

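For steps 3 and 4, the folder layout for one process and one test case could be created roughly like this (using the bundled tr_category_sales sample and a hypothetical test case called simpletest):

# definition folders for the process under test (DDL, data type and sort order definitions)
mkdir -p config/test-cases/tr_category_sales/input
mkdir -p config/test-cases/tr_category_sales/output
# one folder per test case, each holding its own input dataset(s) and golden output dataset(s)
mkdir -p config/test-cases/tr_category_sales/simpletest/input
mkdir -p config/test-cases/tr_category_sales/simpletest/output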

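And for steps 6 to 8, running the suite from the command line could look roughly like this. The name of the test suite job is an assumption here; check the actual file name under repository/test in your copy.

# make the project variables available in the current shell
source ~/.bashrc
# run the test suite job with Kitchen (PDI V5)
cd /path/to/data-integration
sh kitchen.sh -file=/path/to/project/repository/test/jb_test_suite.kjb -level=Basic
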
All files related to testing (KTF) are stored in repository/test. You should not have to alter them unless you find a bug or want to modify their behaviour.


To get a better idea of how this works, look at the included examples, especially tr_category_sales (which has multiple inputs and outputs and proper sorting). The other example, tr_fact_sales, has only one input and output and no sorting defined (as it only outputs one figure).

Future improvements

The following improvements are on my to-do list:
  • Write test results to dedicated database table
  • Improvement of folder structure
  • Support for text file input and output for main processes (jobs/transformations)

FAQ

My input data sets come from more than one data source. How can I test my process with the KTF?

Your process must have parameter-driven database connections. This way you can easily point your various JNDI connections to just one test input database. The main purpose of testing is to make sure the output is as expected, not to test various input database connections. Hence, for testing, you can “reduce” your multiple input connections to one.
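As a minimal sketch (the variable names below are made up and not part of KTF): if every database connection in your process uses a variable as its JNDI name, you can point them all at the single test database in config/.kettle/kettle.properties.

# hypothetical variable names; each connection's JNDI name field would be set to ${VAR_..._JNDI}
cat >> config/.kettle/kettle.properties <<'EOF'
VAR_SOURCE_CRM_JNDI=ktf_test
VAR_SOURCE_ERP_JNDI=ktf_test
VAR_OUTPUT_DWH_JNDI=ktf_test
EOF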

Wednesday, July 17, 2013

Pentaho Kettle Parameters and Variables: Tips and Tricks


This blog post is not intended to be a formal introduction to using parameters and variables in Pentaho Kettle, but rather a practical showcase of possible usages.


Please read my previous blog post Pentaho Data Integration: Scheduling and command line arguments as an introduction on how to pass command line arguments to a Kettle job.


When I mention parameters below, I am always talking about named parameters.

Parameters and Variables

Definitions upfront

Named Parameter: “Named parameters are a system that allows you to parameterize your transformations and jobs.  On top of the variables system that was already in place prior to the introduction in version 3.2, named parameters offer the setting of a description and a default value.  That allows you in turn to list the required parameters for a job or transformation.” (Source)


Variable: “Variables can be used throughout Pentaho Data Integration, including in transformation steps and job entries. You define variables by setting them with the Set Variable step in a transformation or by setting them in the kettle.properties file. [...] The first usage (and only usage in previous Kettle versions) was to set an environment variable. Traditionally, this was accomplished by passing options to the Java Virtual Machine (JVM) with the -D option. The only problem with using environment variables is that the usage is not dynamic and problems arise if you try to use them in a dynamic way. Changes to the environment variables are visible to all software running on the virtual machine. [...] Because the scope of an environment variable is too broad, Kettle variables were introduced to provide a way to define variables that are local to the job in which the variable is set. The "Set Variable" step in a transformation allows you to specify in which job you want to set the variable's scope (i.e. parent job, grand-parent job or the root job).” (Source)
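To illustrate the -D approach mentioned in the quote, here is a minimal sketch. I am assuming that the launch scripts of your PDI version pick up extra JVM options via PENTAHO_DI_JAVA_OPTIONS; if not, you would have to add the -D option to spoon.sh/kitchen.sh directly.

# set an environment variable for the whole JVM via the -D option (assumes PENTAHO_DI_JAVA_OPTIONS is honoured by your launch script)
export PENTAHO_DI_JAVA_OPTIONS="-DMY_STATIC_VAR=some_value"
sh spoon.sh
# the more flexible alternatives discussed below: kettle.properties or the Set Variable step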

Example

Let’s walk through this very simple example of using parameters and variables. I try to explain all the jobs and transformations involved. The files are also available for download here. You can find the following files in the folder intro_to_parameters_and_variables.

jb_main.kjb

In this extremely simple job we call a subjob called jb_slave.kjb. In this case, we defined hard-coded parameter values in the job entry settings. Alternatively, to make this more dynamic, we could have just defined parameters in the job settings.

jb_slave.kjb

This subjob executes the transformations tr_set_variables.ktr and tr_show_param_values.ktr. In this case, in order to access the parameter values from the parent job, we defined the parameters without values in the job settings.
Note: This is just one of the ways you can pass parameters down to the subprocess.

tr_set_variables.ktr

This transformation sets a variable called var1 with scope “Valid in parent job” so that successive processes can make use of it. In this case the values originate from a Generate Rows step for demonstration purposes; in real-world examples you might read in some values from a file or a database table.

tr_show_param_values.ktr

The main transformation has the sole purpose of writing all the parameter and variable values to the log. We retrieve the parameters and variables by using a Get Variables step. We also check whether a value is present by using a Filter Rows step. If a value is missing, we abort the transformation; otherwise the values are written to the log.


There is no need to set the parameter names in this transformation; there is an advantage, though, if you do:
Missing parameter values will be properly displayed as NULL, which makes it a bit easier to check for them.
If you don't define them in the transformation settings, missing parameter values will be displayed as ${PARAMETERNAME}.


Important: Variables coming from tr_set_variables.ktr MUST NOT be listed in the Parameter tab in the Transformation Settings as this overrides the variable.

Making Parameters available for all subprocesses in an easy fashion

As you saw above, defining the parameters for each subprocess just to be able to pass them down can be a bit labour-intensive. Luckily, there is a faster way of doing this:


  1. In the main job specify the parameters that you want to pass in in the Job Settings.
    This way parameters and their values can be passed in from the command line, for example (see the Kitchen sketch below).
  2. Right after the Start job entry use the Set Variables job entry. Specify the variable names, reference the parameters you set up in step 1 and set the scope to Valid in the current job.
  3. There is no need to specify any parameters/variables in any of the subprocesses.


To see how this works, run jb_main.kjb in the passing_down_parameters_in_an_easy_fashion folder (part of the provided examples).
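Since the parameters are defined in the main job's Job Settings, their values can also be supplied on the command line when running the job with Kitchen. A rough sketch (the parameter names PARAM_1 and PARAM_2 are placeholders for whatever you defined in step 1):

sh kitchen.sh -file=/path/to/passing_down_parameters_in_an_easy_fashion/jb_main.kjb "-param:PARAM_1=value1" "-param:PARAM_2=value2" -level=Basic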

What if I still want to be able to run my subprocess independently sometimes?

You might have some situations when you have to run the subprocess independently (in other words, you do not execute it from the parent/main job, but run it on its own). When we pass down parameters or variables, this can be a bit tricky and usually does not work out of the box. Luckily, there is a way to achieve this:
  1. In the subprocess, specify the parameters that you want to be able to pass in. In our example (which is based on the previous example), we modified the transformation tr_show_param_values.ktr and added the following parameters to the Transformation Settings. We also amended the Get Variables step to make use of these parameters.
    This way, we can already run this transformation on its own (see the Pan sketch below). Now we only have to adjust the parent job so that we can run it from there as well.
  2. In the parent job, in the Job or Transformation job entry settings, go to the Parameters tab and tick Pass all parameter values down to the sub-transformation/sub-job. Next, as the Parameter, set the name of the parameter you defined in the subprocess. As the Value, define the variable that you want to pass down: ${variable}. This assumes that this variable was set beforehand by some Set Variables job entry/step.
    In our case, we modified the transformation job entry in the job jb_slave.kjb and added the following mapping to the job entry settings in the Parameters tab.
A sample for this setup is provided in the mulitpurpose_setup_allow_individual_execution_of_subprocesses folder.
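With the parameters now defined in the transformation itself, the subprocess can also be run on its own from the command line with Pan. A rough sketch (again, the parameter name is a placeholder for whatever you defined in the Transformation Settings):

sh pan.sh -file=/path/to/mulitpurpose_setup_allow_individual_execution_of_subprocesses/tr_show_param_values.ktr "-param:PARAM_1=value1" -level=Basic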

Closing remarks


Using parameters and variables in Kettle jobs and transformations allows you to create highly dynamic processes. I hope this tutorial shed some light on how this can be achieved.