Tuesday, August 27, 2013

Pentaho PostgreSQL Bulk Loader: How to fix a Unicode error

When using the Pentaho PostgreSQL Bulk Loader step, you might come across following error message in the log:

INFO  26-08 13:04:07,005 - PostgreSQL Bulk Loader - ERROR {0} ERROR:  invalid byte sequence for encoding "UTF8": 0xf6 0x73 0x63 0x68
INFO  26-08 13:04:07,005 - PostgreSQL Bulk Loader - ERROR {0} CONTEXT:  COPY subscriber, line 2

Now this is not a problem with Pentaho Kettle, but quite likely with the default encoding used in your Unix/Linux environment. To check which encoding is currently the default one, execute the following:

$ echo $LANG

In this case, we can clearly see it is not an UTF-8 encoding, the one which the bulk loader relies on.

So to fix this, we just set the LANG variable in example to the following:

$ export LANG=en_US.UTF-8

Note: This will only be available for the current session. Add it to ~/.bashrc or similar to have it available on startup of any future shell session.

Run the transformation again and now you will see that the process just works flawlessly.

Sunday, August 25, 2013

Advanced routing in Pentaho Kettle jobs - Part 2

Part one provided an introduction to Pentaho Kettle routing. This follow-up article will take a look at an additional example, which will retrieve an input value for the condition from a database. This input value could be in example the result of a query which checks in a database logging table if a certain job has been executed successfully:
The usage of the Evaluate rows number in a table job entry is a bit of a hack here: Our query will always return a result, so this will always be true. But this job entry allows the result set to be passed on to the next one. This saves us from writing a dedicated transformation for this purpose:
The setup of the Simple evaluation job entry is quite simple ... just reference the return field from the previous job entry:
You can download this example here.

Saturday, August 10, 2013

Going Agile: Sqitch Database Change Management

Going Agile: Sqitch Database Change Management

You have your database scripts under a dedicated version control and change management system, right? If not, I recommend doing this now.
While there have been handful of open source projects around which focus on DB script versioning and change management control, none has really gained a big momentum and a lot of them are dormant.
But there is a new player on the ground! A light at the end of the db change management tunnel - so to speak. David Wheeler has been working on Sqitch over the last year and the results are very promising indeed! Currently the github projects shows 7 other contributors, so let’s hope this project gets a strong momentum! Also a new github project for a Sqitch GUI was just founded.

Why I like Sqitch:
  • You can run all the commands from the command line and get very good feedback.
  • Everything seems quite logical and straightforward: It’s easy to get to know the few main commands  and in a very short amount of time you are familiar with the tool.
  • You can use your choice of VCS.
  • It works very well.

Supported DBs are currently MySQL, Oracle, SQLite and PostgreSQL. CUBRID support is under way.

So what do we want to achieve? So what do we want to achieve?

Bring all DDL, stored procedures etc under version control. This is what Git is very good for (or your choice of CVS).

Keep track of the (order of) changes we applied to the database, verify that they are valid, be able to revert them back to a specific state if required. Furthermore, we want to deploy these changes (up to a specific state) to our test and production databases. This is was Sqitch is intended for:

The below write-up are my notes partially mixed with David’s ones.




PostgreSQL: cpan App::Sqitch DBD::Pg (You also have to have PostgreSQL server installed)
SQLite: cpan App::Sqitch DBD::SQLite
Oracle: cpan App::Sqitch DBD::Oracle (You also have to have SQL*Plus installed)
MySQL: cpan App::Sqitch

If you want to have support for i.e. PostgreSQL and Oracle you can just run:
PostgreSQL: cpan App::Sqitch DBD::Pg DBD::Oracle

For more install options see here.
Below I will only discuss the setup for PostgreSQL.

# make sure you have build tools install (make etc) sudo apt-get install built-essential

# sqitch requires Locals support (asks for Locale::TextDomain and Locale::Messages)
sudo apt-get install libintl-perl
# make sure pg_config is installed sudo apt-get install libpq-dev # RHEL (Fedora, CentOs, etc) use instead: # sudo yum install postgresql-devel # Note where the psql include dir is located. You might be asked for it on installing sqitch pg_config --includedir # Note the version of psql. You might be asked for it on installing sqitch psql --version # install sqitch (adjust for your DB) sudo cpan App::Sqitch DBD::Pg 

Once installation is finished, check out the man page:
$ man sqitch

Within your git project directory, create a dedicated folder:
$ mkdir sqitch
$ git add .
$ cd sqitch
$ sqitch --engine pg init projectname

Let's have a look at sqitch.conf:
$ cat sqitch.conf

Now let’s add the connection details:
$ vi sqitch.conf

uncomment and specify:
[core "pg"]
client = psql
username = postgres
password = postgres
db_name = dwh
host = localhost
port = 5432
# sqitch_schema = sqitch

If psql is not in the path, run:
$ sqitch config --user core.pg.client /opt/local/pgsql/bin/psql
Add your details:
$ sqitch config --user user.name 'Diethard Steiner'
$ sqitch config --user user.email 'diethard.steiner@'

Let’s add some more config options: Define the default db so that we don’t have to type it all the time:
$ sqitch config core.pg.db_name dwh
Let's also make sure that changes are verified after deploying them:
$ sqitch config --bool deploy.verify true
$ sqitch config --bool rebase.verify true

Check details:
cat ~/.sqitch/sqitch.conf

Have a look at the plan file. The plan file defines the execution order of the changes:
$ cat sqitch.plan

$ git add .
$ git commit -am 'Initialize Sqitch configuration.'

Add your first sql script/change:
$ sqitch add create_stg_schema -n 'Add schema for all staging objects.'
Created deploy/create_stg_schema.sql
Created revert/create_stg_schema.sql
Created verify/create_stg_schema.sql

As you can see, Sqitch creates deploy, revert a verify files for you.

$ vi deploy/create_stg_schema.sql


Make sure you remove the default BEGIN; COMMIT; for this as we are just creating a schema and don’t require any transaction.

$ vi revert/create_stg_schema.sql

DROP SCHEMA staging;

$ vi verify/create_stg_schema.sql

SELECT pg_catalog.has_schema_privilege('staging', 'usage');

This is quite PostgreSQL specific. For other dbs use something like this:
SELECT 1/COUNT(*) FROM information_schema.schemata WHERE schema_name = 'staging';

Now test if you can deploy the script and revert it:

Try to deploy the changes:
The general command looks like this:
$ sqitch -d <dbname> deploy

As we have already specified a default db in the config file, we only have to run the following:
$ sqitch deploy
Adding metadata tables to dwh
Deploying changes to dwh
 + create_stg_schema .. ok

Note the plus sign in the feedback which means this change was added.

When you run deploy for the very first time, Sqitch will create maintenance tables in a dedicated schema automatically for you. These tables will (among other things) store in which “version” the DB is.

Check the current deployment status of database dwh:
$ sqitch -d dwh status
# On database dwh
# Project:  yes
# Change:   bc9068f7af60eb159e2f8cc632f84d7a93c6fca5
# Name:     create_stg_schema
# Deployed: 2013-08-07 13:01:33 +0100
# By:       Diethard Steiner <diethard.steiner@>

To verify the changes run:
$ sqitch -d dwh verify
Verifying dwh
 * create_stg_schema .. ok
Verify successful

To revert the changes the the previous state, run:
$ sqitch revert --to @HEAD^ -y

Side note
You can use @HEAD^^ to revert to two changes prior the last deployed change.

To revert everything:
$ sqitch revert
Revert all changes from dwh? [Yes] Yes
 - create_stg_schema .. ok

To revert back to a specific script (you can also revert back to a specific tag):
$ sqitch revert create_dma_schema
Revert changes to create_dma_schema from dwh? [Yes]

Let’s inspect the log:
$ sqitch log

Note that the actions we took are shown in reverse chronological order, with the revert first and then the deploy.

Now let's commit it.
$ git add .
$ git commit -m 'Added staging schema.'

Now that we have successfully deployed and reverted the current change, let’s deploy again:
$ sqitch deploy
Let’s add a tag:
$ sqitch tag v1.0.0-dev1 -n 'Tag v1.0.0-dev1.'

Deployment to target DBs
So if you want to deploy these changes to your prod DB in example, you can either do it like this:
$ sqitch -d <dbname> -u <user> -h <host> -p <port> deploy
(Important: If you are working with PostgreSQL, make sure you add your password to ~/.pgpass and then comment the password out in sqitch.conf beforehand otherwise this will not work.)
Or bundle them up, copy the bundle to your prod server and deploy it there:
$ sqitch bundle
Distribute the bundle
On the prod server:
$ cd bundle
$ sqitch -d dwhprod deploy

A future version of Sqitch will have better support for target DBs (see here).

Using Sqitch with an existing project (where some ddl already exists)

Sometimes you take over a project and want to bring the existing DDL under version control and change management.

Thanks to David for providing details on this:

The easiest option is to export the existing DDL and store it in one deploy file. For the revert file you could use a statement like this then:

   DROP $schema CASCADE;

Let’s assume we call this change “firstbigchange”:

The first time you do a deploy to the existing database with Sqitch, do it twice: once with --log-only to apply your first big change, and then, from then on, without:

   $ sqitch deploy --log-only --to firstbigchange
   $ sqitch deploy --mode change

The --long-only option has Sqitch do everything in the deploy except actually run deploy scripts. It just skips it, assumes it worked successfully, and logs it. You only want to do this --to that first big dump change, as after that you of course want Sqitch to actually run deploy scripts.

Using more than one DB

DS: Currently it seems like there is a Sqitch version for each of these dbs. What if I was working on a project that used two different dbs installed on the same server and I wanted to use Sqitch for both of them (especially for dev I have more than one db installed on the same server/pc)?

DW: You can have more than one plan and accompanying files in a single project by putting them into subdirectories. They would then be effectively separate Sqitch projects in your SCM. The sqitch.conf file at the top can be shared by them all, though, which is useful for setting up separate database info for them ([core.pg] and [core.mysql] sections, for example).

If you are starting a new project, you would do it like this:

$ sqitch --engine pg --top-dir pg init myproject
$ sqitch --top-dir mysql init myproject

Then you have sqitch.plan, deploy, revert, and verify in pg/, and sqitch.plan deploy, revert, and verify in mysql/. To add a new change, add it twice:

$ sqitch --top-dir pg add foo
$ sqitch --top-dir mysql add foo