Tuesday, August 27, 2013

Pentaho PostgreSQL Bulk Loader: How to fix a Unicode error

When using the Pentaho PostgreSQL Bulk Loader step, you might come across the following error message in the log:

INFO  26-08 13:04:07,005 - PostgreSQL Bulk Loader - ERROR {0} ERROR:  invalid byte sequence for encoding "UTF8": 0xf6 0x73 0x63 0x68
INFO  26-08 13:04:07,005 - PostgreSQL Bulk Loader - ERROR {0} CONTEXT:  COPY subscriber, line 2
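
To get a feeling for what the message is saying, it can help to look at the four bytes it reports. The following is only a sketch and assumes that iconv and od are installed; the guess that the data is ISO-8859-1 (where 0xf6 is "ö") is an assumption, not something the log confirms.

```shell
# The four bytes from the log message, with 0xf6 written in octal (\366).
bytes='\366sch'

# Validating them as UTF-8 fails, which is what the COPY complained about.
if printf "$bytes" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  utf8_status=valid
else
  utf8_status=invalid
fi
echo "as UTF-8: $utf8_status"

# Reading them as ISO-8859-1 and converting to UTF-8 works:
# the single byte 0xf6 becomes the two-byte sequence c3 b6.
latin1_hex=$(printf "$bytes" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "as ISO-8859-1, re-encoded to UTF-8: $latin1_hex"
```

So the bytes are most likely ordinary text in a single-byte encoding that is being passed on as if it were UTF-8.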


Now this is not a problem with Pentaho Kettle, but quite likely with the default encoding used in your Unix/Linux environment. To check which encoding is currently the default one, execute the following:


$ echo $LANG
en_US


In this case, we can clearly see that it is not a UTF-8 locale, which is what the bulk loader relies on.
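
If you want to make this check part of a script, a small helper like the one below could do it. Note that is_utf8_locale is a hypothetical name made up for this sketch, and it only inspects the text of the value; LC_ALL and LC_CTYPE take precedence over LANG when they are set, so running locale charmap is the more reliable way to see what the session actually resolves to.

```shell
# Hypothetical helper: succeeds if a locale name such as "en_US.UTF-8"
# carries a UTF-8 codeset suffix (matched case-insensitively, with or
# without the dash).
is_utf8_locale() {
  case "$1" in
    *.[Uu][Tt][Ff]-8*|*.[Uu][Tt][Ff]8*) return 0 ;;
    *) return 1 ;;
  esac
}

# Compare the value from above with the one we want to end up with.
for value in en_US en_US.UTF-8; do
  if is_utf8_locale "$value"; then
    echo "$value: UTF-8"
  else
    echo "$value: not UTF-8"
  fi
done
```

In a real script you would pass "$LANG" instead of the two sample values.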


So to fix this, we just set the LANG variable to a UTF-8 locale, for example:


$ export LANG=en_US.UTF-8


Note: This setting only applies to the current session. Add it to ~/.bashrc or similar to have it available on startup of any future shell session.
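
One way to add the line without ending up with duplicates when the command is run more than once is sketched below. add_line_once is a hypothetical helper name, and ~/.bashrc is just the file mentioned above; your shell may read a different startup file.

```shell
# Hypothetical helper: append a line to a file unless the file already
# contains exactly that line.
add_line_once() {
  line=$1
  file=$2
  # -x matches whole lines only, -F treats the pattern as a fixed string.
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (uncomment to apply it to your own startup file):
# add_line_once 'export LANG=en_US.UTF-8' "$HOME/.bashrc"
```

The locale itself has to be installed on the machine; locale -a lists the ones that are available.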

Run the transformation again from a shell with the new setting and you should see that the bulk load now completes without the encoding error.

2 comments:

  1. Hi Diethard, thank you for your nice articles!
    They are very helpful!
    I have one question for you; I hope you will have time to answer.
    I have checked your articles and have seen that they are kind of switching from TALEND to PENTAHO. Does that mean that you switched from TALEND to PENTAHO too, and if so, what is the reason?
    Your opinion about these tools is very interesting.

    Thank you in advance!
    Evghenii

    Replies
    1. I've been working on various projects for different clients. Sometimes the client sets the requirement to use a certain ETL tool (e.g. because they have been using it in-house for a long time), and in some cases I can choose the ETL tool (the right tool for the job). There should be some resources online which compare various ETL tools and can give you a good overview.
