Diethard Steiner on Business Intelligence: September 2009

Tuesday, September 29, 2009

Pentaho Report Designer 3.5: Not mand...

How to set up not mandatory parameters

When creating parameters in Pentaho Report Designer 3.5, you can definie if the paramater is mandatory or not.

Imagine we have a query like this one to feed our report:

SELECT

FROM

table

WHERE

channel = ${channel} AND

media = ${media}

;

I define two parameters for this report: channel and media. The end user will be able to choose a value for these parameters once they call the report. In the query you define the parameter like this: ${parameter_name}. This is fairly easy so far, but how do we configure the report if we want to allow the users to choose either one media value or non (=showing all media types)?

This is not difficult to implement, if you know how to do it. Basically, you have to change your query to this:

SELECT

FROM

table

WHERE

channel = ${channel} AND

(media = ${media} OR 0 = IFNULL(${media},0)

;

Basically, if the user doesn't choose any value for media, the media value will be set to NULL (as in "no value"). So I changed the query to include "(media = ${media} OR 0 = IFNULL(${media},0)", which basically says: If there is a media value, apply it, if not, return all the media types (=no filter). So depending on which part is TRUE, the query result will change!

Now this is a fairly wonderful way to implement this. The only problems are that the Null value/No filter will be only available when you call the report in the first go, once you changed the Media to "TV" in example, the drop down menu (or whatever display type you chose) doens't offer a NULL value any more. Also, it would be great to be able to rename the NULL value for the end user individually, in example to "All Media Types".

Monday, September 28, 2009

Pentaho Kettle: Using Full Outer Join

Today I came across a special use case for an outer join. I was planning to set up an ETL process that imports yesterday's revenue summary by summarizing the input of two tables.

The main problem was, that we might not have made any revenue yesterday at all, or there might be only revenue in one table but not the other.

Basically, if revenue figures exist in both tables, the process would be following:

- input data from table 1: date, service name, revenue_cars

- input data from table 2: date, service name, revenue_carparts

- join the tables, so that the output looks like: date, service name, revenue_cars, revenue_carparts

The problem is, as mentioned, that not every day data will be available in both tables. So I would have to implement a logic like this one:

IF input data table 1 NOT EMPTY AND input data table 2 EMPTY THEN output only data of table 1.

IF input data table 2 NOT EMPTY AND input data table 1 EMPTY THEN output only data of table 2.

IF input data table 1 NOT EMPTY AND input data table 2 NOT EMPTY THEN JOIN everything and output it.

All this can be achieved by using a FULL OUTER JOIN.

So your process in Kettle should be as follows:

Drag and drop two "table input" steps into the working area and define the settings. Our query looks like this:For table1:
SELECT
date,
service_name,
COUNT(*) AS count
FROM
table1
GROUP BY 1,2
For table2:
SELECT
date,
service_name,
COUNT(*) AS count,
SUM(rev) AS rev_sum
FROM
table2
GROUP BY 1,2

For each of them add a sort step and connect them with the table input. Make sure you sort the relevant fields in ascending order (in our case this is date and service_name).

Drag and drop a Join step on the working area, connect them with the two sort steps and then define the two sort steps as input.

In this screenshot you can also see a Group By Step. This one is not in use at all. But in case you want to check if there is any output of this step, the group by step has an option called "Always give back a result row", which, when ticked, will set the aggregation to 0 (Zero). This might be useful in some cases. This is different to the Detect Empty Stream step, which only gives back a completely empty row in any case.

Ok, so now we have the FULL OUTER JOIN working. Have a look at the output by exporting it into a text or Excel file. You will see that in case both result sets didn't match, the number fields (count and rev_sum) are blank. As we want to use these fields later on in calculations, we have to assign a default value in case the number field is null.

Insert a If field value is null step, connect it to the previous step and double click on it afterwards. Tick "Select fields" and in the fields table add count and rev_sum and assign a default value of 0 (Zero) to them.

Another thing that you will also have recognized by looking at the output is that we have now two additional fields: service_name_1 and date_1.

So now we have to do a check. In case there are matching values in both results, both

date and data_1 and

service_name and service_name_1

will be populated. If it is not in both, only one of them will be populated. So we are going to create a new variable called service_name_new and date_new that are going to replace the old ones in a "Modified Java Script Value" Step. Drag and drop this step on the working area and connect it with the Merge Join Step. Douple click on the Modified Java Script Value Step and insert following script plus tick "Compatibility Mode":

if (service_name.isNull()) {

var service_name_new=service_name_1;

}

else

{

var service_name_new=service_name;

}

if (date.isNull()) {

var date_new=date_1;

}

else

{

var date_new=date;

}

Add service_name_new and date_new as output fields and define the type. Then add a Select values Step and add all the fields, but not service_name_1, service_name, date, date_1. Add another step to export it into Excel in example. Our whole process should look like this now (please ignore that some connection have the copy sign, this will not be the case with your process):

In this small tutorial we have now set up a nicely working FULL OUTER JOIN. I hope that you can implement this in your ETL processes now without any problems.

Saturday, September 26, 2009

Full Review of "Pentaho Reporting 3.5 for Java Developers"

This is the first book available that describes the functionality of the Pentaho Report Designer in detail. Will Gorman mainly focused on integrating the reports in Java applications and not so much on using it in combination with the Pentaho BI Server.
Overall, it is an excellent book. It guides you through various tutorials on how to create more complex reports. All the tutorials are fairly easy to follow. Some chapters of the book give a very detailled description of the available functionality and settings, which serves as an excellent reference.
Conclusion: I really enjoyed reading this book. From my point of view Will Gorman provides an interesting overview of Pentaho Reporting. If you are one of those users that are not familiar with Java and just want to get a general overview of Pentaho Reporting, then this book will also be a good ressource as well (just skip some of the pages then that describe the Java integration).
I hope that there will be a Part II of this book, as I certainly would have liked to know more about adding OLAP and Metadata sources to the report and parameterizing them.

Wednesday, September 23, 2009

New books arrived

After coming back from my business trip to South Africa I found a package of new books on my office desk. I finally received a copy of "Pentaho Reporting 3.5 for Java Developers", "Pentaho Solutions" and "The Data Warehouse Toolkit - The Complete Guide To Dimensional Modeling".
I think I am going to start with the Pentaho Reporting book, so that I can post a review here quite soon. It was a tough choice, as all the books sound very interesting.

Thursday, September 17, 2009

Review "Pentaho Reporting 3.5 for Java Developers" Books Chapter 6

Packt Publishing Ltd sent me yesterday the chapter 6 "Including Charts and Graphics in Reports" of the recently published "Pentaho Reporting 3.5 for Java Developers" Book to review. Excited as I was, I had to read it on the same day!

"Pentaho Reporting 3.5 for Java Developers" was written by Will Gorman, a long time Pentaho developer. I found chapter 6 easy to read, so I would say it is ideal for users that are new to Pentaho Reporting as well as those who have already a bit of experience with it. It's also great that the book covers the very latest version of the software, which brings a lot of major improvements.

The book is not only a valuable reference but offers easy-to-follow examples as well. It is definitely worth the investment, as it will save you hours of searching for the right info in forums, blogs and the wiki, which can sometimes be a bit frustrating.

Now let's have a look at chapter 6: Will kicks off by introducing all the different types of data sources (category, x-y, x-y-z and times series datasets, ...) that can be used in Pentaho Reporting and offers nice examples.

Next, all the chart rendering properties are discussed in detail, which serves as a great reference. This is followed by an introduction to all the various chart types that are available in Pentaho Reporting. In the first example you create a new report and load data from a HSQLDB, which you use as a data source for an area chart. Will speaks you through the chart properties that have to be set.

Bar, line, pie, ring, multipie, waterfall, bar-line, bubble, scatter, xy, scatter and radar charts are discussed next. Included is also a very nice example of a pie chart that resets itself with each group header (very useful!). The bubble chart, which uses 3 dimensions, is also very interesting.

The last page of this chapter describes how to insert static and dynamic images into your report.

Overall, I found this chapter very easy to read. It was a great idea to include examples as well and not only to concentrate on a description of the properties. All in all, I can strongly recommend this book based on this chapter and I am looking forward to reviewing the whole book, which should arrive in my post box next week. By the way, you can already buy the book here.

Wednesday, September 16, 2009

Review coming soon ...

Today I received a nice email from Packt Publishing Ltd asking me if I wanted to write a review about the new Pentaho Reporting 3.5 for Java Developers Book.
I've quite some books to read currently about OLAP and Data Warehouse Design, but I'll try to speed up to provide a review about this exciting new book in the next few months. So stay tuned!

Saturday, September 12, 2009

New Books on Pentaho

There hasn't been much literature around about Pentaho, at least not any that I know of. So far most people would search for info in the pentaho forum or wiki. But recently things have changed. There is now whole book about setting up Pentaho with MySQL called "Pentaho Solutions", which you can order from Amazon. This one tries to cover basically every component of Pentaho, which should be a great support for first time users.
Another book was published recently as well, called "Pentaho Reporting 3.5 for Java Developers" by Will Gorman, a long term Pentaho developer. This book will provide you everything you need to know about Pentaho Reporting 3.5. I've already ordered this books and I am quite looking forward to read them!