Sunday, May 29, 2011

Pentaho Report Designer: Creating a Bar-Line chart with secondary axis


Pentaho Report Designer: Creating a Bar-Line Chart with Secondary Axis in Different Format

Tutorial Details

  • Software: PRD 3.8 (download here)
  • Knowledge: Intermediate (To follow this tutorial you should have a good knowledge of the software, hence not every single step will be described)
  • Tutorial files can be downloaded here

Pentaho Report Designer offers a wide variety of chart types. Today we will look at the bar-line chart type. We will create a secondary axis and apply a different format to it. Our final chart will look like the one below (this chart was created in two minutes; spend two more and it will look even better):
I will not explain everything step by step, but only focus on how to set up a chart based on data similar to that shown below:

As you can see, the data set consists of impressions and click-through rate (CTR) broken down by option. In our chart we want to display the impressions as bars and the CTR as a line. The only complexity here is that the CTR should be shown as a percentage, whereas the impressions should be shown as standard integer values. So how does this work?


The most important point is that, although in this case we only have one data set, we have to define two data sources for the chart, because our measures have different formats.

First we will define the settings for the bars (I only mention the very important points here):

  • In the Primary DataSource settings we only define [impressions] in the value-columns 
  • The format can be defined with the y-tick-fmt-str

Thereafter, we click on Secondary DataSource:

  • We define [CTR] for the value-columns
  • Now we would expect something like y2-tick-fmt-str, but strangely enough it doesn't exist. Instead, the format for the line has to be set in line-tick-fmt:
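For illustration only: as far as I can tell these properties take standard Java DecimalFormat-style patterns, so the two format settings could look roughly like this (the exact patterns are just an assumption, adjust them to your data):

  y-tick-fmt-str  (Primary DataSource, impressions):  #,##0
  line-tick-fmt   (Secondary DataSource, CTR):         0.00%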

Provide all the other necessary details in the settings dialog and then the chart should be displayed perfectly.

Thursday, May 26, 2011

Kettle: Sourcing data from Hadoop Hive

Pentaho Data Integration (Kettle): Sourcing data from Hadoop Hive

Tutorial Details


  • Software: PDI/Kettle 4.1 (download here)
  • Knowledge: Beginner


Has your company recently started using Hadoop to cope with enormous amounts of data? Have you been using Kettle so far for your ETL? As you are probably aware, with the Kettle Enterprise Edition you can now create map-reduce jobs for Hadoop. If you want to stay with the open source version, the good news is that it's very simple to connect to Hive, a database which can be set up on top of Hadoop.

If you have one of the latest versions of Kettle installed, you will see that it already comes with the required Hive driver. Hence, setting up a connection to Hive is straightforward.

Create a new database connection

  1. Create a new transformation
  2. Click the View tab on the right hand side
  3. Right click on Database connections and choose New

Alternatively, you can also activate the Design tab and drag and drop a Table input step on the canvas, open it and click on New to create a new database connection.

In the database settings window choose Generic database from the available databases.

For the connection URL insert the following:

jdbc:hive://youramazonec2url:10000

For the driver specify:
org.apache.hadoop.hive.jdbc.HiveDriver

Depending on your setup you might have to provide other details as well.
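To give you a rough idea, the relevant fields of the Generic database dialog would be filled in along these lines (field labels may differ slightly between versions, and the host name is of course just a placeholder):

  Connection Type:           Generic database
  Custom Connection URL:     jdbc:hive://youramazonec2url:10000
  Custom Driver Class Name:  org.apache.hadoop.hive.jdbc.HiveDriver
  User Name / Password:      depends on how your Hive server is secured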

Click Test and all should be working now.
If you want to use any specific settings or user-defined functions (UDFs), you can register them as follows (see the example after this list):

  1. In the database settings window, click on Advanced in the left hand pane.
  2. Insert these statements in the field entitled Enter the SQL statements (separated by ;) to execute right after connecting
  3. Click Test again to check if a connection can be created with these specific settings
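To illustrate, the statements you enter in that field are ordinary Hive statements separated by semicolons. The jar path, class and property below are purely hypothetical and only show the general idea:

  -- register a custom UDF from a jar (hypothetical path and class)
  add jar /path/to/my_udfs.jar;
  create temporary function my_udf as 'com.example.hive.udf.MyUdf';
  -- set a session-level Hive property
  set hive.exec.parallel=true;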

Everything is set up now for querying Hive from Kettle. Enjoy!

Monday, May 23, 2011

Tutorial Aggregated Tables for Mondrian (Web)

Tutorial Aggregated Tables for Mondrian
(Republished Tutorial - originally published July 2009)

I started using the Pentaho BI Suite in early 2009. One of my main objectives was to create a multidimensional model for DSS (Decision Support Systems) and analysis. As the BI Server already comes with Mondrian, it is fairly easy to get started.
I created my first cube definition with the Schema Workbench. The cube definition can be saved as an XML file, which you can then export to your solution folder on the BI Server.
Soon we realized the need for aggregated tables, as some MDX queries took rather long. I had a look at Pentaho Aggregation Designer 1.1 and thought that this would be a rather easy task.
One thing the Aggregation Designer does is check whether your schema (the cube definition XML file) is valid; it also checks whether your tables are all properly set up, which is very useful, as it shows you any errors so you can correct them.
I used the Advisor function within the Aggregation Designer, created all the recommended tables with the DDL and DML (which the Aggregation Designer generates automatically) and exported the new cube schema as well.
But nothing worked. The problem was that I had assumed the Aggregation Designer would set up everything, but I was quite wrong about that (at the time of writing, Aggregation Designer 1.1 was available), or maybe I was just doing something wrong. After some unsuccessful attempts I decided not to use the Aggregation Designer, but to create everything from scratch. I am not saying that you shouldn't use the Aggregation Designer; it is definitely worth a look. I am just describing the approach that worked for me. I am also assuming that you use the standard configuration of Mondrian.

Requirements

This article assumes that you are familiar with
  • MDX
  • MySQL (or SQL in general)
  • Pentaho BI Suite
  • Pentaho Schema Workbench
  • Pentaho Design Studio

Know what you are up to

While creating aggregated tables it makes sense to test whether they are actually used for the purpose you created them for. You might have an MDX query that takes quite long to execute and therefore want to create a dedicated aggregated table for it. Have a look at mondrian_sql.log to see which SQL query Mondrian actually generates for your MDX.
Your aggregated table will look quite similar to it.
When you execute the MDX queries, always have a look at these logs in the beginning; you will learn a lot!

How to enable MDX and SQL logging on BI Server 3

Whenever a MDX query is executed, Mondrian transforms it into a SQL query. In some cases you want to see what the resulting SQL query is, for example to check if Mondrian is using aggregated tables.

Go to \tomcat\webapps\pentaho\WEB-INF\classes and change log4j.xml. Just uncomment the last half of the file where it says:
  
   
   <!-- ========================================================= -->
   <!-- Special Log File specifically for Mondrian                -->
   <!-- ========================================================= -->


   <appender name="MONDRIAN" class="org.apache.log4j.RollingFileAppender">
     <param name="File" value="mondrian.log"/>
     <param name="Append" value="false"/>
     <param name="MaxFileSize" value="500KB"/>
     <param name="MaxBackupIndex" value="1"/>

     <layout class="org.apache.log4j.PatternLayout">
       <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
     </layout>
   </appender>
   
   <category name="mondrian">
      <priority value="DEBUG"/>
      <appender-ref ref="MONDRIAN"/>
   </category> 
   


   <!-- ========================================================= -->
   <!-- Special Log File specifically for Mondrian MDX Statements -->
   <!-- ========================================================= -->

   <appender name="MDXLOG" class="org.apache.log4j.RollingFileAppender">
     <param name="File" value="mondrian_mdx.log"/>
     <param name="Append" value="false"/>
     <param name="MaxFileSize" value="500KB"/>
     <param name="MaxBackupIndex" value="1"/>
     <layout class="org.apache.log4j.PatternLayout">
       <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
     </layout>
   </appender>

   <category name="mondrian.mdx">
      <priority value="DEBUG"/>
      <appender-ref ref="MDXLOG"/>
   </category>


   <!-- ========================================================= -->
   <!-- Special Log File specifically for Mondrian SQL Statements -->
   <!-- ========================================================= -->


   <appender name="SQLLOG" class="org.apache.log4j.RollingFileAppender">
     <param name="File" value="mondrian_sql.log"/>
     <param name="Append" value="false"/>
     <param name="MaxFileSize" value="500KB"/>
     <param name="MaxBackupIndex" value="1"/>
     <layout class="org.apache.log4j.PatternLayout">
       <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
     </layout>
   </appender>

   <category name="mondrian.sql">
      <priority value="DEBUG"/>
      <appender-ref ref="SQLLOG"/>
   </category>

If you want the SQL to be displayed in a more readable form, add this to the mondrian.properties in pentaho-solutions\system\mondrian (BI Server 3):

mondrian.rolap.generate.formatted.sql=true

Restart the server, then start JPivot, Analyzer or a similar OLAP client and perform an analysis. In the Tomcat/bin directory you will now find the following new log files: mondrian.log, mondrian_sql.log and mondrian_mdx.log.
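A handy way to follow these logs while you run your MDX queries is to tail them, for example from the BI Server directory:

  tail -f tomcat/bin/mondrian_sql.log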

How to configure Mondrian to recognize aggregated tables

Well, you also have to tell Mondrian that aggregated tables exist. The first step is to add the following lines to mondrian.properties, which you can find in pentaho-solutions\system\mondrian (BI Server 3):

# should Mondrian use aggregate tables?
mondrian.rolap.aggregates.Use=true
#  should Mondrian search for aggregate tables using the standard prefix rules?
mondrian.rolap.aggregates.Read=true




Restart the server.
Furthermore, with the standard setup the table and column names have to follow a naming convention if they are to be auto-detected (this is enabled by the second property shown above). In that case you usually do not have to define the aggregate tables in the Mondrian schema. (I've copied a bit here from Julian Hyde's excellent documentation):

Rules for Aggregate Auto-Detection

Table Names

As an example of applying the aggregate table name rule, let the fact table be called sales_fact_1997, the Sales cube's fact table from the FoodMart schema. Applying the specific fact table name to the regular expression template creates the following regular expression:

agg_.+_sales_fact_1997

This will match the following table names:
  • agg_l_05_sales_fact_1997
  • agg_c_14_sales_fact_1997
  • agg_lc_100_sales_fact_1997
  • agg_c_special_sales_fact_1997
  • AGG_45_SALES_FACT_1997
  • AGG_drop_time_id_sales_fact_1997
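Applied to the fact table used later in this tutorial, sn_login, the template becomes:

agg_.+_sn_login

so table names such as agg_channel_week_sn_login or agg_week_sn_login (which we will create below) are picked up automatically.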

Column Names

  • There has to be one column called "fact_count", which stores the value of a plain count. (This is a general requirement for aggregate tables; it is not specific to auto-detection.)
  • If there are any ID columns in the aggregated table, they have to have the same name as in the fact or dimension table.
  • Level columns
    • ${usage_prefix}${level_column_name}: If you have a level that is used more than once, e.g. Year in both a week hierarchy and a month hierarchy within the time dimension, set a "usagePrefix" in the schema so that they can be distinguished, e.g. the usage prefix "TW_" for the year in the week hierarchy (results in "TW_Year") and "TM_" for the year in the month hierarchy (results in "TM_Year").
    • ${hierarchy_name}_${level_name}
    • ${hierarchy_name}_${level_column_name}
    • ${level_column_name}

If any of these parameters have space characters, ' ', these are mapped to underscore characters, '_', and, similarly, dot characters, '.', are also mapped to underscores. So, if the hierarchy_name is "Time", level_name is "Month" and level_column_name is month_of_year, the possible aggregate table column names are:
  • time_month
  • time_month_of_year
  • month_of_year
For this rule, the "hierarchy_name" and "level_name" are converted to lower case while the "level_column_name" must match exactly.
Make sure the column names are lower case!
  • There are three patterns for matching aggregate columns to measures:
    1. ${measure_name}: the fact table's cube's measure name, "measure_name",
    2. ${measure_column_name}: the fact table's cube's measure column name, "measure_column_name", and
    3. ${measure_column_name}_${aggregate_name}: the measure column name combined with the measure's aggregator name (sum, avg, max, etc.), "aggregate_name",

where the measure name is converted to lower case and both the measure column name and aggregate name are matched as they appear. If the fact table's cube's measure name is "Avg Unit Sales", the measure column name is "unit_sales", and the measure's aggregator name is "avg", then possible aggregate table column names that would match are:
  • avg_unit_sales
  • unit_sales
  • unit_sales_avg

For example: a measure called "Distinct Users" has to be referenced as "distinct_users".

Your first aggregated tables

We have the following tables:

The fact table:

DROP TABLE IF EXISTS `pentaho`.`sn_login`;
CREATE TABLE  `pentaho`.`sn_login` (
  `msisdn` bigint(20) NOT NULL DEFAULT '0',
  `login_date_id` bigint(20) NOT NULL DEFAULT '0',
  `login_channel_id` bigint(20) NOT NULL DEFAULT '6' COMMENT 'Default Value Unknown',
  `sessions` int(255) NOT NULL DEFAULT '0',
  PRIMARY KEY (`msisdn`,`login_date_id`,`login_channel_id`),
  KEY `FK_sn_login_date_id` (`login_date_id`),
  KEY `FK_login_channel_id` (`login_channel_id`),
  KEY `login_msisdn` (`msisdn`),
  CONSTRAINT `FK_login_channel_id` FOREIGN KEY (`login_channel_id`) REFERENCES `sn_lt_communication_channels` (`communication_channel_id`) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `FK_login_msisdn` FOREIGN KEY (`msisdn`) REFERENCES `sn_user` (`msisdn`),
  CONSTRAINT `FK_sn_login_date_id` FOREIGN KEY (`login_date_id`) REFERENCES `sn_lt_time_dimension` (`time_dimension_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The time dimension table:

DROP TABLE IF EXISTS `pentaho`.`sn_lt_time_dimension`;
CREATE TABLE  `pentaho`.`sn_lt_time_dimension` (
  `the_year` int(11) DEFAULT NULL,
  `the_week_of_the_year` int(11) DEFAULT NULL,
  `the_month` int(11) DEFAULT NULL,
  `the_day_of_the_year` int(11) DEFAULT NULL,
  `the_day_of_the_week` int(11) DEFAULT NULL,
  `the_day_of_the_month` int(11) DEFAULT NULL,
  `the_month_name` varchar(9) DEFAULT NULL,
  `tenure_years` int(11) DEFAULT NULL,
  `tenure_days` int(11) DEFAULT NULL,
  `time_dimension_id` bigint(20) NOT NULL DEFAULT '0',
  `the_date` date DEFAULT NULL,
  `the_date_as_string` varchar(20) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  `the_quarter` int(3) DEFAULT NULL,
  PRIMARY KEY (`time_dimension_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


The channel dimension table:

DROP TABLE IF EXISTS `pentaho`.`sn_lt_communication_channels`;
CREATE TABLE  `pentaho`.`sn_lt_communication_channels` (
  `communication_channel_id` bigint(20) NOT NULL AUTO_INCREMENT,
  `channel_name` varchar(45) NOT NULL,
  `channel_system_name` varchar(45) NOT NULL,
  PRIMARY KEY (`communication_channel_id`)
) ENGINE=InnoDB AUTO_INCREMENT=9 DEFAULT CHARSET=latin1;

The user dimension table:

DROP TABLE IF EXISTS `pentaho`.`sn_user`;
CREATE TABLE  `pentaho`.`sn_user` (
  `msisdn` bigint(20) NOT NULL DEFAULT '0',
  `sex` char(1) DEFAULT NULL,
  `registration_date_id` bigint(20) NOT NULL DEFAULT '0',
  `country_id` bigint(20) NOT NULL DEFAULT '0',
  `birthday_date_id` bigint(20) NOT NULL,
  `registration_channel_id` bigint(20) NOT NULL,
  PRIMARY KEY (`msisdn`),
  KEY `msisdn` (`msisdn`),
  KEY `FK_birthday_date_id` (`birthday_date_id`),
  KEY `FK_registration_date_id` (`registration_date_id`),
  KEY `FK_country_id_2` (`country_id`),
  KEY `FK_registration_channel_id` (`registration_channel_id`),
  CONSTRAINT `FK_birthday_date_id` FOREIGN KEY (`birthday_date_id`) REFERENCES `sn_lt_time_dimension` (`time_dimension_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
  CONSTRAINT `FK_country_id_2` FOREIGN KEY (`country_id`) REFERENCES `sn_lt_countries` (`country_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
  CONSTRAINT `FK_registration_channel_id` FOREIGN KEY (`registration_channel_id`) REFERENCES `sn_lt_communication_channels` (`communication_channel_id`) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `FK_registration_date_id` FOREIGN KEY (`registration_date_id`) REFERENCES `sn_lt_time_dimension` (`time_dimension_id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The country dimension table:

DROP TABLE IF EXISTS `pentaho`.`sn_lt_countries`;
CREATE TABLE  `pentaho`.`sn_lt_countries` (
  `country_id` bigint(20) NOT NULL DEFAULT '0',
  `country_name` tinytext,
  `country_number` double DEFAULT NULL,
  `iso_country_code` tinytext,
  PRIMARY KEY (`country_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


We have created an XML schema (cube definition) that looks like this:

<Schema name="Test">
<Dimension type="TimeDimension" usagePrefix="TM_" name="Time Monthly">
   <Hierarchy name="All Times Monthly" hasAll="true" allMemberName="All Times Monthly" allMemberCaption="All Times Monthly" primaryKey="time_dimension_id">
       <Table name="sn_lt_time_dimension">
       </Table>
       <Level name="Year" column="the_year" type="Integer" uniqueMembers="true" levelType="TimeYears" hideMemberIf="Never">
       </Level>
       <Level name="Quarter" column="the_quarter" type="Integer" uniqueMembers="false" levelType="TimeQuarters" hideMemberIf="Never">
       </Level>
       <Level name="Month" column="the_month" nameColumn="the_month_name" type="String" uniqueMembers="false" levelType="TimeMonths" hideMemberIf="Never">
       </Level>
       <Level name="Day" column="the_day_of_the_month" type="Integer" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">
       </Level>
   </Hierarchy>
</Dimension>
<Dimension type="StandardDimension" name="Country">
   <Hierarchy name="All Countries" hasAll="true" allMemberName="All Countries" allMemberCaption="All Countries" primaryKey="country_id">
       <Table name="sn_lt_countries">
       </Table>
       <Level name="Country" column="country_name" type="String" uniqueMembers="false" levelType="Regular" hideMemberIf="Never">
       </Level>
   </Hierarchy>
</Dimension>
<Dimension type="TimeDimension" usagePrefix="TW_" name="Time Weekly">
   <Hierarchy name="All Times Weekly" hasAll="true" allMemberName="All Times Weekly" allMemberCaption="All Times Weekly" primaryKey="time_dimension_id">
       <Table name="sn_lt_time_dimension">
       </Table>
       <Level name="Year" column="the_year" type="Integer" uniqueMembers="true" levelType="TimeYears" hideMemberIf="Never">
       </Level>
       <Level name="Week" column="the_week_of_the_year" type="Integer" uniqueMembers="false" levelType="TimeWeeks" hideMemberIf="Never">
       </Level>
       <Level name="Day" column="the_day_of_the_week" type="Numeric" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">
       </Level>
   </Hierarchy>
</Dimension>
<Cube name="Users" cache="true" enabled="true">
   <Table name="sn_user">
   </Table>
   <Dimension type="StandardDimension" name="Gender">
       <Hierarchy name="All Genders" hasAll="true" allMemberName="All Genders" allMemberCaption="All Genders">
           <Level name="Gender" column="sex" nameColumn="sex" type="String" uniqueMembers="true" levelType="Regular" hideMemberIf="Never">
           </Level>
       </Hierarchy>
   </Dimension>
   <Dimension type="StandardDimension" foreignKey="registration_channel_id" name="Registration Channel">
       <Hierarchy name="All Registration Channels" hasAll="true" allMemberName="All Registration Channels" allMemberCaption="All Registration Channels" primaryKey="communication_channel_id">
           <Table name="sn_lt_communication_channels">
           </Table>
           <Level name="Registration Channel" column="channel_name" type="String" uniqueMembers="false" levelType="Regular" hideMemberIf="Never">
           </Level>
       </Hierarchy>
   </Dimension>
   <Dimension type="StandardDimension" foreignKey="birthday_date_id" name="Age">
       <Hierarchy name="All Ages" hasAll="true" allMemberName="All Ages" allMemberCaption="All Ages" primaryKey="time_dimension_id">
           <Table name="sn_lt_time_dimension">
           </Table>
           <Level name="Age" column="age" type="Integer" uniqueMembers="false" levelType="Regular" hideMemberIf="Never">
           </Level>
       </Hierarchy>
   </Dimension>
   <DimensionUsage source="Time Monthly" name="Registration Time Monthly" caption="Registration Time Monthly" foreignKey="registration_date_id">
   </DimensionUsage>
   <DimensionUsage source="Country" name="Country" caption="Country" foreignKey="country_id">
   </DimensionUsage>
   <DimensionUsage source="Time Weekly" name="Registration Time Weekly" caption="Registration Time Weekly" foreignKey="registration_date_id">
   </DimensionUsage>
   <Measure name="New Registrations" column="msisdn" datatype="Numeric" aggregator="count" visible="true">
   </Measure>
</Cube>

</Schema>

We are going to pay attention to three dimensions:
  • Country: You can see that this dimension is derived from joining three tables: sn_login, sn_user and sn_lt_countries. In fact, our schema is a snowflake schema.
  • Login Channel: This is a simple dimension which is derived from just one join: sn_login and sn_lt_communication_channels.
  • Login Time: This dimension references a global dimension. Within a Cube Definition, a global dimension can be referenced within several cubes in the same Schema. This is very handy, as you don't have to recreate the same dimension over and over again.

Aggregated table for login channel

We basically want to see how many users log in per week in each of the channels.

The MDX query therefore looks like:

WITH 
MEMBER [Measures].[Prior Period Distinct Logins] AS
(
[Login Time Weekly].CurrentMember.PrevMember,
[Measures].[Distinct Users]
)
MEMBER [Measures].[Change in Distinct Logins] AS
(
[Measures].[Distinct Users] - [Measures].[Prior Period Distinct Logins]
)
MEMBER [Measures].[Percent Change] AS
(
([Measures].[Change in Distinct Logins]) / 
([Measures].[Prior Period Distinct Logins])
)
 
, FORMAT_STRING = Iif(
([Measures].[Percent Change] < 0),
"|#.00%|style='red'|arrow='down'",
"|#.00%|style='green'|arrow='up'"
)
SELECT 
NON EMPTY { 

LastPeriods(
6,
[Login Time Weekly].[{week1_year}].[{week1_weeknr}]
)


ON COLUMNS,
NON EMPTY Union (
Crossjoin (
{[Login Channel].[All Login Channels]},
{[Measures].[Distinct Users],
[Measures].[Percent Change]
}
),
Crossjoin (
{[Login Channel].[All Login Channels].Children},
{[Measures].[Distinct Users]
}
)
)
ON ROWS
from [Logins]

We execute the query via JPivot, Analyzer or Saiku. Now we open mondrian_sql.log and see that Mondrian creates a lot of queries. Analyse these queries carefully.

Now, having looked at the queries in mondrian_sql.log, I decided to create the following aggregated tables:

-- channel / week -------------------------------

DROP TABLE IF EXISTS `agg_channel_week_sn_login`;

CREATE TABLE `agg_channel_week_sn_login` (
`login_channel_id` BIGINT(20),
`TW_the_year` INTEGER(11),
`the_week_of_the_year` INTEGER(11),
`distinct_users` INTEGER,
`sessions` DOUBLE,
`fact_count` INTEGER,
KEY `agg_channel_week_login_channel_id` (`login_channel_id`),
KEY `agg_channel_week_TW_the_year` (`TW_the_year`),
KEY `agg_channel_week_the_week_of_the_year` (`the_week_of_the_year`)    
);

INSERT INTO `agg_channel_week_sn_login` (
`login_channel_id`,
`TW_the_year`,
`the_week_of_the_year`,
`distinct_users`,
`sessions`,
`fact_count`)
select 
`sn_login`.`login_channel_id` as `login_channel_id`, 
`sn_lt_time_dimension_3`.`the_year` as `TW_the_year`, 
`sn_lt_time_dimension_3`.`the_week_of_the_year` as `the_week_of_the_year`, 
count(distinct `sn_login`.`msisdn`) as `distinct_users`, 
sum(`sn_login`.`sessions`) as `sessions`, 
count(*) as `fact_count`
from 
`sn_login` as `sn_login`, 
`sn_lt_time_dimension` as `sn_lt_time_dimension_3`
where 
`sn_login`.`login_date_id` = `sn_lt_time_dimension_3`.`time_dimension_id`
group by 1,2,3
;

And also a higher-level one:

-- week ---------------------------------

DROP TABLE IF EXISTS `agg_week_sn_login`;

CREATE TABLE `agg_week_sn_login` (
`TW_the_year` INTEGER(11),
`the_week_of_the_year` INTEGER(11),
`distinct_users` INTEGER,
`sessions` DOUBLE,
`fact_count` INTEGER,
KEY `agg_week_TW_the_year` (`TW_the_year`),
KEY `agg_week_the_week_of_the_year` (`the_week_of_the_year`)    
);

INSERT INTO `agg_week_sn_login` (
`TW_the_year`,
`the_week_of_the_year`,
`distinct_users`,
`sessions`,
`fact_count`)
select 
`sn_lt_time_dimension_3`.`the_year` as `TW_the_year`, 
`sn_lt_time_dimension_3`.`the_week_of_the_year` as `the_week_of_the_year`, 
count(distinct `sn_login`.`msisdn`) as `distinct_users`, 
sum(`sn_login`.`sessions`) as `sessions`, 
count(*) as `fact_count`
from 
`sn_login` as `sn_login`, 
`sn_lt_time_dimension` as `sn_lt_time_dimension_3`
where 
`sn_login`.`login_date_id` = `sn_lt_time_dimension_3`.`time_dimension_id`
group by 1,2
;

Pay special attention to the naming:
  • The year column is named TW_the_year. Why the "TW_" prefix, you wonder? This is because the year actually shows up twice in our cube schema: in the Time Weekly and in the Time Monthly dimension. We defined a usagePrefix in the cube schema in order to distinguish between these two occurrences. Here in the table definition you have to use this prefix, otherwise Mondrian is not able to distinguish between the two occurrences of the year level and will throw an error.
  • Every table has a column called fact_count, which is a simple count.
  • Most columns are lower case. Also, have you noticed how our measure "Distinct Users" has become "distinct_users"?
  • All columns have the same name as they have in the dimension table (the year obviously carries the prefix).
  • All the table names follow the pattern agg_*_<your-fact-table-name>. This is very important!


Since we followed the rules given by Mondrian, these aggregate tables should be auto-detected, meaning you do not have to define them in the Mondrian schema. Since we set mondrian.rolap.aggregates.Read=true in mondrian.properties, Mondrian will, whenever we upload a new schema, scan our database for aggregate tables that follow the known patterns (rules) and register them.

For the purpose of a complete exercise, we will, however, explicitly define the aggregate tables in the Mondrian schema now. In this case you can set mondrian.rolap.aggregates.Read=false in mondrian.properties (see the snippet below).
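In that case the relevant part of mondrian.properties looks like this (aggregates are still used, they are just no longer searched for via the naming rules):

# keep using aggregate tables, but switch off auto-detection
mondrian.rolap.aggregates.Use=true
mondrian.rolap.aggregates.Read=false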

Now we have to add the following lines to the cube schema (the cube definition) within the table tags in order to tell Mondrian which tables to use:

<Table name="sn_login">
       <AggName name="agg_channel_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggForeignKey factColumn="login_channel_id" aggColumn="login_channel_id"/>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
       </AggName>
       <AggName name="agg_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
       </AggName>
</Table>
  
As you can see from this snippet, we are matching the column names of the aggregated table to the measures and levels that we defined in the schema itself. This way Mondrian can identify your columns and link them to a dimension/measure.

An important thing to notice here is that we actually don't have a level login_channel_id in the XML Schema (cube definition), but we can just use the AggForeignKey tag to reference it. Within the AggForeignKey you specify the name of the column in the fact table and the name of the same column in the aggregated table.
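In the snippet above, this is the element:

<AggForeignKey factColumn="login_channel_id" aggColumn="login_channel_id"/>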

Now save everything, export it to your solution folder on the BI Server, go back to the Pentaho user console, refresh the repository, clear the Mondrian cache and try to run the query again. In case you get an error message, check the Mondrian log files to figure out what went wrong. It is very easy to make mistakes, especially with spelling.
Once you succeed, you will see in the mondrian_sql.log that Mondrian is in fact using your aggregated tables (and certainly you will realize that the execution is way faster).

Aggregated table for country

We have the following MDX query, which gives us the top 3 countries by week:

SELECT 
[Measures].[Distinct Users]
ON COLUMNS,
Generate (

{LastPeriods(
6,
[Login Time Weekly].[{week1_year}].[{week1_weeknr}]
)}
,
{[Login Time Weekly].CurrentMember}*
TopCount(
[Country.All ISO Country Codes].Children,
3,

[Measures].[Distinct Users]

),
ALL
)
ON ROWS
FROM
[Logins]

Again, execute this query via JPivot, Analyzer or Saiku and watch the log. Then we decide to create the following table:

-- country / week -------------------------------

DROP TABLE IF EXISTS `agg_country_week_sn_login`;

CREATE TABLE `agg_country_week_sn_login` (
`iso_country_code` VARCHAR(100),
`TW_the_year` INTEGER(11),
`the_week_of_the_year` INTEGER(11),
`distinct_users` INTEGER,
`sessions` DOUBLE,
`fact_count` INTEGER,
KEY `agg_country_week_iso_country_code` (`iso_country_code`),
KEY `agg_country_week_TW_the_year` (`TW_the_year`),
KEY `agg_country_week_the_week_of_the_year` (`the_week_of_the_year`)    
);
 
 INSERT INTO `agg_country_week_sn_login` (
`iso_country_code`,
`TW_the_year`,
`the_week_of_the_year`,
`distinct_users`,
`sessions`,
`fact_count`)
select 
`sn_lt_countries`.`iso_country_code` as `iso_country_code`,
`sn_lt_time_dimension_3`.`the_year` as `TW_the_year`, 
`sn_lt_time_dimension_3`.`the_week_of_the_year` as `the_week_of_the_year`, 
count(distinct `sn_login`.`msisdn`) as `distinct_users`, 
sum(`sn_login`.`sessions`) as `sessions`, 
count(*) as `fact_count`
from 
`sn_login` as `sn_login`,
`sn_user` as `sn_user`,
`sn_lt_time_dimension` as `sn_lt_time_dimension_3`,
`sn_lt_countries` as `sn_lt_countries`
where
`sn_login`.`msisdn`=`sn_user`.`msisdn` AND
`sn_login`.`login_date_id` = `sn_lt_time_dimension_3`.`time_dimension_id` AND
`sn_lt_countries`.`country_id` = `sn_user`.`country_id`
group by 1,2,3
;

As you can see, this is a bit more complex, as we have to join four tables, and the iso_country_code is derived from joining three of them.

We add the following part to the cube definition (XML schema file):

<Table name="sn_login">
       <AggName name="agg_channel_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggForeignKey factColumn="login_channel_id" aggColumn="login_channel_id"/>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
       </AggName>
       <AggName name="agg_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
       </AggName>

        <AggName name="agg_country_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
           <AggLevel column="iso_country_code" name="[Country.All ISO Country Codes].[ISO Country Code]">
           </AggLevel>
       </AggName>

 </Table>

You see that in this case we don't make use of an ID, as the country ID is not directly stored in our fact table. We have again collapsed the time dimension to the week level. 
We save everything, export it to the BI Server and run our Xaction. If we get an error, we check the log; if not, we are pleased that we finally got everything running!

Make use of aggregates in other programs

Other programs or packages usually don't take advantage of Mondrian aggregated tables by default. Usually they provide a mondrian.properties file to configure the exact settings. Just open the properties file and enable the following properties:

# Allow the use of aggregates
mondrian.rolap.aggregates.Use=true
mondrian.rolap.aggregates.Read=true

Please find below a reference for some of the programs so that you can easily find the mondrian.properties file:
  • Pentaho Data Integration (Kettle): data-integration/libext/mondrian/config/mondrian.properties
  • Pentaho Report Designer: report-designer/resources/mondrian.properties
  • Schema Workbench: schema-workbench/mondrian.properties

This article is just intended as a brief introduction. In no way do I claim that this is the best way to create aggregated tables; it should only give you an idea of how it works, and you will have to find the best approach for your solution. I found no tutorial on the web on how to do this, and it cost me a lot of time to figure out how to get everything working, so I hope that with the help of this article you won't go through the same trouble. Again, creating everything manually is not easy, so always check your spelling etc.! Also read the excellent documentation by Julian Hyde, which you can find here.