Tutorial Aggregated Tables for Mondrian
(Republished Tutorial - originally published July 2009)
I started using the Pentaho BI Suite in early 2009. One of my main objectives was to create a multidimensional model for DSS (Decision Support Systems) and analysis. As the BI Server already ships with Mondrian, it is fairly easy to get started.
I created my first cube definition with the Schema Workbench. The cube definition can be saved as an XML file, which you can then export to your solution folder on the BI Server.
Soon we realized the need for aggregated tables, as some MDX queries took rather long. I had a look at the Pentaho Aggregation Designer 1.1 and thought that this would be a rather easy task.
One thing the Aggregation Designer does is check whether your schema (the cube definition XML file) is valid and whether your tables are all properly set up. This is a very useful feature, as it shows you any errors so that you can correct them.
I used the Advisor function within the Aggregation Designer, created all the recommended tables with the DDL and DML (which the Aggregation Designer generates automatically) and exported the new cube schema as well.
But nothing worked. I had assumed that the Aggregation Designer would set up everything, but I was quite wrong about that (at the time of writing, Aggregation Designer 1.1 was the current version), or maybe I was just doing something wrong. After some unsuccessful attempts I decided not to use the Aggregation Designer and to create everything from scratch instead. I am not recommending that you avoid the Aggregation Designer; it is definitely worth a look. I am just describing the approach that worked for me. I am also assuming that you use the standard Mondrian configuration.
Requirements
This article assumes that you are familiar with:
- MDX
- MySQL (or SQL in general)
- Pentaho BI Suite
- Pentaho Schema Workbench
- Pentaho Design Studio
Know what you are up to
While creating aggregated tables it makes sense to test whether they are actually used for the purpose you created them for. You might have an MDX query that takes quite long to execute and hence want to create a dedicated aggregated table for it. Have a look at mondrian_sql.log to see which SQL query Mondrian actually generates for your MDX; your aggregated table will look quite similar to it.
When you execute MDX queries, always have a look at these logs in the beginning. You will learn a lot!
How to enable MDX and SQL logging on BI Server 3
Whenever an MDX query is executed, Mondrian transforms it into a SQL query. In some cases you want to see what the resulting SQL query is, for example to check whether Mondrian is using aggregated tables. Go to \tomcat\webapps\pentaho\WEB-INF\classes and edit log4j.xml. Just uncomment the last half of the file where it says:
<!-- ========================================================= -->
<!-- Special Log File specifically for Mondrian -->
<!-- ========================================================= -->
<appender name="MONDRIAN" class="org.apache.log4j.RollingFileAppender">
<param name="File" value="mondrian.log"/>
<param name="Append" value="false"/>
<param name="MaxFileSize" value="500KB"/>
<param name="MaxBackupIndex" value="1"/>
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
</layout>
</appender>
<category name="mondrian">
<priority value="DEBUG"/>
<appender-ref ref="MONDRIAN"/>
</category>
<!-- ========================================================= -->
<!-- Special Log File specifically for Mondrian MDX Statements -->
<!-- ========================================================= -->
<appender name="MDXLOG" class="org.apache.log4j.RollingFileAppender">
<param name="File" value="mondrian_mdx.log"/>
<param name="Append" value="false"/>
<param name="MaxFileSize" value="500KB"/>
<param name="MaxBackupIndex" value="1"/>
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
</layout>
</appender>
<category name="mondrian.mdx">
<priority value="DEBUG"/>
<appender-ref ref="MDXLOG"/>
</category>
<!-- ========================================================= -->
<!-- Special Log File specifically for Mondrian SQL Statements -->
<!-- ========================================================= -->
<appender name="SQLLOG" class="org.apache.log4j.RollingFileAppender">
<param name="File" value="mondrian_sql.log"/>
<param name="Append" value="false"/>
<param name="MaxFileSize" value="500KB"/>
<param name="MaxBackupIndex" value="1"/>
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
</layout>
</appender>
<category name="mondrian.sql">
<priority value="DEBUG"/>
<appender-ref ref="SQLLOG"/>
</category>
If you want the SQL to be displayed in a more readable form, add this to the mondrian.properties in pentaho-solutions\system\mondrian (BI Server 3):
mondrian.rolap.generate.formatted.sql=true
Restart the server, then start JPivot, Analyzer or a similar OLAP client and perform an analysis. In the tomcat/bin directory you will now find the following new log files: mondrian.log, mondrian_mdx.log and mondrian_sql.log.
How to configure Mondrian to recognize aggregated tables
You also have to tell Mondrian that aggregated tables exist. The first step is to add the following lines to mondrian.properties, which you can find in pentaho-solutions\system\mondrian (BI Server 3):
# should Mondrian use aggregate tables?
mondrian.rolap.aggregates.Use=true
# should Mondrian search for aggregate tables using the standard prefix rules?
mondrian.rolap.aggregates.Read=true
Restart the server.
Furthermore, with the standard setup the table and column names have to follow certain conventions if they are to be auto-detected (auto-detection is enabled with the second property shown above). In that case you usually do not have to define the aggregate tables in the Mondrian schema at all. (The rules below are partly copied from Julian Hyde's excellent documentation.)
Rules for Aggregate Auto-Detection
Table Names
As an example of applying the aggregate table name rule, let the fact table be sales_fact_1997, the Sales cube's fact table from the FoodMart schema. Applying this specific fact table name to the regular expression template creates the following regular expression:
agg_.+_sales_fact_1997
This will match the following table names:
- agg_l_05_sales_fact_1997
- agg_c_14_sales_fact_1997
- agg_lc_100_sales_fact_1997
- agg_c_special_sales_fact_1997
- AGG_45_SALES_FACT_1997
- AGG_drop_time_id_sales_fact_1997
Column Names
- There has to be one column called "fact_count", which stores a plain count of the fact rows that were rolled up into each aggregate row. (This is a general requirement for aggregate tables, not specific to auto-detection.)
- If there are any ID columns in the aggregated table, they have to have the same names as in the fact or dimension table.
- Level columns:
- ${usage_prefix}${level_column_name}: If you have a level that is used more than once, e.g. Year in both a week hierarchy and a month hierarchy within the time dimension, then set a "usagePrefix" in the schema so that the two can be distinguished, e.g. usage prefix "TW_" for the year in the week hierarchy (results in "TW_Year") and "TM_" for the year in the month hierarchy (results in "TM_Year").
- ${hierarchy_name}_${level_name}
- ${hierarchy_name}_${level_column_name}
- ${level_column_name}
If any of these parameters have space characters, ' ', these are mapped to underscore characters, '_', and, similarly, dot characters, '.', are also mapped to underscores. So, if the hierarchy_name is "Time", level_name is "Month" and level_column_name is month_of_year, the possible aggregate table column names are:
- time_month
- time_month_of_year
- month_of_year
For this rule, the "hierarchy_name" and "level_name" are converted to lower case while the "level_column_name" must match exactly.
Make sure the column names are lower case!
- There are three patterns for matching aggregate columns to measures:
1) ${measure_name}: the fact table's cube's measure name, "measure_name",
2) ${measure_column_name}: the fact table's cube's measure column name, "measure_column_name", and
3) ${measure_column_name}_${aggregate_name}: the fact table's cube's measure's aggregator (sum, avg, max, etc.), "aggregate_name".
where the measure name is converted to lower case and both the measure column name and aggregate name are matched as they appear. If the fact table's cube's measure name is "Avg Unit Sales", the measure column name is "unit_sales" and the measure's aggregate name is "avg", then possible aggregate table column names that would match are:
- avg_unit_sales
- unit_sales
- unit_sales_avg
For example: a measure called "Distinct Users" has to be referenced as "distinct_users".
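To make these rules concrete, here is a hypothetical aggregate table for the FoodMart example above, collapsed to the Quarter level (a sketch of my own for illustration; the exact column set is made up, only the naming pattern matters):
-- Hypothetical sketch: the table name matches agg_.+_sales_fact_1997,
-- and the column names follow the rules described above
CREATE TABLE `agg_c_quarter_sales_fact_1997` (
`the_year` INT, -- ${level_column_name}, matched exactly
`quarter` VARCHAR(30), -- ${level_column_name} of the Quarter level
`unit_sales` DOUBLE, -- ${measure_column_name}
`store_sales` DOUBLE, -- ${measure_column_name}
`fact_count` INT -- mandatory count column
);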
Your first aggregated tables
We have the following tables. The fact table:
DROP TABLE IF EXISTS `pentaho`.`sn_login`;
CREATE TABLE `pentaho`.`sn_login` (
`msisdn` bigint(20) NOT NULL DEFAULT '0',
`login_date_id` bigint(20) NOT NULL DEFAULT '0',
`login_channel_id` bigint(20) NOT NULL DEFAULT '6' COMMENT 'Default Value Unknown',
`sessions` int(255) NOT NULL DEFAULT '0',
PRIMARY KEY (`msisdn`,`login_date_id`,`login_channel_id`),
KEY `FK_sn_login_date_id` (`login_date_id`),
KEY `FK_login_channel_id` (`login_channel_id`),
KEY `login_msisdn` (`msisdn`),
CONSTRAINT `FK_login_channel_id` FOREIGN KEY (`login_channel_id`) REFERENCES `sn_lt_communication_channels` (`communication_channel_id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `FK_login_msisdn` FOREIGN KEY (`msisdn`) REFERENCES `sn_user` (`msisdn`),
CONSTRAINT `FK_sn_login_date_id` FOREIGN KEY (`login_date_id`) REFERENCES `sn_lt_time_dimension` (`time_dimension_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The time dimension table:
DROP TABLE IF EXISTS `pentaho`.`sn_lt_time_dimension`;
CREATE TABLE `pentaho`.`sn_lt_time_dimension` (
`the_year` int(11) DEFAULT NULL,
`the_week_of_the_year` int(11) DEFAULT NULL,
`the_month` int(11) DEFAULT NULL,
`the_day_of_the_year` int(11) DEFAULT NULL,
`the_day_of_the_week` int(11) DEFAULT NULL,
`the_day_of_the_month` int(11) DEFAULT NULL,
`the_month_name` varchar(9) DEFAULT NULL,
`tenure_years` int(11) DEFAULT NULL,
`tenure_days` int(11) DEFAULT NULL,
`time_dimension_id` bigint(20) NOT NULL DEFAULT '0',
`the_date` date DEFAULT NULL,
`the_date_as_string` varchar(20) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`the_quarter` int(3) DEFAULT NULL,
PRIMARY KEY (`time_dimension_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The channel dimension table:
DROP TABLE IF EXISTS `pentaho`.`sn_lt_communication_channels`;
CREATE TABLE `pentaho`.`sn_lt_communication_channels` (
`communication_channel_id` bigint(20) NOT NULL AUTO_INCREMENT,
`channel_name` varchar(45) NOT NULL,
`channel_system_name` varchar(45) NOT NULL,
PRIMARY KEY (`communication_channel_id`)
) ENGINE=InnoDB AUTO_INCREMENT=9 DEFAULT CHARSET=latin1;
The user dimension table:
DROP TABLE IF EXISTS `pentaho`.`sn_user`;
CREATE TABLE `pentaho`.`sn_user` (
`msisdn` bigint(20) NOT NULL DEFAULT '0',
`sex` char(1) DEFAULT NULL,
`registration_date_id` bigint(20) NOT NULL DEFAULT '0',
`country_id` bigint(20) NOT NULL DEFAULT '0',
`birthday_date_id` bigint(20) NOT NULL,
`registration_channel_id` bigint(20) NOT NULL,
PRIMARY KEY (`msisdn`),
KEY `msisdn` (`msisdn`),
KEY `FK_birthday_date_id` (`birthday_date_id`),
KEY `FK_registration_date_id` (`registration_date_id`),
KEY `FK_country_id_2` (`country_id`),
KEY `FK_registration_channel_id` (`registration_channel_id`),
CONSTRAINT `FK_birthday_date_id` FOREIGN KEY (`birthday_date_id`) REFERENCES `sn_lt_time_dimension` (`time_dimension_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `FK_country_id_2` FOREIGN KEY (`country_id`) REFERENCES `sn_lt_countries` (`country_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `FK_registration_channel_id` FOREIGN KEY (`registration_channel_id`) REFERENCES `sn_lt_communication_channels` (`communication_channel_id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `FK_registration_date_id` FOREIGN KEY (`registration_date_id`) REFERENCES `sn_lt_time_dimension` (`time_dimension_id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The country dimension table:
DROP TABLE IF EXISTS `pentaho`.`sn_lt_countries`;
CREATE TABLE `pentaho`.`sn_lt_countries` (
`country_id` bigint(20) NOT NULL DEFAULT '0',
`country_name` tinytext,
`country_number` double DEFAULT NULL,
`iso_country_code` tinytext,
PRIMARY KEY (`country_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
We have created an XML schema (cube definition) that looks like this:
<Schema name="Test">
<Dimension type="TimeDimension" usagePrefix="TM_" name="Time Monthly">
<Hierarchy name="All Times Monthly" hasAll="true" allMemberName="All Times Monthly" allMemberCaption="All Times Monthly" primaryKey="time_dimension_id">
<Table name="sn_lt_time_dimension">
</Table>
<Level name="Year" column="the_year" type="Integer" uniqueMembers="true" levelType="TimeYears" hideMemberIf="Never">
</Level>
<Level name="Quarter" column="the_quarter" type="Integer" uniqueMembers="false" levelType="TimeQuarters" hideMemberIf="Never">
</Level>
<Level name="Month" column="the_month" nameColumn="the_month_name" type="String" uniqueMembers="false" levelType="TimeMonths" hideMemberIf="Never">
</Level>
<Level name="Day" column="the_day_of_the_month" type="Integer" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">
</Level>
</Hierarchy>
</Dimension>
<Dimension type="StandardDimension" name="Country">
<Hierarchy name="All Countries" hasAll="true" allMemberName="All Countries" allMemberCaption="All Countries" primaryKey="country_id">
<Table name="sn_lt_countries">
</Table>
<Level name="Country" column="country_name" type="String" uniqueMembers="false" levelType="Regular" hideMemberIf="Never">
</Level>
</Hierarchy>
</Dimension>
<Dimension type="TimeDimension" usagePrefix="TW_" name="Time Weekly">
<Hierarchy name="All Times Weekly" hasAll="true" allMemberName="All Times Weekly" allMemberCaption="All Times Weekly" primaryKey="time_dimension_id">
<Table name="sn_lt_time_dimension">
</Table>
<Level name="Year" column="the_year" type="Integer" uniqueMembers="true" levelType="TimeYears" hideMemberIf="Never">
</Level>
<Level name="Week" column="the_week_of_the_year" type="Integer" uniqueMembers="false" levelType="TimeWeeks" hideMemberIf="Never">
</Level>
<Level name="Day" column="the_day_of_the_week" type="Numeric" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">
</Level>
</Hierarchy>
</Dimension>
<Cube name="Users" cache="true" enabled="true">
<Table name="sn_user">
</Table>
<Dimension type="StandardDimension" name="Gender">
<Hierarchy name="All Genders" hasAll="true" allMemberName="All Genders" allMemberCaption="All Genders">
<Level name="Gender" column="sex" nameColumn="sex" type="String" uniqueMembers="true" levelType="Regular" hideMemberIf="Never">
</Level>
</Hierarchy>
</Dimension>
<Dimension type="StandardDimension" foreignKey="registration_channel_id" name="Registration Channel">
<Hierarchy name="All Registration Channels" hasAll="true" allMemberName="All Registration Channels" allMemberCaption="All Registration Channels" primaryKey="communication_channel_id">
<Table name="sn_lt_communication_channels">
</Table>
<Level name="Registration Channel" column="channel_name" type="String" uniqueMembers="false" levelType="Regular" hideMemberIf="Never">
</Level>
</Hierarchy>
</Dimension>
<Dimension type="StandardDimension" foreignKey="birthday_date_id" name="Age">
<Hierarchy name="All Ages" hasAll="true" allMemberName="All Ages" allMemberCaption="All Ages" primaryKey="time_dimension_id">
<Table name="sn_lt_time_dimension">
</Table>
<Level name="Age" column="age" type="Integer" uniqueMembers="false" levelType="Regular" hideMemberIf="Never">
</Level>
</Hierarchy>
</Dimension>
<DimensionUsage source="Time Monthly" name="Registration Time Monthly" caption="Registration Time Monthly" foreignKey="registration_date_id">
</DimensionUsage>
<DimensionUsage source="Country" name="Country" caption="Country" foreignKey="country_id">
</DimensionUsage>
<DimensionUsage source="Time Weekly" name="Registration Time Weekly" caption="Registration Time Weekly" foreignKey="registration_date_id">
</DimensionUsage>
<Measure name="New Registrations" column="msisdn" datatype="Numeric" aggregator="count" visible="true">
</Measure>
</Cube>
</Schema>
We are going to pay attention to three dimensions:
- Country: You can see that this dimension is derived from joining 3 tables: sn_login, sn_user and sn_lt_countries. In fact, our schema is a snowflake schema.
- Login Channel: This is a simple dimension which is derived from just one join: sn_login and sn_lt_communication_channels.
- Login Time: This dimension references a shared ("global") dimension defined at the schema level. A shared dimension can be referenced by several cubes in the same schema, which is very handy, as you don't have to recreate the same dimension over and over again.
Aggregated table for login channel
We basically want to see how many users log in per week on each of the channels. The MDX query looks like this:
WITH
MEMBER [Measures].[Prior Period Distinct Logins] AS
(
[Login Time Weekly].CurrentMember.PrevMember,
[Measures].[Distinct Users]
)
MEMBER [Measures].[Change in Distinct Logins] AS
(
[Measures].[Distinct Users] - [Measures].[Prior Period Distinct Logins]
)
MEMBER [Measures].[Percent Change] AS
(
([Measures].[Change in Distinct Logins]) /
([Measures].[Prior Period Distinct Logins])
)
, FORMAT_STRING = Iif(
([Measures].[Percent Change] < 0),
"|#.00%|style='red'|arrow='down'",
"|#.00%|style='green'|arrow='up'"
)
SELECT
NON EMPTY {
LastPeriods(
6,
[Login Time Weekly].[{week1_year}].[{week1_weeknr}]
)
}
ON COLUMNS,
NON EMPTY Union (
Crossjoin (
{[Login Channel].[All Login Channels]},
{[Measures].[Distinct Users],
[Measures].[Percent Change]
}
),
Crossjoin (
{[Login Channel].[All Login Channels].Children},
{[Measures].[Distinct Users]
}
)
)
ON ROWS
from [Logins]
We execute the query via JPivot, Analyzer or Saiku. Now we open mondrian_sql.log and see that Mondrian creates a lot of queries. Analyse these queries carefully.
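For the query above, the core rollup has the grain channel by week. Since I no longer have my original logs, here is a reconstruction of the shape of the SQL you should see (not a verbatim log excerpt; your column aliases will differ):
select
`sn_lt_time_dimension`.`the_year` as `c0`,
`sn_lt_time_dimension`.`the_week_of_the_year` as `c1`,
`sn_lt_communication_channels`.`channel_name` as `c2`,
count(distinct `sn_login`.`msisdn`) as `m0`,
sum(`sn_login`.`sessions`) as `m1`
from
`sn_login` as `sn_login`,
`sn_lt_time_dimension` as `sn_lt_time_dimension`,
`sn_lt_communication_channels` as `sn_lt_communication_channels`
where
`sn_login`.`login_date_id` = `sn_lt_time_dimension`.`time_dimension_id`
and `sn_login`.`login_channel_id` = `sn_lt_communication_channels`.`communication_channel_id`
group by
`sn_lt_time_dimension`.`the_year`,
`sn_lt_time_dimension`.`the_week_of_the_year`,
`sn_lt_communication_channels`.`channel_name`
Notice how closely the aggregate table definitions below mirror this query; that is exactly the point.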
Having looked at the queries in mondrian_sql.log, I decided to create the following aggregated tables:
-- channel / week -------------------------------
DROP TABLE IF EXISTS `agg_channel_week_sn_login`;
CREATE TABLE `agg_channel_week_sn_login` (
`login_channel_id` BIGINT(20),
`TW_the_year` INTEGER(11),
`the_week_of_the_year` INTEGER(11),
`distinct_users` INTEGER,
`sessions` DOUBLE,
`fact_count` INTEGER,
KEY `agg_channel_week_login_channel_id` (`login_channel_id`),
KEY `agg_channel_week_TW_the_year` (`TW_the_year`),
KEY `agg_channel_week_the_week_of_the_year` (`the_week_of_the_year`)
);
INSERT INTO `agg_channel_week_sn_login` (
`login_channel_id`,
`TW_the_year`,
`the_week_of_the_year`,
`distinct_users`,
`sessions`,
`fact_count`)
select
`sn_login`.`login_channel_id` as `login_channel_id`,
`sn_lt_time_dimension_3`.`the_year` as `TW_the_year`,
`sn_lt_time_dimension_3`.`the_week_of_the_year` as `the_week_of_the_year`,
count(distinct `sn_login`.`msisdn`) as `distinct_users`,
sum(`sn_login`.`sessions`) as `sessions`,
count(*) as `fact_count`
from
`sn_login` as `sn_login`,
`sn_lt_time_dimension` as `sn_lt_time_dimension_3`
where
`sn_login`.`login_date_id` = `sn_lt_time_dimension_3`.`time_dimension_id`
group by 1,2,3
;
And also a higher-level one:
-- week ---------------------------------
DROP TABLE IF EXISTS `agg_week_sn_login`;
CREATE TABLE `agg_week_sn_login` (
`TW_the_year` INTEGER(11),
`the_week_of_the_year` INTEGER(11),
`distinct_users` INTEGER,
`sessions` DOUBLE,
`fact_count` INTEGER,
KEY `agg_week_TW_the_year` (`TW_the_year`),
KEY `agg_week_the_week_of_the_year` (`the_week_of_the_year`)
);
INSERT INTO `agg_week_sn_login` (
`TW_the_year`,
`the_week_of_the_year`,
`distinct_users`,
`sessions`,
`fact_count`)
select
`sn_lt_time_dimension_3`.`the_year` as `TW_the_year`,
`sn_lt_time_dimension_3`.`the_week_of_the_year` as `the_week_of_the_year`,
count(distinct `sn_login`.`msisdn`) as `distinct_users`,
sum(`sn_login`.`sessions`) as `sessions`,
count(*) as `fact_count`
from
`sn_login` as `sn_login`,
`sn_lt_time_dimension` as `sn_lt_time_dimension_3`
where
`sn_login`.`login_date_id` = `sn_lt_time_dimension_3`.`time_dimension_id`
group by 1,2
;
Pay special attention to the naming:
- The year column is named TW_the_year. Why the "TW_" prefix, you wonder? Because the year actually shows up twice in our cube schema: in the Time Weekly and in the Time Monthly dimension. We defined a usagePrefix in the cube schema in order to distinguish between these two occurrences. Here in the table definition you have to use this prefix as well, otherwise Mondrian is not able to tell the two occurrences of the year level apart and will throw an error.
- Every table has a column called fact_count, which holds a plain count of the rolled-up fact rows (see the sanity check after this list).
- Most columns are lower case. Also, have you noticed how our measure "Distinct Users" has turned into "distinct_users"?
- All columns have the same name as in the dimension table (the year obviously carries the prefix).
- All table names follow the pattern agg_<something>_<fact-table-name>, i.e. they match agg_.+_sn_login. This is very important!
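Before touching the schema, it is worth verifying that the aggregates are consistent with the fact table. A quick sanity check (my own addition, not part of the original workflow): since every login row is rolled up exactly once, the summed fact_count of each aggregate must equal the row count of the fact table:
-- all three numbers must be identical for a lossless rollup
select
(select count(*) from `sn_login`) as fact_rows,
(select sum(`fact_count`) from `agg_week_sn_login`) as agg_week_rows,
(select sum(`fact_count`) from `agg_channel_week_sn_login`) as agg_channel_week_rows;
If the numbers differ, one of the INSERT statements dropped or duplicated rows (e.g. through a bad join), and Mondrian would silently return wrong figures from that aggregate.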
Since we followed the rules given by Mondrian, these aggregate tables should be auto-detected, meaning you do not have to define them in the Mondrian schema. Because we set mondrian.rolap.aggregates.Read=true in mondrian.properties, Mondrian will scan our database for aggregate tables that follow the known patterns (rules) whenever we upload a new schema, and detect and register them.
For the sake of a complete exercise, however, we will now define the aggregate tables explicitly in the Mondrian schema. In this case you can set mondrian.rolap.aggregates.Read=false in mondrian.properties.
Now we have to add the following lines to the cube schema (the cube definition) within the Table tags in order to tell Mondrian which aggregate tables to use:
<Table name="sn_login">
<AggName name="agg_channel_week_sn_login">
<AggFactCount column="fact_count">
</AggFactCount>
<AggForeignKey factColumn="login_channel_id" aggColumn="login_channel_id"/>
<AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
</AggMeasure>
<AggMeasure column="sessions" name="[Measures].[Sessions]">
</AggMeasure>
<AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
</AggLevel>
<AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
</AggLevel>
</AggName>
<AggName name="agg_week_sn_login">
<AggFactCount column="fact_count">
</AggFactCount>
<AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
</AggMeasure>
<AggMeasure column="sessions" name="[Measures].[Sessions]">
</AggMeasure>
<AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
</AggLevel>
<AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
</AggLevel>
</AggName>
</Table>
As you can see from this schema, we are matching the column names of the aggregated table with the measures and dimensions that we defined in the schema itself. This way Mondrian can identify your columns and link them to a dimension or measure.
An important thing to notice here is that we don't actually have a level login_channel_id in the XML schema (cube definition), but we can just use the AggForeignKey tag to reference it. Within AggForeignKey you specify the name of the column in the fact table and the name of the same column in the aggregated table.
Now save everything, export everything to your solution folder on the BI Server, go back to the Pentaho web interface, refresh the repository, clear the Mondrian cache and try to run the query again. In case you get an error message, check the Mondrian log files to figure out what went wrong. It is very easy to make mistakes, especially with the spelling.
Once you succeed, you will see in mondrian_sql.log that Mondrian is in fact using your aggregated tables (and you will certainly notice that execution is much faster).
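If you are curious what that looks like: instead of joining sn_login against the dimension tables, Mondrian now reads the pre-aggregated rows directly. For a week-level query you should see something of roughly this shape (again a reconstruction, not a verbatim log excerpt):
select
`agg_week_sn_login`.`TW_the_year` as `c0`,
`agg_week_sn_login`.`the_week_of_the_year` as `c1`,
`agg_week_sn_login`.`distinct_users` as `m0`,
`agg_week_sn_login`.`sessions` as `m1`
from
`agg_week_sn_login` as `agg_week_sn_login`
No joins and no scan over the full fact table any more; that is where the speed-up comes from.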
Aggregated table for country
We have the following MDX query, which gives us the top 3 countries by week:
SELECT
[Measures].[Distinct Users]
ON COLUMNS,
Generate (
{LastPeriods(
6,
[Login Time Weekly].[{week1_year}].[{week1_weeknr}]
)}
,
{[Login Time Weekly].CurrentMember}*
TopCount(
[Country.All ISO Country Codes].Children,
3,
[Measures].[Distinct Users]
),
ALL
)
ON ROWS
FROM
[Logins]
Again, execute this query via JPivot, Analyzer or Saiku and watch the log. Then we decide to create the following table:
-- country / week -------------------------------
DROP TABLE IF EXISTS `agg_country_week_sn_login`;
CREATE TABLE `agg_country_week_sn_login` (
`iso_country_code` VARCHAR(100),
`TW_the_year` INTEGER(11),
`the_week_of_the_year` INTEGER(11),
`distinct_users` INTEGER,
`sessions` DOUBLE,
`fact_count` INTEGER,
KEY `agg_country_week_iso_country_code` (`iso_country_code`),
KEY `agg_country_week_TW_the_year` (`TW_the_year`),
KEY `agg_country_week_the_week_of_the_year` (`the_week_of_the_year`)
);
INSERT INTO `agg_country_week_sn_login` (
`iso_country_code`,
`TW_the_year`,
`the_week_of_the_year`,
`distinct_users`,
`sessions`,
`fact_count`)
select
`sn_lt_countries`.`iso_country_code` as `iso_country_code`,
`sn_lt_time_dimension_3`.`the_year` as `TW_the_year`,
`sn_lt_time_dimension_3`.`the_week_of_the_year` as `the_week_of_the_year`,
count(distinct `sn_login`.`msisdn`) as `distinct_users`,
sum(`sn_login`.`sessions`) as `sessions`,
count(*) as `fact_count`
from
`sn_login` as `sn_login`,
`sn_user` as `sn_user`,
`sn_lt_time_dimension` as `sn_lt_time_dimension_3`,
`sn_lt_countries` as `sn_lt_countries`
where
`sn_login`.`msisdn`=`sn_user`.`msisdn` AND
`sn_login`.`login_date_id` = `sn_lt_time_dimension_3`.`time_dimension_id` AND
`sn_lt_countries`.`country_id` = `sn_user`.`country_id`
group by 1,2,3
;
As you can see, this one is a bit more complex, as we have to join 4 tables, and the iso_country_code itself is derived by joining 3 tables.
We add the following part to the cube definition (XML schema file):
<Table name="sn_login">
<AggName name="agg_channel_week_sn_login">
<AggFactCount column="fact_count">
</AggFactCount>
<AggForeignKey factColumn="login_channel_id" aggColumn="login_channel_id"/>
<AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
</AggMeasure>
<AggMeasure column="sessions" name="[Measures].[Sessions]">
</AggMeasure>
<AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
</AggLevel>
<AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
</AggLevel>
</AggName>
<AggName name="agg_week_sn_login">
<AggFactCount column="fact_count">
</AggFactCount>
<AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
</AggMeasure>
<AggMeasure column="sessions" name="[Measures].[Sessions]">
</AggMeasure>
<AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
</AggLevel>
<AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
</AggLevel>
</AggName>
<AggName name="agg_country_week_sn_login">
<AggFactCount column="fact_count">
</AggFactCount>
<AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
</AggMeasure>
<AggMeasure column="sessions" name="[Measures].[Sessions]">
</AggMeasure>
<AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
</AggLevel>
<AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
</AggLevel>
<AggLevel column="iso_country_code" name="[Country.All ISO Country Codes].[ISO Country Code]">
</AggLevel>
</AggName>
</Table>
You can see that in this case we don't make use of an ID, as the country ID is not stored directly in our fact table. We have again collapsed the time dimension to the week level.
We save everything, export it to the BI Server and run our xaction. If we get an error, we check the log; if not, we are pleased that we finally got everything running!
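If everything went right, the country query in mondrian_sql.log should not need a single join any more, because the snowflaked country level was collapsed straight into the aggregate. Roughly (a reconstruction, not a verbatim log excerpt):
select
`agg_country_week_sn_login`.`iso_country_code` as `c0`,
`agg_country_week_sn_login`.`TW_the_year` as `c1`,
`agg_country_week_sn_login`.`the_week_of_the_year` as `c2`,
`agg_country_week_sn_login`.`distinct_users` as `m0`
from
`agg_country_week_sn_login` as `agg_country_week_sn_login`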
Make use of aggregates in other programs
Other programs or packages usually don't take advantage of Mondrian aggregated tables by default. They usually ship with their own mondrian.properties file for configuring the exact settings. Just open that properties file and enable the following properties:
# Allow the use of aggregates
mondrian.rolap.aggregates.Use=true
mondrian.rolap.aggregates.Read=true
Here is a reference for some of the programs, so that you can easily find the mondrian.properties file:
- Pentaho Data Integration (Kettle): data-integration/libext/mondrian/config/mondrian.properties
- Pentaho Report Designer: report-designer/resources/mondrian.properties
- Schema Workbench: schema-workbench/mondrian.properties
This article is only intended as a brief introduction. In no way do I claim that this is the best way to create aggregated tables; it should just give you an idea of how it all works, and you will have to find the best way for your own solution. I found no tutorial on the web on how to do this, and it took me a lot of time to figure out how to get everything working, so I hope that with the help of this article you won't go through the same troubles. Again, creating everything manually is not easy, so always check your spelling etc.! Also read the excellent aggregate table documentation by Julian Hyde.
Comments
I'd like to use your tutorial, but unfortunately there are several issues:
*) At the beginning you call the sn_login table the fact table, but the OLAP schema you provided as a basis uses the sn_user table as its fact table
*) the MDX query you provided as an example fails
*) the aggregate definitions you mention are built under the sn_login table, while the cube, as I wrote above, again uses sn_user
So I am not sure how to work with your sample
Andriy
Thanks for your feedback! I don't think I have the original files any more, as it was quite long ago that I wrote this. I do remember that in this case the user table could act as a fact as well as a dimension table. But yes, it doesn't help the tutorial; I really should have copied over the sn_login cube schema. Nevertheless, I hope you could still take some learnings from it.
Hi, I have created an aggregate table using the Aggregation Designer and published it on the server. When I open JPivot for that cube and expand any of the dimensions, I get an error. When I look at the mondrian_sql log file, it shows a query like the one below. In that query there is a WHERE clause that refers to the aggregate table, but the name of the aggregate table does not come up in the FROM section. Am I missing any specific configuration?
select
`Dim_Time`.`Time_Id` as `c0`,
`Dim_Time`.`month_no` as `c1`
from
`Dim_Time` as `Dim_Time`
where
`Dim_Time`.`Time_Id` = `AggTable1`.`Dim_Time_Month_(Key)`
group by
`Dim_Time`.`Time_Id`,
`Dim_Time`.`month_no`
order by
ISNULL(`Dim_Time`.`Time_Id`) ASC, `Dim_Time`.`Time_Id` ASC
It looks like the result of a logical problem in your Mondrian schema. Apart from that, I assume you have enabled aggregate tables in the mondrian.properties on your server?