SeETL039 – Multi-Level Fact Table Summaries Using SQL

0
53

Hello and welcome to this latest blog post!

(Note from Peter: Since I am no longer welcome on Linkedin? Please share this post with your colleagues who build data warehouses for a living.)

Well this post has been a loooooong time coming.

Everyone knows that I was trained at Metaphor Computer Systems in the design techniques that Ralph Kimball later made popular through his series of books. But prior to 1995 the mechanisms by which the Metaphor Data Warehouse Developers built their ETL systems were a closely guarded secret.

Even though I was a personal friend of the Metaphor CEO, Cathy Selleck, and I was really the only person pushing Metaphor in Australia, I was not granted the privilege of learning how to build the ETL for dimensional models when I was visiting Metaphors head office in 1993.

So I had to learn the hard way!

Learning how to build multi-level dimensional data models the “Ralph Kimball way” took some time. The cobol code (remember cobol?) was really hard to write the first time.

I soon realised that if I could produce ETL software that made this ETL development easy I might make some money out of that software. This software I built back in 1995 is what has evolved in to the free open source #SeETL product today, 2019.

And if you are interested? Yes, I made a lot of money out of the SeETL product because om the 90s we used it to sell million dollar deals to billion dollar companies. We could build the ETL subsystem for big companies for a much lower cost than anyone else. #SeETL was the “secret sauce” we used to do that.

I built my first terabyte data warehouse using #SeETL Cobol in 1997. It was a great success. We embedded my cobol software in to the Hitachi Data Systems Data Warehousing Offering and we won many large deals with an end to end solution that by that time included Brio Query.

One of the major features of the “Ralph Kimball” way of doing things in the 90s was to have multi-level dimension tables and multi-level fact tables. Indeed, Ralphs database product Red Brick had these features built right in to the database manager.

Since it was hard to sell Red Brick against Oracle and DB2 in Australia in the 90s my #SeETL software achieved the same result as Red Brick only as cobol running on Oracle and DB2.

Brio Query suited the data models we were building very well because the founder of Brio Query was ex-Metaphor. And the final CEO of Metaphor, Chris Grejtak, moved over to Brio Query in 1996. I sold the first 1,000 seat Brio-Query customer in Australia in 1997.

One of the big differentiators I was able to implement for my customers in the 90s was “multi-level dimensions” and “multi-level fact tables.”

“Multi-Level” data gave performance boosts to all queries and in the 90s query performance was an issue.

Ralph and I talked about these things as a “great idea” on the dwlist forum but we were just not getting any real attention on the subject.

By 2003 both Ralph and I had pretty much given up trying to talk to people about why multi-level data inside the data warehouse database was a good idea. So many people were going to SSAS and Essbase Cubes we just wished them well on that journey!

In 2002 I re-wrote the old #SeETL Cobol product in to the then more modern C++. In 2004 we created the innovation of using the mapping workbooks as source code. Storing the workbook as xml and reading it with VB to generate all the objects needed.

Of course, in that period I also converted the cobol code that was needed to manage multi-level data across to C++. During that migration we added a lot of new features to increase the scalability of the attribution, aggregation and consolidation processing.

With the new C++ version we no longer had “generated code”. We had C++ programs that discovered the data structures of the source and target tables and did all the processing. We went from the situation with cobol of needing one program per function per mapping to just one program per function.

Whereas in the cobol #SeETL product it was not at all unusual to generate 300,000+ lines of cobol code that would then need to be maintained. On one project we even broke the 500,000 lines of generated cobol code line!

In the new 2002 C++ #SeETL, the version you can download today, we did not generate any code. We had programs that would “internally configure” themselves to adapt to each mapping. We separated the algorithm from the data structures.

This is what made C++ #SeETL so different. And this is why I decided to sell it for EUR20,000 per copy in 2003!

The attribution, aggregation, consolidation processing was MUCH improved across many dimensions. It ran faster. It scaled further. It ran on more hardware. It talked to more databases.

Indeed, in 2003 we used #SeETL to build the prototype data warehouse for Saudi Telecom who had 20 million customers and 60 million CDRs per day. When we took delivery of our 18 CPU Sun 12K and 15TB EMC storage system we put #SeETL on to it and ran it through it’s paces.

Using just 6 CPUs we could get the CDR processing done in about 6 hours. C++ #SeETL could be used all the way up to telcos quite successfully. We used it again on Orange Romania in 2005, Electronic Arts in 2006, Carphone Warehouse in 2008, and SkyTalk in 2010.

I also sold copies to consulting companies who wanted to do ETL but did not want to buy DataStage or Informatica which were licensed by CPUs back in those days.

When we did Carphone Warehouse in 2009 Brian Ganly suggested that rather than use the C++ version of #SeETL we could generate the ETL subsystem as SQL and run it on Netezza. I resisted this as “totally crazy” for a couple of months. But Brian can be very persuasive and so I finally spent a weekend testing this theory. That Netezza was so fast that it could do the 80M CDR batch of records in one statement for the attribution processing.

In testing this was taking about 5 hours to get through on the development machine. We were running C++ #SeETL on a linux server and accessing the data on the netezza machine and then reloading the finished records on to the netezza machine.

In the testing I did on that fateful weekend the small development netezza machine was able to process the 80 million CDRs with 20 dimension table lookups in just 20 minutes.

On the production machine it was 5 minutes.

Brian Ganlys persistence had paid off. Sean Kelly and I were now believers that ETL subsystems could be written in SQL for Netezza.

There was, of course, one problem. The multi-level data C++ code could not be converted easily in to SQL because of the large amount of functionality that was included in it.

People who had Netezza didn’t care because all queries were fast anyway!

But for the people on SQL Server, this was still an issue. And so we continued to use the C++ version of #SeETL on SQL Server projects in order to get the mult-level summary fact table functionality.

Over the years since 2009 I have made a number of attempts to figure out if I could generate the code needed to create multi-level fact tables in SQL. Each time I ended up thinking “nope, can’t figure that one out”.

I was recently asked a different question that prompted a new way of looking at that code and today I decide today was the day I would have another crack at this problem.

And I cracked the case!!

As funny as it is now. The way to make this work is deceptively simple but it required me thinking about how to perform the processing very differently to how the processing has “always been done”. My problem over the last 10 years is every time I looked at trying to solve this problem as SQL I tried to emulate the “way we have always done it” in SQL and not start with a “blank slate”.

So here for your reading enjoyment is how you can create a multi-level fact table for sales transactions using SQL. Note that we are still using the C++ programs to create the multi-level dimensions. But since they are incrementally updated and dimension tables are so small we have no intention of trying to migrate that to SQL.

Firstly, lets check the #SeETL aggregation control table. It looks like this:


create table dbo.ctl_aggregation_control
(
pk_aggregate_number integer
, run_type varchar(20)
, fact_table_name varchar(256)
, number_of_dimensions integer
, level_dimension_1 integer
, level_dimension_2 integer
, level_dimension_3 integer
, level_dimension_4 integer
, level_dimension_5 integer
...
, level_dimension_50 integer

You give each row an aggregate number which must be unique. We generally use number ranges of 100 for each fact table. This allows 100 summary levels per fact table.

The run type is set to ‘always’ for daily processing.

The fact table name is the #SeETL mapping name for the fact table.

You tell it how many dimensions there are in this fact table. You have a maximum of 50.

You then tell it the level of aggregate in each dimension starting at dimension 1.

Each row in this table translates in to one summary level inside one fact table.

So if your first dimension is time. And the detailed level is “day” and level 1 is “week” and level 2 is “month” and level 3 is “quarter” and so on.

If you put 2 in level_dimension_1 then that level of aggregate will summarise to monthly data.

To show you how the levels of the keys work. td0005 is our day dimension table in BI4ALL. It has 9 summary level keys on it as well as the detailed level keys. The attibution view looks like the following.


create view [dbo].[z01_vm_day_01_at] as select
td0005.dim_char_ky_fld pk_dim_char_ky_fld
,td0005.pk_td0005 dk_z01_vm_day_01
,td0005.td0005_key_ag1 z01_vm_day_01_key_ag1
,td0005.td0005_key_ag2 z01_vm_day_01_key_ag2
,td0005.td0005_key_ag3 z01_vm_day_01_key_ag3
,td0005.td0005_key_ag4 z01_vm_day_01_key_ag4
,td0005.td0005_key_ag5 z01_vm_day_01_key_ag5
,td0005.td0005_key_ag6 z01_vm_day_01_key_ag6
,td0005.td0005_key_ag7 z01_vm_day_01_key_ag7
,td0005.td0005_key_ag8 z01_vm_day_01_key_ag8
,td0005.td0005_key_ag9 z01_vm_day_01_key_ag9
from dbo.td0005 td0005
where level_col = 'detail'

So you can see that the aggergate keys are on the lookup table / view for each dimension that has aggregates possible.

So now lets look at the sql statement that is needed to perform this processing. I will put my comments inside the code.

Firstly, just truncate the work tables.


truncate table xxxxx.dbo.z01_vf_sale_txn_03_swk1 ;

truncate table xxxxx.dbo.z01_vf_sale_txn_03_swk2 ;

insert into xxxxx.dbo.z01_vf_sale_txn_03_swk1
(
pk_aggregate_number
,pk_z01_vm_day_01
,pk_z01_vm_product_01
,pk_z01_vm_customer_01
,pk_z01_vm_customer_demographic_01
,pk_z01_vm_geography_01
,pk_z01_vm_currency_01
,pk_z01_vm_sale_txn_type_01
,pk_z01_vm_sale_txn_status_01
,pk_z01_vm_unit_of_measure_01
,sale_extended_amount
,cost_extended_amount
,tax1_extended_amount
,tax2_extended_amount
,discount_extended_amount
,sale_units
,number_sales
)
select

-- select out the aggregate number as it goes on the target summary table

ctl_aggregation_control.pk_aggregate_number

-- this was the BIG innovation. Rather than moving the key forward in the generated code which is how this has
-- "always been done". I used a case statement that selected the right key at run time.
-- This was the idea that had always escaped me. It's so simple once you do it!
--
-- Notice that we are looking at the ctl_aggregation_control.level_dimension_1 to find the level of the key to use
-- based on that level we retrieve the correct key from the lookup table and send it forward.
--
,case ctl_aggregation_control.level_dimension_1
when 0 then coalesce(dk_z01_vm_day_01, 0 )
when 1 then coalesce(z01_vm_day_01_key_ag1, 0 )
when 2 then coalesce(z01_vm_day_01_key_ag2, 0 )
when 3 then coalesce(z01_vm_day_01_key_ag3, 0 )
when 4 then coalesce(z01_vm_day_01_key_ag4, 0 )
when 5 then coalesce(z01_vm_day_01_key_ag5, 0 )
when 6 then coalesce(z01_vm_day_01_key_ag6, 0 )
when 7 then coalesce(z01_vm_day_01_key_ag7, 0 )
when 8 then coalesce(z01_vm_day_01_key_ag8, 0 )
when 9 then coalesce(z01_vm_day_01_key_ag9, 0 )
else 0
end

-- Same for products

,case ctl_aggregation_control.level_dimension_2
when 0 then coalesce(dk_z01_vm_product_01, 0 )
when 1 then coalesce(z01_vm_product_01_key_ag1, 0 )
when 2 then coalesce(z01_vm_product_01_key_ag2, 0 )
when 3 then coalesce(z01_vm_product_01_key_ag3, 0 )
when 4 then coalesce(z01_vm_product_01_key_ag4, 0 )
when 5 then coalesce(z01_vm_product_01_key_ag5, 0 )
when 6 then coalesce(z01_vm_product_01_key_ag6, 0 )
when 7 then coalesce(z01_vm_product_01_key_ag7, 0 )
when 8 then coalesce(z01_vm_product_01_key_ag8, 0 )
when 9 then coalesce(z01_vm_product_01_key_ag9, 0 )
else 0
end

-- Same for partys

,case ctl_aggregation_control.level_dimension_3
when 0 then coalesce(dk_z01_vm_party_1001, 0 )
when 1 then coalesce(z01_vm_party_1001_key_ag1, 0 )
when 2 then coalesce(z01_vm_party_1001_key_ag2, 0 )
when 3 then coalesce(z01_vm_party_1001_key_ag3, 0 )
when 4 then coalesce(z01_vm_party_1001_key_ag4, 0 )
when 5 then coalesce(z01_vm_party_1001_key_ag5, 0 )
when 6 then coalesce(z01_vm_party_1001_key_ag6, 0 )
when 7 then coalesce(z01_vm_party_1001_key_ag7, 0 )
when 8 then coalesce(z01_vm_party_1001_key_ag8, 0 )
when 9 then coalesce(z01_vm_party_1001_key_ag9, 0 )
else 0
end

-- Same for demographics

,case ctl_aggregation_control.level_dimension_4
when 0 then coalesce(dk_z01_vm_party_demographic_1001, 0 )
when 1 then coalesce(z01_vm_party_demographic_1001_key_ag1, 0 )
when 2 then coalesce(z01_vm_party_demographic_1001_key_ag2, 0 )
when 3 then coalesce(z01_vm_party_demographic_1001_key_ag3, 0 )
when 4 then coalesce(z01_vm_party_demographic_1001_key_ag4, 0 )
when 5 then coalesce(z01_vm_party_demographic_1001_key_ag5, 0 )
when 6 then coalesce(z01_vm_party_demographic_1001_key_ag6, 0 )
when 7 then coalesce(z01_vm_party_demographic_1001_key_ag7, 0 )
when 8 then coalesce(z01_vm_party_demographic_1001_key_ag8, 0 )
when 9 then coalesce(z01_vm_party_demographic_1001_key_ag9, 0 )
else 0
end

-- Same for geography

,case ctl_aggregation_control.level_dimension_5
when 0 then coalesce(dk_z01_vm_geography_01, 0 )
when 1 then coalesce(z01_vm_geography_01_key_ag1, 0 )
when 2 then coalesce(z01_vm_geography_01_key_ag2, 0 )
when 3 then coalesce(z01_vm_geography_01_key_ag3, 0 )
when 4 then coalesce(z01_vm_geography_01_key_ag4, 0 )
when 5 then coalesce(z01_vm_geography_01_key_ag5, 0 )
when 6 then coalesce(z01_vm_geography_01_key_ag6, 0 )
when 7 then coalesce(z01_vm_geography_01_key_ag7, 0 )
when 8 then coalesce(z01_vm_geography_01_key_ag8, 0 )
when 9 then coalesce(z01_vm_geography_01_key_ag9, 0 )
else 0
end

-- And where there are no aggregation levels available we do not need the case statement

,coalesce(dk_z01_vm_currency_01, 0 )
,coalesce(dk_z01_vm_sale_txn_type_01, 0 )
,coalesce(dk_z01_vm_sale_txn_status_01, 0 )
,coalesce(dk_z01_vm_unit_of_measure_01, 0 )

-- And we send the data forward out of the #SeETL generated view.

,sale_extended_amount
,cost_extended_amount
,tax1_extended_amount
,tax2_extended_amount
,discount_extended_amount
,sale_units
,number_sales

-- This is an input view that is created in the #SeETL workbook

from xxxxx.dbo.z01_vf_sale_txn_03 z01_vf_sale_txn_03

-- We inner join to the aggregation control table using the mapping name.
-- This is repeated here for documentation and to make it obvious which table is being processed.

inner join xxxxx.dbo.ctl_aggregation_control ctl_aggregation_control on 1=1
and ctl_aggregation_control.fact_table_name = 'z01_vf_sale_txn_03'

-- We left join on the dimension table lookup views. These might also be extract out in to their own
-- small lookup tables to get better processing speeds.

left join xxxxx.dbo.z01_vm_day_01_at z01_vm_day_01_at
on z01_vf_sale_txn_03.char_key_z01_vm_day_01 = z01_vm_day_01_at.pk_dim_char_ky_fld

left join xxxxx.dbo.z01_vm_product_01_at z01_vm_product_01_at
on z01_vf_sale_txn_03.char_key_z01_vm_product_01 = z01_vm_product_01_at.pk_dim_char_ky_fld

left join xxxxx.dbo.z01_vm_party_1001_at z01_vm_party_1001_at
on z01_vf_sale_txn_03.char_key_z01_vm_party_1001 = z01_vm_party_1001_at.pk_dim_char_ky_fld

left join xxxxx.dbo.z01_vm_party_demographic_1001_at z01_vm_party_demographic_1001_at
on z01_vf_sale_txn_03.char_key_z01_vm_party_demographic_1001 = z01_vm_party_demographic_1001_at.pk_dim_char_ky_fld

left join xxxxx.dbo.z01_vm_geography_01_at z01_vm_geography_01_at
on z01_vf_sale_txn_03.char_key_z01_vm_geography_01 = z01_vm_geography_01_at.pk_dim_char_ky_fld

left join xxxxx.dbo.z01_vm_currency_01_at z01_vm_currency_01_at
on z01_vf_sale_txn_03.char_key_z01_vm_currency_01 = z01_vm_currency_01_at.pk_dim_char_ky_fld

left join xxxxx.dbo.z01_vm_sale_txn_type_01_at z01_vm_sale_txn_type_01_at
on z01_vf_sale_txn_03.char_key_z01_vm_sale_txn_type_01 = z01_vm_sale_txn_type_01_at.pk_dim_char_ky_fld

left join xxxxx.dbo.z01_vm_sale_txn_status_01_at z01_vm_sale_txn_status_01_at
on z01_vf_sale_txn_03.char_key_z01_vm_sale_txn_status_01 = z01_vm_sale_txn_status_01_at.pk_dim_char_ky_fld

left join xxxxx.dbo.z01_vm_unit_of_measure_01_at z01_vm_unit_of_measure_01_at
on z01_vf_sale_txn_03.char_key_z01_vm_unit_of_measure_01 = z01_vm_unit_of_measure_01_at.pk_dim_char_ky_fld

where 1=1

-- I have repeated the constraint on the fact table name just for documentation purposes

and ctl_aggregation_control.fact_table_name = 'z01_vf_sale_txn_03'

-- I have put the aggregate number in to the query just for documentation purposes.

and ctl_aggregation_control.pk_aggregate_number in ( 101 , 102 , 103 , 104 , 105 , 106 )

;

-- The above query will write all the detailed records to the sort work file but the integer keys at the front of the record will
-- be set to the correct aggregate key.

-- So now the records simply need to be summed as follows:

insert into xxxxx.dbo.z01_vf_sale_txn_03_swk2
(
pk_aggregate_number
,pk_z01_vm_day_01
,pk_z01_vm_product_01
,pk_z01_vm_customer_01
,pk_z01_vm_customer_demographic_01
,pk_z01_vm_geography_01
,pk_z01_vm_currency_01
,pk_z01_vm_sale_txn_type_01
,pk_z01_vm_sale_txn_status_01
,pk_z01_vm_unit_of_measure_01
,sale_extended_amount
,cost_extended_amount
,tax1_extended_amount
,tax2_extended_amount
,discount_extended_amount
,sale_units
,number_sales
)
select
pk_aggregate_number
,pk_z01_vm_day_01
,pk_z01_vm_product_01
,pk_z01_vm_customer_01
,pk_z01_vm_customer_demographic_01
,pk_z01_vm_geography_01
,pk_z01_vm_currency_01
,pk_z01_vm_sale_txn_type_01
,pk_z01_vm_sale_txn_status_01
,pk_z01_vm_unit_of_measure_01
,sum(sale_extended_amount)
,sum(cost_extended_amount)
,sum(tax1_extended_amount)
,sum(tax2_extended_amount)
,sum(discount_extended_amount)
,sum(sale_units)
,sum(number_sales)

from xxxxx.dbo.z01_vf_sale_txn_03_swk1
group by
pk_aggregate_number
,pk_z01_vm_day_01
,pk_z01_vm_product_01
,pk_z01_vm_customer_01
,pk_z01_vm_customer_demographic_01
,pk_z01_vm_geography_01
,pk_z01_vm_currency_01
,pk_z01_vm_sale_txn_type_01
,pk_z01_vm_sale_txn_status_01
,pk_z01_vm_unit_of_measure_01
;

Now, in sort work 2 you have the multiple levels of summary fact records that are the result of the current cycle of processing.

Where the key combination does not exist in the target summary fact table you can perform an insert.

Where the key combination does exist in the target summary fact table you need to consolidate the data in the target summary fact table with the data in the sort work 2 table and then perform an update back in to the summary fact table.

Those operations are so simply they don’t really justify a spot on the blog.

This new invention, at least I have never seen of it or heard of it before, means this.

It means that anyone can build multi-level summary fact tables using SQL using #SeETL for free as an open source tool.

This means that if you are having performance problems with your data warehouse and you would like to have more summary fact tables? Or if you have lots of cubes for your summaries and you would like to have summary fact tables to load your cubes? Or you would like to get rid of some of your cubes?

You can copy this model of processing and save your self a massive amount of query processing.

For those of you who have MicroStrategy? The multi-level models we implement based on Ralphs good help can be used, as is, with MicroStrategy. All we do is put views over the tables to create the needed MicroStrategy hierarchy for the dimension tables and we also create views over the summary fact table and alter the key column names so that the MicroStrategy schema designer can properly link up the views to the correct levels of the dimension hierarchy. Ok?

Well, Ladies and Gentlemen?

This has been one of the most EPIC posts I have ever done!

This “problem” of how to do multi-level summary fact tables in SQL has been playing on my mind for 9 years now and I could not figure out how to get it to work!

Now that I have gotten it to work and #SeETL is free?

I am pleased to put it “out there” because the more people who adopt the idea of multi-level summary fact tables the better our delivery capability to our customers will be.

Multi-level summary fact tables give you a BIG query performance boost for almost zero cost in disk. And now that the ETL for it is just SQL and the updates are incremental per batch cycle? There is very little extra processing required to create and maintain multi-level summaries inside the database. If you are a MicroStrategy user you are in luck because this is exactly the sort of data model MicroStrategy need, without all the extra code of one mapping per dimension level and one mapping per summary level.

So have fun working on implementing your first multi-level summary fact tables!

If you would like me to implement yours for you?

I am currently (2019-09-03) charging EUR50/USD60 for working from my home office.

I sincerely hope that rate will go back up to where it really belongs as I get busier!

Best Regards

Peter Nolan


Download #SeETL

IBI On You Tube


Carphone Warehouse Reference Video:


Comments

comments

LEAVE A REPLY

Please enter your comment!
Please enter your name here