I have not been posting for a while because I was a little disappointed that when I posted my series on how to ensure BI Projects were successful I received very little response.
I was pretty surprised that there was so little interest in making BI projects successful given how many of them fail. All those failures give those of us who deliver successful projects a bad name.
So today I was thinking about some work I did on the weekend around solving the “updated transaction” problem and I dropped in on my blog here for the first time in ages.
I went to check out the post on how to build a staging area because I was not sure what I had in it. I mean this post:
Much to my surprise the post has had 608 views and the video itself has had 3,211 views!
And yes, that was MUCH to my surprise.
I have been doing internet marketing for a while and I am learning about how many hits a video can get if you do it right. But even I am surprised by 3,211 hits on a video as pedestrian as “Creating Staging Areas” for a data warehouse!
So I am much encouraged to put out some more blogs and over the weekend I have had an idea that I will flesh out in the coming weeks to create much more content. So look forward to that!
Today’s blog post is about one of the more pesky problems in a BI system.
This is the problem where a transaction that is not supposed to be updated is actually updated by the operational system, and how that affects summarized data.
We all know that to get the performance you need to get for dashboards, drillable reports, and all those nice sexy things that users like to see, you have to have summarized data somewhere.
Whether it is in a database or a cube or a cache, you know you need it.
However, when one, or hundreds, of these pesky transactions are updated by the operational systems and then sent into the ETL processing with different numbers that affect the summaries, you have a problem. And this problem has been around in BI since day 1.
How do I adjust my summary level data to accept a change in the numerical values on a transaction, so that the summaries stay in sync with the detailed records, without having to re-summarize everything in each ETL cycle?
Good question and one that I was working on this weekend for a client.
A little while ago we invented a new way of generating staging areas for a client.
Namely, we gathered up all the data about the operational system, in this case Microsoft's NAVision ERP, and we put it into the SeETL Dictionary.
We then placed a series of views over the dictionary tables, and from this we were able to 100% generate all the tables, indexes, views and SQL statements to extract data from NAVision and place that data into the staging area.
We created this solution and generated the objects and code in about 2 weeks of effort.
The only tricky piece was writing functions to translate NAVision table and column names into table and column names that would not need square brackets in the staging area and data warehouse.
Yes, you read that correctly, in less than 2 weeks of effort we were able to create a full staging area for an implementation of Microsoft's NAVision ERP.
This was 1,800 tables and 35,000 fields.
It actually took us 2 more weeks to run it for the first time because there were a couple of billion rows in the production NAV system and it took a while to bring them all across.
So just as a side comment, if you are in the market for building a staging area, either a new one or replacing an old one, SeETL now has the tools to do that automatically, which is a big improvement on the video on the blog post above.
So this weekend I was thinking and this question occurred to me:
Is there any way to leverage the new way we build staging areas using SQL to solve the “updated transaction” problem?
And it turns out there is.
In the new generated SQL delta detection subsystem we created a series of tables and views and then SQL to operate on them.
We create a “minus 1” table which is suffixed _m1. This table contains the prior version of the data. It is only needed if you want to automatically detect deletes and there is no easier way to detect them than the brute force approach of comparing yesterday's data with today's data. Not an elegant approach, but it does work.
We can also create a view over the staging table which we call _m2 if we do not want to retain a full copy of the source system in the _m1 tables and we do not want to detect deletes automatically.
We also created a “deltas detected” table which we suffix _d1.
This table contains the data that is determined to have been inserted, updated, or deleted in today's version of the data compared to yesterday's version.
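As a sketch of the idea (not SeETL's generated code), the brute force compare of yesterday's _m1 copy against today's extract can be expressed like this. The row shapes and sample values are invented for illustration:

```python
# Brute force delta detection: compare yesterday's copy (_m1) with today's
# extract, keyed on the primary key, and classify each row as an insert,
# update, or delete destined for the _d1 deltas table.
def detect_deltas(m1_rows, today_rows):
    """m1_rows / today_rows: dicts mapping primary key -> row tuple."""
    deltas = []
    for pk, row in today_rows.items():
        if pk not in m1_rows:
            deltas.append(("insert", pk, row))
        elif row != m1_rows[pk]:
            deltas.append(("update", pk, row))
    for pk, row in m1_rows.items():
        if pk not in today_rows:
            deltas.append(("delete", pk, row))
    return deltas

# yesterday's _m1 copy and today's extract (hypothetical rows)
yesterday = {1: ("widget", 2, 100.0), 2: ("gadget", 1, 40.0)}
today     = {1: ("widget", 2, 80.0),  3: ("gizmo", 5, 250.0)}
print(detect_deltas(yesterday, today))
```

Transaction 1 comes out as an update, transaction 3 as an insert, and transaction 2 as a delete, which is exactly the classification the _d1 table holds.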
Now, with Microsoft's NAVision product there is a timestamp column which is reliable and can be used to perform extract processing. This is what we are using with this client.
There are 7 steps of SQL generated in the delta detection process. The second step of the SQL detects rows that are updated in some way since the prior run of the ETL.
It turns out that if we create another table, which we call the “reversals” table and suffix with _r1, we can go to the staging area and retrieve the row as it was before the update in NAVision and send it to the _r1 table.
This simply retrieves the transaction row from the staging area before the delta detection process delivers the new row that has just arrived from the NAV system. A very simple piece of SQL to write because it is already written automatically. There are just a few table name changes needed.
Further, of course, we have already written the ETL processing for the transaction table to manage the updates into the detailed level tables and the summaries using SeETL.
Well, it turns out that if we just replicate the mapping and re-use the summarization processing, we can send the old record into the summarization processing again, reversing the sign on the amount columns. This will negate the old row in all the summary records, and then the new row will flow into the detailed transaction table as an update against the old record and voila! You have maintained the integrity of the detailed transactions and the summaries, even in the pesky case of an update to a transaction.
So let’s talk about an example.
Let’s say we have a sales record where we sold 2 widgets for USD100, being USD50 each.
Let’s say this was incorrect.
Let’s say that the ERP allows for an update to the transaction record rather than creating a reversal record and a new sales transaction record. It does happen.
So what we get from the ERP is just an alteration to an existing record where the USD100 changes to USD80.
If we just performed the update on the vf_sale_txn table the transactions would be correct.
However, the USD80 sale will flow into the vf_sale_txn_summary table and contribute to all levels of summaries, so those summaries will count both the USD100 revenue from the prior record and the USD80 from this updated record, double counting the sale.
So, in the delta detection process we can see that the record has changed, though we don’t know WHAT has changed. So we can copy the staging area record to the reversals working table, _r1, and that staging area record will have USD100 on it because it has not yet been updated by the delta detection processing.
We then flip the signs on the USD100 record to show -USD100, -2 items, and -1 sale.
We run this through the attribution processing to get the keys and the aggregation levels.
We do NOT apply it to the detailed vf_sale_txn table because the record that is coming will perform an update.
And we then run the attributed record through the aggregation processing, which will create all the rows needed to reverse the sale of 2 items for USD100 from all levels of aggregation that it contributes to.
Then we run the consolidation process, which will find each aggregate record and reverse out the 2 items, USD100 revenue, and 1 sale.
And finally we apply the updates to each summary level.
Then the new row, which is 2 items, USD80, and 1 sale, will flow through into the detailed fact table vf_sale_txn and contribute to the summaries, and the vf_sale_txn_summary table will be adjusted accordingly.
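The whole worked example can be replayed as a toy Python sketch. The summary structure and column names here are illustrative only, not SeETL's actual processing:

```python
# Replay of the USD100 -> USD80 example: the summary stays in sync because a
# sign-flipped copy of the old row goes through the same summarization step
# before the updated row arrives.
def apply_to_summary(summary, row):
    # summary accumulates (items, revenue, sale_count) per product
    key = row["product"]
    items, revenue, sales = summary.get(key, (0, 0.0, 0))
    summary[key] = (items + row["items"],
                    revenue + row["revenue"],
                    sales + row["sales"])

summary = {}
old_row = {"product": "widget", "items": 2, "revenue": 100.0, "sales": 1}
apply_to_summary(summary, old_row)          # original load: 2 items, USD100

# reversal record: same row with signs flipped (-2 items, -USD100, -1 sale)
reversal = {k: (-v if isinstance(v, (int, float)) else v)
            for k, v in old_row.items()}
apply_to_summary(summary, reversal)         # negates the old contribution

new_row = {"product": "widget", "items": 2, "revenue": 80.0, "sales": 1}
apply_to_summary(summary, new_row)          # corrected transaction: USD80

print(summary)  # {'widget': (2, 80.0, 1)}
```

The summary ends at 2 items, USD80, and 1 sale, matching the updated detail row, with no re-summarization of anything else.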
Now, I am sorry I cannot do a video on this and show you it actually happening on the screen in an example.
The reason is that I have obviously signed a non-disclosure agreement with the client and I cannot show you what we have done for them without breaching that agreement. Or, at the very least, leaving way too many clues as to who the client is in the video.
So in this case you will just have to trust me that we have done this and it works fine! LOL! OK?
What we are actually working on is building a generalized BI Solution on top of Microsoft NAVision for one of my SeETL customers, which they will resell. And we have a development client who is a NAVision customer as our combined development partner.
My client is committed to MicroStrategy, so we are building a MicroStrategy front end for this product. Of course, we have to do all the steps in order, which is why we built the production staging area using SeETL's new Delta Detection SQL Generation Capability first. And now we are working out the kinks in building the BI4ALL data model for the data warehouse and marrying that up to MicroStrategy.
Just as a bit of a footnote.
We are not doing all this development on the SeETL code base that I support out to my customers.
My client is a BI Consulting company and has decided to adopt SeETL and rebrand it for their own company. They have a branded version of the source code that their staff can maintain.
So we are doing all this testing on their baseline source code and this will be retro-fitted into a later version of SeETL.
For those of you who do not know, I also sell SeETL to smaller BI Consulting companies, and they can rebrand the software as their own XYZ brand name and use it as part of their sales and consulting cycles.
This gives my clients “instant credibility” because they can talk about “their” ETL software, “their” data models, and “their” experience, all of which comes via myself.
This adoption of SeETL and BI4ALL allows small BI consulting companies, with limited experience and references, to adopt my experience and references as “their own” in the sales cycle and this gives them the chance to compete with the “big players” in the BI market place.
Further, the adoption of SeETL allows smaller consulting companies to enjoy as much as a 50% reduction in the cost of ETL development. If the smaller consulting company also uses BI4ALL for data models, the cost reduction in the combined ETL and data model development can be up to 80%, depending on many factors.
This means my clients, who are generally smaller BI consulting companies, are able to compete with the “big players” both on functionality and on lower pricing than the “big players” are willing to offer.
In summary, the “updated transaction” problem has been with us from day 1 in Business Intelligence.
We have had to code around it to make sure the summaries stay in sync with the detailed transactions for a very long time. None of the “work arounds” has been particularly simple.
With SeETL's new Delta Detection SQL Generation Capability we are able to deal with the “updated transaction” problem very simply and very easily. It is somewhere in the order of 2 or 3 hours of work to create and test the few objects and pieces of code that are needed and put them into production.
Please note this solution only works when using the older SeETL C++ ETL engine and it is NOT supported in the newer Generated SQL version of SeETL.
In fact, in this project to build a BI Product for NAVision we have found that the older SeETL C++ ETL engine is so far ahead on functionality that we have decided to go with it for our new product rather than go with the newer SQL Generation Engine.
The amount of work needed on the SQL Generation Engine to get it up to the same functionality as the older SeETL C++ engine is very large and we just do not see the cost benefit in doing so at the moment.
If we had a LOT of customers demanding that upgrade that would be different!
Thank you very much for reading this blog post!
I really appreciate your time and attention!
I was thinking about this post this morning and actually had to go back to the code to remember what I did. What I did was copy the delta detection code and send the data into an _r1 table, which I called a reversals table. This is the extra statement that is needed to store reversals, which are then sent into the ETL job stream as per normal.
You place this statement in the ETL stream after the first 3 steps of the Delta Detection Processing. Then, once this statement runs, the next 4 statements that apply changes to the staging area can be run. The _m2 view is a view over the current staging area. So what this statement does is retrieve the staging area row from “yesterday” that is about to be updated.
This row can then be used to generate the reversal transactions. Simple!
-- note: the select list and the start of the from/where clauses were lost
-- from the original post; they are reconstructed here from the description above
truncate table XXXXX.dbo.zxt_nav_trans_sales_entry_r1

insert into XXXXX.dbo.zxt_nav_trans_sales_entry_r1
select zxt_nav_trans_sales_entry_m2.*
from XXXXX.dbo.zxt_nav_trans_sales_entry zxt_nav_trans_sales_entry_src
   , XXXXX.dbo.zxt_nav_trans_sales_entry_m2 zxt_nav_trans_sales_entry_m2
where zxt_nav_trans_sales_entry_src.pk_cust_number = zxt_nav_trans_sales_entry_m2.pk_cust_number
and zxt_nav_trans_sales_entry_src.pk_ss_number = zxt_nav_trans_sales_entry_m2.pk_ss_number
and zxt_nav_trans_sales_entry_src.pk_store_no = zxt_nav_trans_sales_entry_m2.pk_store_no
and zxt_nav_trans_sales_entry_src.pk_pos_terminal_no = zxt_nav_trans_sales_entry_m2.pk_pos_terminal_no
and zxt_nav_trans_sales_entry_src.pk_transaction_no = zxt_nav_trans_sales_entry_m2.pk_transaction_no
and zxt_nav_trans_sales_entry_src.pk_line_no = zxt_nav_trans_sales_entry_m2.pk_line_no
and not (zxt_nav_trans_sales_entry_src.timestamp_col = zxt_nav_trans_sales_entry_m2.timestamp_col or ( zxt_nav_trans_sales_entry_src.timestamp_col is null and zxt_nav_trans_sales_entry_m2.timestamp_col is null))
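To see the same reversal-capture pattern in action without the NAV schema, here is a small sqlite3 sketch. All table names, columns, and rows are invented stand-ins: _src plays the freshly extracted NAV data and _m2 the view over the current staging area.

```python
# A sqlite3 stand-in for the reversal-capture statement: before the staging
# area is updated, copy the row "about to be updated" (found because its
# timestamp differs from the incoming version) into the _r1 reversals table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table stg_sales_src (pk_transaction_no int, pk_line_no int,
                            amount real, timestamp_col int);
create table stg_sales_m2  (pk_transaction_no int, pk_line_no int,
                            amount real, timestamp_col int);
create table stg_sales_r1  (pk_transaction_no int, pk_line_no int,
                            amount real, timestamp_col int);

-- today's extract from the source: transaction 1 was updated upstream
insert into stg_sales_src values (1, 1, 80.0, 20), (2, 1, 50.0, 11);
-- the current staging area (_m2) still holds yesterday's values
insert into stg_sales_m2  values (1, 1, 100.0, 10), (2, 1, 50.0, 11);
""")

con.execute("delete from stg_sales_r1")
con.execute("""
insert into stg_sales_r1
select m2.*
  from stg_sales_src s, stg_sales_m2 m2
 where s.pk_transaction_no = m2.pk_transaction_no
   and s.pk_line_no = m2.pk_line_no
   and not (s.timestamp_col = m2.timestamp_col
            or (s.timestamp_col is null and m2.timestamp_col is null))
""")
print(con.execute("select * from stg_sales_r1").fetchall())
# [(1, 1, 100.0, 10)]  -- the pre-update row, ready to be sign-flipped
```

Only transaction 1 lands in _r1, carrying its old USD100 value, because its timestamp no longer matches the incoming version; the unchanged transaction 2 is left alone.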