Hi! Thanks for dropping by my blog! It is much appreciated.
In my last blog post, which was a while ago now, I said:
“Today, with SeETL and BI4ALL we say “take everything in to the staging area”.
Why? Because the cost of disk/CPU is negligible. And with SeETL the cost of ETL to get data into the development staging area is also negligible. Once all data that MIGHT be useful is in the staging area it is much easier to decide what will go through.
On projects in recent years we have taken ALL data that is not purely operational data into the base layer of BI4ALL.
In the next blog post I will talk about how SeETL and BI4ALL now support the path to the “Vision for Outstanding Success” in a much better way than we could in the past, and usually in a much better way than anyone else does it today. ”
So today we are going to talk about BI4ALL and SeETL and why they are so effective for guaranteeing Outstanding Success in a Business Intelligence project. If you want to read details about these two products? Please just go to our videos page and downloads page.
There are hours of video and hundreds of pages of documentation on these two products.
This discussion will teach you how to all but eliminate the possibility of a data warehousing failure if you will listen. Ok?
SeETL – Why Is SeETL So Important?
Firstly, one of the most important things to do to guarantee that a data warehousing project is successful is to move ALL the data that MIGHT be useful into the staging area. Not just the data that is KNOWN to be useful. All the data that MIGHT be useful.
When in doubt? Take the data and put it in the staging area.
Now. Why would I say this?
True story.
In April 1991 I was in a meeting with the Director of Marketing for the Mutual Life Company (MLC) in Australia. I had been assigned to the account just a few weeks earlier. As it happened I knew the woman who was about to present. Her name was Ingrid Brady and my wife actually worked in her team at IBM.
Ingrid had been asked to present a new product called IBM’s Data Interpretation System to the Director of Marketing, one Chris Pender. This was the IBM version of the famous Metaphor product. Metaphor was a company Ralph Kimball co-founded. Quite a heritage.
Via my wife, Ingrid knew that I had just been assigned to the MLC account. So she suggested I attend and the Account Manager, one Jim Groom, agreed.
In 1991 I was 27 and a very fresh-faced new Systems Engineer just out of IBM Marketing School. I was told to sit and listen and not say a word. You can’t make a mistake if you don’t speak, right?
When asked what he really wanted to do Chris had this to say.
“We have 1 million customers and 1.1 million in force policies. Yet we sell over 500 products. Our cross sell ratio is terrible. What I want you guys at IBM to do is to help us figure out how to improve our cross sell ratios.”
He went on to say:
“I want to be able to ask any question that can possibly be answered by all the data you guys have in your computers. I don’t know what questions I am going to ask. I don’t know when I will ask them. But I do know I want the answer before I forget why I asked the question.”
When pressed with “Well, how long does it take for you to forget why you asked a question? What sort of response time do you have in mind?”
Chris looked kind of thoughtful and said “If he (meaning his data analyst) can finish his cup of coffee before the computer answers the question, that’s too long. So I guess we are talking about 5 minutes.”
Of course, as someone who had been in IT for 9 years at the time, I knew the idea of being able to answer any question one might think of from millions and millions of rows of data in a DB2 database in 1991 was a pipe dream. But I was not about to say that in front of such an important customer. This was a major IBM account!
What happened next changed my life.
The Account Manager, Jim Groom, asked Chris if he was willing to pay IBM AUD30,000 to do this work over the next 3 months. Chris agreed.
Jim, sitting next to me, then slapped me on the back and said those fateful words that have rung in my ears a million times since.
“And Peter Nolan is our expert in this area, he will make sure you get what you want Chris!”
Me? “Expert”? I had never even SEEN this software before!
That will teach me for going to a customer sales pitch! Right?
And the rest, as they say, is history.
But in 1991? How do you meet a customer requirement of:
“I want to be able to ask any question that can possibly be answered by all the data you guys have in your computers. I don’t know what questions I am going to ask. I don’t know when I will ask them. But I do know I want the answer before I forget why I asked the question.”
I am pretty sure that most of you who have data warehouses can’t do this today, 24 years later.
How many of you can truly answer any question you can think up inside 5 minutes? Not many I would wager. But that’s what we did.
Inside 6 months, after one false start with 3NF modelling, I had invented a new way to store data in DB2 such that pretty much every question the data analyst could dream up had a response time of less than 10 seconds.
About 90% of all queries happened in less than 3 seconds. About another 5% took less than 10 seconds. And about 5% of queries ran for anything up to 5 or 10 minutes. But almost NOTHING ran longer than 10 minutes. This was compared to the failed 3NF effort where EVERY question took an hour to answer!
One of the BIG mistakes we made was to be selective about the data we took into what would now be called a “staging area”. Of course, with disk being USD20,000 per GB in those days, this “staging area” was on tapes.
Once we realised this mistake, we took ALL the data we could lay our hands on and put it onto a set of tapes. We then did sort merges of the tapes to create one massively long record for each policy the company had. From this one long record we could load up any data that we wanted from the tapes.
Right then we all agreed. You have to take ALL the data you can lay your hands on and get it on to tapes where you can read it.
Well, of course, 24 years later, disk is a LOT cheaper than USD20,000 per GB. Disk is, for all intents and purposes when measured against corporate incomes, free.
And you should rest assured that the process of taking ALL the data has been absolutely critical to the outstanding success of my projects.
Lots of times my clients have tried to tell me that I must be crazy and that all the data is not needed. Pretty much every one of them has commented later on how changing requests from the business made that decision one of the best decisions I forced them to make.
The thing is this. The marketing guys really DO NOT KNOW what questions they might need to ask in order to move the business forward. It is not a case of “they should know” or “they could know”. It is a case of THEY REALLY DO NOT KNOW and it is RIGHT that they do not know. So, in the face of that, the marketing people need to be given a database that can answer ANY QUESTION…fast.
My findings were confirmed in 1995 when Bill Inmon did his first seminar in Australia, which I attended. He was explaining his proposed idea of “archival data models”. I had read about these in his book but the idea had not really sunk in.
What Bill was talking about, of course, was taking a snapshot of the data every day so that an analyst could ask not just any question of any piece of data; he could ask any question of any piece of data that had ever existed in the company, for every value it had ever had, based on daily snapshots.
This was a VERY radical idea for 1995. The volume of disk we were talking about was unimaginable for large companies. It would mean banks taking a copy of every bank account every day and storing the values of every field in every account, every day!
Unimaginable!
But the versatility and value of such a data model was without question. It was, of course, much more robust and versatile than dimensional models. I was working on my first proper star schema model at the time and this new idea from Bill was stunning to say the least.
Now…Why talk about all that?
Well? Because SeETL makes it VERY easy to take ALL the data and put it into the staging area.
SeETL can read all sorts of file formats because it is written in C++. If you have a file format that is not supported? No problem. We just break out the C++ compiler and write what is needed.
If the files coming to you do not have field headers? Then you can load the files into tables based on column positions. Of course, if you do have headers then you can load the files based on column names. The tables do not need to be replicas of the incoming data in order to load it.
SeETL also defaults the extra 4 fields that are placed on the back of each record.
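Just to make the loading idea concrete, here is what landing a headerless, pipe-delimited file into a staging table looks like in plain PostgreSQL. This is for illustration only. SeETL’s own loaders are driven by the workbook and handle far more formats than this, and the table and file names below are invented for the example.

    -- Illustrative only: a headerless flat file loaded into a staging table,
    -- with columns matched purely by position because there is no header row.
    CREATE TABLE stg_customer (
        cust_no        varchar(20),
        cust_name      varchar(100),
        phone_number   varchar(40),
        email_address  varchar(200)
    );

    COPY stg_customer FROM '/landing/customer_20150601.dat'
    WITH (FORMAT csv, DELIMITER '|', HEADER false);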
SeETL can do inserts and updates on the staging area.
SeETL can generate “delta detection sql” so that if you are not getting deltas properly from your operational systems or other data sources, you can do your own delta detection. It is especially important to get this working properly for deletes. Many people forget all about deletes when they are taking data into a data warehouse.
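To give you a feel for what “delta detection sql” means, here is a minimal hand-written sketch in PostgreSQL-flavoured SQL. The tables stg_customer_today and stg_customer_yesterday and the business key cust_no are invented for the example; the SQL that SeETL actually generates will look different.

    -- New or changed rows: in today's copy but missing or different yesterday.
    SELECT t.*
    FROM   stg_customer_today t
    LEFT JOIN stg_customer_yesterday y ON y.cust_no = t.cust_no
    WHERE  y.cust_no IS NULL
       OR  t.cust_name     IS DISTINCT FROM y.cust_name
       OR  t.phone_number  IS DISTINCT FROM y.phone_number
       OR  t.email_address IS DISTINCT FROM y.email_address;

    -- Deleted rows: present yesterday, missing today. Easy to forget!
    SELECT y.*
    FROM   stg_customer_yesterday y
    LEFT JOIN stg_customer_today t ON t.cust_no = y.cust_no
    WHERE  t.cust_no IS NULL;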
Because the staging area can now be placed in a relational database you can query it. Sure, the performance might be very poor because the staging area tables should be very close in structure to the operational system tables to make auditing easier. But at least you can get an answer out of the staging area even if it takes a long time.
The tools included in SeETL that let you load your data into your staging area very quickly mean that you can use the features of the RDBMS to perform your data profiling and get to know and understand the data. They also mean you can see full copies of your data, so you can find all the data anomalies during testing.
Back in the days when we could not afford to put all the staging data into an RDBMS and analyse it, the “data anomalies” often did not show up in the test data.
I still recall one time, when I was working in a telephone company, we got errors as we tried to load up production data. Sure enough, in some of the “telephone number” fields there were email addresses. These were very few and far between, which is why they did not show up in our test data. And the fact they were there surprised even the customer. Obviously, we had a real problem having email addresses in telephone number fields.
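As an illustration of the kind of profiling you can do once the staging data is sitting in an RDBMS, a few simple queries would have caught that anomaly straight away. The table and column names below are made up for the example.

    -- Profile one column of a staging table: row counts, nulls, distinct values,
    -- value lengths, and a quick check for email addresses in a phone number field.
    SELECT count(*)                         AS total_rows,
           count(phone_number)              AS non_null_phone,
           count(DISTINCT phone_number)     AS distinct_phone,
           min(length(phone_number))        AS shortest_value,
           max(length(phone_number))        AS longest_value,
           sum(CASE WHEN phone_number LIKE '%@%' THEN 1 ELSE 0 END)
                                            AS looks_like_email
    FROM   stg_customer;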
As far as I am aware, SeETL is the only tool that makes it so fast and so easy to get data from wherever it is, in whatever format it is in, into an RDBMS to begin the process of analysing and understanding the data.
The speed of doing this is an order of magnitude faster than Informatica or DataStage were back when I was using those tools on a regular basis. I doubt that has changed, because of the nature of those two tools.
So the #1 feature of SeETL that has made it such a valuable tool in enabling me to deliver outstanding success for data warehousing projects again and again is the ease and speed of getting data into the staging area to analyse.
Data analysis and learning the data is one of the most difficult and important jobs of the data modeller. If you don’t understand what the data means and how it is structured, you can’t put it into the target data warehouse sensibly.
BI4ALL – Why Is BI4ALL So Important?
Back in 1995 I did a banking data warehouse for my old pals at IBM. This was for a banking system being developed by IBM called RB 2020, and it was to be my second true star schema project. As we developed the data models I noticed that their structure was exactly the same as the structure of the models for the MLC project I had just completed.
I could see that if I had some generic code that populated fact and dimension tables I could cut the development time substantially. The MLC project had taken us 18 months to build a very modest prototype star schema data warehouse. Now we were going to do a full data warehouse for a banking system. We needed all the help we could get!
By 1997 I was working for Hitachi Data Systems and doing another large insurance company, Manulife, in Hong Kong. Again I was struck by the similarities in the data models. By this time we had the forerunner of SeETL working and we were able to generate the COBOL code needed to populate tables at the push of a button. This is one of the reasons I was so successful at Hitachi.
At that time we were very successful. I approached my boss and asked if I could re-invest about AUD50,000 of our profits into developing an industry standard data model for insurance. I felt that, with two large insurance companies under my belt, I could develop a template data model for insurance companies. I didn’t get the AUD50,000 approved and it was not long before I moved on from Hitachi.
In my next job at PricewaterhouseCoopers we actually had a very reasonable template data model for telco companies called “Knowledge Core”. My old mentor from Metaphor was at PwC. Indeed, he had spent 3 years trying to get me to join them in Australia! The work he had done was first class and we had very reasonable standard models. Just no ETL to load them.
Again, it was obvious to me that standard models were the way to go. If we just had a set of standard models we would solve a LOT of problems. But my time at PwC was quite short lived and so we never did get to really flesh out those models.
If you read the history of BI4ALL on this page you will see how it came to be developed.
What BI4ALL does for our projects is to tell us about 95% of what should be in the target data model and how it should be put in there.
It also tells us all the design techniques that are necessary to implement the data warehouse.
In my 14 years of putting BI4ALL-style models into production I have only managed to come up with ONE innovation in design techniques.
Count them.
ONE!
The prior versions of the models were put into more than 100 companies. So we know that if we come across a problem at a client that we have not seen before, the odds are there is already a technique in the BI4ALL data models to handle it. The likelihood that we will find a problem we do not have a design technique for is very, very low. Though we do look out for one, because finding one would be kind of cool.
The ONE innovation I came up with was back in 2002. So it has now been 13 years that I have been installing these models without a SECOND innovation in implementation techniques. That is pretty amazing.
So with BI4ALL as your toolkit? You KNOW that you have every technique you could ever need to implement this style of data warehouse. You KNOW there is nothing missing. And that gives you a lot of confidence. Certainly when I walk in to any client with BI4ALL as my data modelling toolkit I am as confident as they come. Even arrogant. I know that I have got this client covered.
The next big feature of BI4ALL that supports the guarantee of outstanding success is the sheer coverage of the models now. I mean, they do not cover everything, but they cover everything important, especially in telcos.
I recently did a job for a major telco who will remain nameless for the obvious reason that they had just spent a year and USD2 million on a failed data warehouse project. My job was to prepare the input for the tender document to be issued to 4 major players to do the job properly this time.
As part of this I used the BI4Telco model as a tool to get from the business analysts what data they felt they needed in their new data warehouse. The BI4Telco models, out of the box, covered about 95% of the data that they said they needed. The area that was not well covered was data from external parties that the telco wanted to include. The data analysts in the telco could not believe a data model like BI4Telco even existed.
It was to be used by the telco to score each of the 4 vendors’ proposed telco data models. Their prior data warehousing consultants had tried to build the data model from scratch. Always a mistake these days.
BI4ALL has excellent coverage and design techniques in campaign management, sales and marketing, customer profiling, and customer segmentation.
These are all very high value, high profit contribution areas.
Way back in 1991 the problem statement was:
“We have 1 million customers and 1.1 million in force policies. Yet we sell over 500 products. Our cross sell ratio is terrible. What I want you guys at IBM to do is to help us figure out how to improve our cross sell ratios.”
And for much of the last 24 years I have been figuring out how to “sell more stuff”.
We even developed a 2 year “road map” for insurance companies. We used to walk into insurance companies with a 6 phase, 2 year roadmap and say:
“This is what we think you should do over the next 2 years, and if you do? You will be millions of dollars better off.”
It was a very compelling presentation at the time and we won a lot of insurance companies with it. And we did not even have a standard model at the time!
So, to be able to walk into a retailer, a telco or a media company and say:
“We have your data warehouse data model about 95% done. The other 5%? We can work that out.”
This is still a very compelling pitch.
Why?
Because 50% of data warehouse projects still fail, and one of the biggest points of failure is the data model. The data model gets designed by “committee” and comes out the other end of that process looking like a camel and not a racehorse. It has often been said that a camel is a racehorse designed by committee.
Anyone who is building their data warehouse data model from scratch today is running a very large risk that is not necessary to run.
Sadly. Many of the companies that offer “template data models” are offering garbage that does not work.
Just in case you are considering buying a data model? If you do not want to buy ours, fine. We would be more than happy to review and critique any data model you are considering buying and let you know our honest opinion of it.
There is no one else in the world who will give you a better critique of a data model for a similar price. And that would be money very well spent, given that you are going to have that data model longer than most marriages last today.
In summary, BI4ALL is critical for us to deliver on our guarantee of outstanding success because:
- It embeds all the data modelling techniques we need.
- It covers about 90-95% of the data that our customers need to have in their data warehouse.
- It forces us to capture ALL the data that is necessary to have a very high value data warehouse that can be used by sales and marketing to drive cross sells and up sells thereby making lots more money for the company.
This removes a LOT of room for errors. Errors that might cause failure. Errors that, when removed, almost guarantee success.
BI4ALL is the key to success that allowed us to go from this best practice in 1996…
Best Practice Since 1996
To this. The New BI4ALL proposition.
The New BI4ALL Proposition
The New BI4ALL Proposition is to take ALL the data into the data warehouse, store it in star schema models, and emulate the ability to archive all data as per Bill Inmon’s 1995 seminar. This is why I see no conflict between what Bill and Ralph have talked about for so long. I have been delivering projects based on the best of the ideas of both men since 1996. BI4ALL makes that faster, cheaper, more robust and more sustainable than what we were doing in 1996.
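To make the “archive everything” idea concrete, here is a minimal sketch of a daily snapshot table in plain SQL. The table and column names are invented for the example, and the real BI4ALL models are far richer than this; the point is simply that one row per account per day lets you answer any question about any value the account has ever had.

    -- A daily snapshot: one row per account per day.
    CREATE TABLE account_daily_snapshot (
        snapshot_date  date         NOT NULL,
        account_key    integer      NOT NULL,
        balance        numeric(18,2),
        status_code    varchar(10),
        PRIMARY KEY (snapshot_date, account_key)
    );

    -- Run once per day, after the staging load, to capture that day's values.
    INSERT INTO account_daily_snapshot (snapshot_date, account_key, balance, status_code)
    SELECT current_date, account_key, balance, status_code
    FROM   stg_account;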
I have seen no better way to design a database than how BI4ALL works today. Dan’s Data Vault is a different idea to BI4ALL and the two would be quite complementary, should anyone with a Data Vault like to use BI4ALL in conjunction with it.
Next we are going to get back to the discussion where the rubber meets the road.
Getting the data into the BI4ALL Models
Back to Talk About SeETL Some More
Now. Of course, you are thinking:
“That’s all very good Peter, but another BIG reason that data warehouse projects fail is that it is not possible to get the data model updated inside the batch window, so companies cut back on what they put into the data warehouse due to ETL processing problems.
It is all well and good to say “take all the data” but if you can’t get it into your data warehouse? What good is that?”
Well? I am the absolute KING of “make it go faster”. My history of performance tuning large systems goes all the way back to the 80s. There are few people who know how to “make it go faster” better than I do. And most of them are retired now!
SeETL comes in 2 flavours. The C++ engine and the SQL Generation version. The workbook entries for the two versions are identical and so you can actually switch from one to the other.
The C++ version was written as a migration of the COBOL version back in 2002. In 2002 the databases did not have enough power to perform ETL in the timeframes that were necessary.
That all changed for us in 2009 when we met Netezza.
The Netezza database was the first database we found that had the processing power to perform all the ETL inside the database. Indeed, at one telco we did 100% of the ETL processing inside Netezza in code generated by SeETL.
100% of ETL in SQL on Netezza.
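In spirit, “100% of ETL in SQL” means the attribution of facts is just insert/select statements, with the dimension key lookups done as joins, all inside the database. The sketch below is hand-written with invented table and column names; it is not the SQL SeETL generates, just an illustration of the idea.

    -- Attribute a day's call records with surrogate keys and load the fact table,
    -- entirely inside the database.
    INSERT INTO f_call (call_date_key, customer_key, destination_key, duration_secs, charge_amt)
    SELECT d.date_key,
           COALESCE(c.customer_key, 0),       -- 0 = the "unknown" dimension row
           COALESCE(dst.destination_key, 0),
           s.duration_secs,
           s.charge_amt
    FROM   stg_call s
    JOIN   d_date        d   ON d.calendar_date = s.call_date
    LEFT JOIN d_customer    c   ON c.customer_no = s.customer_no
    LEFT JOIN d_destination dst ON dst.prefix    = s.dialled_prefix;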
When our client announced that in a keynote address to the 400-strong audience at the 2010 Netezza users conference, you could almost hear a pin drop. It was a stunning statement. I am pretty sure we were the first people to ever do that. If we were not first then we were not far away from it.
Sadly. Because I took so much business off IBM in the late 90s, IBM likes to make sure they do everything they can to make my life difficult. And when IBM took over Netezza we were sidelined, after 4 years of building that relationship and two successful projects in household name telcos. Thanks IBM!
Even though the SeETL C++ version is a bit outdated now, it is still there and available to be used by customers who do not want to do their ETL inside SQL. When we were using SeETL in telcos like Saudi Arabia Telecom and Orange Telecom Romania, the SeETL ETL ran about 30% faster than we could get the same DataStage jobs to run. That is even with the incredible flexibility built into SeETL’s file IO, which does cost a lot of cycles.
In SeETL C++ we are able to load dimension tables into memory and share those in-memory dimension tables with all the attribution processes that are running. The dimension tables are binary searched for keys to do the matching. We also have techniques to split up the input for the large fact tables, such as call records, and to process them in parallel.
For telcos we typically split each call record file into 100 sub-files and then schedule the processing of the sub-files in parallel groups so that we can max out the CPUs in the machine during the attribution processing. SeETL C++ can flood the CPUs with work in a well-controlled manner so that the machine does not thrash. We have used up to 16 CPUs in this fashion and we have seen linear speed-up for the largest of batch processing needs.
If you are processing anything less than 500 million rows per night? You can absolutely bank on the fact that SeETL C++ will get through that workload in under 8 hours. Over 500 million rows per night? We would wonder what you are doing.
Of course, SeETL C++ must take the data out of the staging area and place it into files, process the files, and then put that data back into the database. This was necessary back in the day because the databases did not have the power to perform the processing. And even if they did, the extra CPU licenses that would have been necessary to run the ETL made the cost prohibitive.
SeETL was selling for USD20,000 per site back in those days which is less than half the cost of one CPU of an Oracle database and about 10% of the cost of DataStage or Informatica.
If you are using a database that has enough power to perform the ETL inside the database for your volumes? That is now a very good way to go.
SeETL generates the SQL from the workbook, places it into the database itself and treats the SQL as just another piece of data. It audits the SQL and always inserts a new version into the database, never deleting the old version. So you have a full record of all the versions of all the pieces of SQL that have ever been used in your data warehouse. An auditor’s dream! And I used to be an IT systems auditor, so I should know!
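As an illustration of that audit trail, the idea is roughly the following. The table and column names here are my own, invented for the example, not SeETL’s.

    -- Every generated statement is inserted as a new version; nothing is ever deleted.
    CREATE TABLE etl_sql_audit (
        sql_name    varchar(100) NOT NULL,
        version_no  integer      NOT NULL,
        created_at  timestamp    NOT NULL DEFAULT current_timestamp,
        sql_text    text         NOT NULL,
        PRIMARY KEY (sql_name, version_no)
    );

    -- At run time, pick up the latest version of a given statement.
    SELECT a.sql_text
    FROM   etl_sql_audit a
    WHERE  a.sql_name = 'load_f_call'
      AND  a.version_no = (SELECT max(b.version_no)
                           FROM   etl_sql_audit b
                           WHERE  b.sql_name = a.sql_name);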
So, with SeETL, because you get the choice of C++ or SQL engines, you can be absolutely assured that you have the tools to get ALL your data from the staging area into the data warehouse inside your batch window. We have never failed in that area, and we don’t intend to start now.
Next…
Of course, there is the actual writing of the ETL to get the data from the staging area into the BI4ALL data models.
We all know that the biggest portion of the development costs for a data warehouse is writing the ETL. We all know that the writing of the ETL has been on the critical path and that we have not been able to really develop the applications and reports until a good portion of the ETL was written.
This is all totally transformed with SeETL. We have a whole new way of working in the development of ETL as depicted in this diagram.
A New Way Of Developing ETL
The BIG breakthrough for SeETL came in 2004 as the result of quite a heated argument. Our “trainee” had been sold off to another project, and the tedious work of cutting and pasting from the spreadsheet, which was the documentation, into the text that was the code to give to the database fell to my DBA.
He was not a happy camper about this. He flatly refused to do this very boring, menial, time consuming job! But, like the great DBA he was, he came up with an idea. And that idea was to save the Excel workbook as XML, read it with VB.NET and generate the code needed.
I wrote the first version of this VB.NET code that weekend and we have been adding to it ever since. It now stands at nearly 80,000 lines of code!
With SeETL we got rid of the GUI for ETL development in favour of Excel. GUIs only slow good people down. When building ETL what you want is speed of development for experienced professionals, not ease of development for novices.
The idea of using novices to build ETL has always been insane and I have said that for more than 20 years now. ETL is far too important to allow novices who have no idea what they are doing to develop it.
When building ETL you need full access to whatever programming tools you are using right there at your fingertips and not hidden behind some GUI or some other programming language as an intermediate step.
Most importantly. When building ETL you want the “documentation” that the data modeller/ETL designer writes to actually be the “source code” for the generation of the ETL subsystem. By doing this you collapse the roles of “data modeller” and “ETL developer” into one person and you get rid of the role of “ETL Programmer”, the person who accepts the ETL specifications and programs them into the ETL tool.
By collapsing these roles and discarding the “ETL Programmer” we got a 50% productivity improvement over the best we have ever been able to achieve any other way. And since ETL development is always a big number in terms of cost? Saving 50% of a big number is always good.
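To make the “documentation is the source code” point concrete, imagine one workbook row that maps a staging column to a dimension column with a simple rule. From rows like that, statements along the following lines can be generated. Both the workbook layout and the generated SQL below are invented for illustration; the real SeETL workbook is documented on the downloads page.

    -- Hypothetical workbook row (shown here as a comment):
    --   target: d_customer.customer_name   source: stg_customer.cust_name   rule: trim + upper
    -- A generator can turn rows like that into statements along these lines:
    UPDATE d_customer d
    SET    customer_name = upper(trim(s.cust_name))
    FROM   stg_customer s
    WHERE  d.customer_no = s.cust_no
      AND  d.customer_name IS DISTINCT FROM upper(trim(s.cust_name));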
Just as a teaser…we have found a way to go about TEN TIMES faster again. We are still working on it and we have not released it. But we were faced with a big challenge and we invented a new technique to develop ETL that is about 10x faster than what we have been doing.
So keep on my email list if you want to be among the first to know about that.
So…
In summary, SeETL is critically important to us to live up to our Guarantee of Outstanding Success on our projects for these reasons.
- It enables us to load the staging area data much faster than any other tools available.
- It provides data profiling tools to help us learn the data and understand the data much faster than anything else we know about.
- It has a C++ and an SQL Generation version so we can pick and choose which version we will use on each project to make sure we are making the ETL system run as fast as possible. We can even mix the two methods on one project if we wish to.
- It has full auditing of the SQL generated in the SQL Generation version.
- All the metadata in the workbooks is loaded into the metadata dictionary.
- The workbooks are the “source code” for the generation of SQL, delivering a proven reduction in ETL development of 50% or more. Direct comparisons with Oracle PS showed us to be 80% cheaper in pure dollars/euros than Oracle PS. That is even though they used India based people to do the work while we were billing more than EUR1,000 a day for me to do the mapping work in the comparison. Even with that “developer in India” vs “EUR1,000+ a day senior consultant” pricing difference, we came out 80% cheaper than Oracle PS. Not bad!
- And…we are working on a new idea that has already proven we can be up to 10x faster at ETL development again. A very exciting prospect if you ask me!
Between BI4ALL and SeETL and the way we approach projects it is well nigh impossible to “fail”.
Over the last 12 years, since we had the SeETL C++ engine, the only projects that we have done that I would call a “failure” were projects where the clients refused to do what we advised them to do.
That happens.
In one case I submitted written recommendations time and time again only for them to be over-ruled by client staff. The project was less than stellar. But when you write down what it is you feel your client should do and give it to them in writing so there can be no question as to your advice and they STILL choose to go their own way? Well, you can’t be blamed for the results of a project where your formal written advice was not taken.
So yes, it is possible to snatch failure from the jaws of outstanding success even if you have SeETL and BI4ALL available to you.
You can refuse to use them as they are intended to be used and as you are advised to use them!
Ok. This is a pretty long blog post. I hope that you enjoyed it.
I hope that I impressed on you the critical roles that SeETL and BI4ALL (and predecessors) have played in guaranteeing Outstanding Success for our projects over a very long period now.
A period of just on 20 years this year!
If you would like to go the next 20 years consistently having Outstanding Success on your data warehouse projects?
I can say that adopting the tools I have crafted over this 20 year period would be a very good place to start!
If you got this far? Thank you for reading and/or listening. I hope this was useful to you. If you take this advice to heart? I just gave you the key to almost guaranteed Data Warehousing Success. The only way you can fail is if your company or client will not follow this advice.
Your Golden Key To Success
Best Regards
Peter