Hello and welcome to this blog post on Data Warehouse Automation.
A while ago I wrote about how to ensure BI projects are successful. These posts received very little attention.
I have long claimed that there is very little interest in our industry in making BI projects successful for the business. Most people are only interested in making BI projects pay for the vendors and the people who work on the project, not for the business that is paying for the project.
I have championed the idea that the BI Solution delivered to a customer should have a significant return on investment for more than 25 years now. Men like Bill, Ralph, Sean Kelly and I were pretty much alone in that call for return on investment.
Because of what some would call my “manic” desire for return on investment, my clients have been outstandingly successful in terms of money made from ideas I have given them.
The two stand-out successes I feel I can talk about are National Australia Bank, who doubled their profit from AUD1B to AUD2B over 5 years based on ideas I invented and gave them in 1992, when they were an IBM customer. The other is Australian Consolidated Press, the largest publisher of magazines in Australia, where I was a key contributor in turning that company around and doubling their gross profit in as little as 2 years. Of course, such massive profit improvements take a lot of people to make happen. They are not the result of just one or two people. It takes real teamwork.
These are just two of the more amazing success stories of the clients who have hired me and taken my advice. There are plenty more.
Most of my clients are none too keen for me to talk about the actual financial impact on their business from the work I have done for them.
Even so, companies keep hiring bozos to fail at building BI Solutions, and no one seems to care enough to call me and ask my advice.
The only part of that which truly aggravates me is that I too am painted with the same brush of “50% or more of BI projects fail” when the fact is that no project I have ever undertaken has failed where my client took my advice. Some very well known companies have had failed projects where they refused to take my advice. Of course, I am not responsible for what happens when my advice is dismissed even though it was paid for.
And this brings me to the reason for this blog post.
There is some level of interest in the idea of “Data Warehouse Automation” and so I thought I would make comment on this area for the very reason that building ETL systems is often a point of failure of BI projects.
Quite frankly most people who are building ETL systems have no idea what they are doing.
I have seen many ETL systems that were complete garbage make it into production, only to be patched little by little for years to come.
ETL development and support is the single largest variable cost in a Business Intelligence Implementation. It is also a common point of failure.
And yet companies do almost no due diligence with their vendors or even staff to find out if they know what they are doing with respect to ETL.
In one of our most famous public cases, Carphone Warehouse, the BI Supremo, Brian Ganly, has publicly stated that we developed such a good way of building ETL that we were 5 times faster than their previous best efforts. This is not because they were “bad” at developing ETL. It is because we are so good at it. And we delivered the finished ETL in Informatica, which was the ETL tool they had previously committed to.
For us to be 5 times faster at delivering ETL than ever before, for one of the world’s most cost-efficient retail companies, should have had people pop their heads up in interest.
But it didn’t.
Companies just kept spending good money after bad on ETL development even AFTER Brian Ganly went public with that statement.
I know it beggars belief, but it is true.
Now we have more charlatans claiming “Data Warehouse Automation is the new ETL silver bullet and with just a few clicks of a mouse and a graphical interface all your ETL problems will go away.”
I truly wish these charlatans would be run out of our business because they give us ALL a bad name.
That is what this post is about.
There is a LOT more to ETL than “a few point and clicks in a GUI to build your data warehouse”.
A LOT MORE.
I worked on extremely high performance batch processing projects in the 80s. I was a developer on the IBM Billing System called COBRA. This billing system was put into both IBM Canada and IBM Japan. IBM Japan was the largest single-image billing system project in the world at the time. Even larger than AT&T, because billing in AT&T was done in the regional operating centers.
In IBM Japan all billing was done on one computer through one system. A very big computer for the day, to be sure, but one computer. And about USD25B of billing was done through that computer per year because IBM had an astonishing market share in Japan.
I was the man who got the job of cutting the Billing System run time for our system from 35 hours down to below 8 hours for the monthly billing run. It is no exaggeration to say I was one of the top 10 people in the world in performance tuning of such large MVS systems at that time.
As life happened I moved to “Data Warehousing” in 1991 and, of course, our ETL tool was called “Cobol” at the time. We found the ETL system was the most difficult part of the project.
After I invented a new way to model data in 1991 we thought more about the ETL and how it was working, and we started making improvements. Once I was trained on Star Schemas and we implemented our first star schema data warehouse in 1994-95, it became obvious to me that there were patterns in the processing.
By 1997 we had developed an ETL tool where we had template cobol programs, files of data models, and another cobol program that could read the templates, read the models, and produce a finished cobol program that would perform the needed processing for the needed table.
Yes. You read that correctly.
In 1997 we had a tool that I called the “Instant Data Warehouse” where we could generate about 95% of all the code needed for a data warehouse using templates and a program to modify the templates.
This is one reason why I was so successful in the 90s selling Data Warehouses. Because we could deliver an entire project for the same price as Bill Inmon was charging for his Prism ETL tool!
And, of course, companies were quite happy to accept cobol as the ETL tool because everyone had armies of cobol programmers in the 90s. I even once got a cab ride with a guy who was a retired cobol programmer, that is how common cobol was!
To cut a long story short I know ETL.
I have been a level 5 expert in both DataStage and Informatica over the years. I know the insides of these tools better than most people who use them on a daily basis. I know and understand how to get data out of many different types of systems, how to build massive data warehouses, and how to make all that ETL run like a well-oiled machine with high performance and high reliability.
Indeed, I would go so far as to claim I do not know another man who knows ETL as well as I do.
Now, if you dispute that claim? Please put up some blog posts and describe to me that which you think I might not know. Ok?
For my part? I have published a great deal of information about SeETL and how we do ETL. I have even published a full ETL sample system which you can download on this link.
https://www.instantbi.com/company/downloads/
You can watch an extended video about it on this link.
https://www.instantbi.com/2016/03/20/seetl31-prototype-data-warehouse-detail/
So what are some of the major issues with ETL that need to be addressed in order to deliver very high performance ETL that is faster, cheaper, easier to develop than is otherwise possible?
I will go through the major ones I am thinking about. If you think of others by all means leave comments below and discuss your thoughts. I will make this at least a 2 part series so that we can add your ideas to the next parts.
I01 – Separation of Design and Coding
In my opinion, the #1 issue in developing ETL subsystems over the last 25 years is that the person coding the ETL system into whatever tool is being used is a different person from the person designing the ETL system. That person, in turn, is different from the person designing the data model, who is again a different person from the one with subject matter expertise in the source systems.
In most BI solution implementations you have:
1. People who are source system subject matter experts.
2. One or more people who are designing the target data models.
3. One or more people who are designing the ETL specifications.
4. One or more people who are coding the ETL subsystems.
Now. The only place you might see something different is where the person designing the data models also has source system subject matter expertise.
Why is this a problem?
Well, if you read your Mythical Man-Month by Frederick Brooks (1975) or your Software Engineering Economics by Barry Boehm (1981), like I did, you would know that adding nodes (people) to a software development project increases the number of communication lines far faster than linearly. Brooks put it at n(n-1)/2 possible lines of communication for n people.
This is one reason why Brooks recommended the “Surgical Team” model for software development, to reduce the number of lines of communication.
Why? Because an increased number of lines of communication not only takes longer and costs more, it also introduces more opportunity for mis-communication.
We have all played the game “whisper”, where a child whispers something in the ear of the next child in line, and after 10 children the whisper is never what it started out to be.
If that does not accurately describe many BI efforts I do not know what does! LOL!
The mere fact that the person designing the ETL, the person designing the data model, the person implementing the ETL and the subject matter expert are four different people automatically gives you 4 x 3 / 2 = 6 lines of communication between them before anyone else joins the project, and most of those lines will be in constant use. Put a second person in any of those roles and the count climbs quickly.
Because I DID read my Brooks and Boehm, and I knew that reducing the number of nodes (people needed) also reduces the communication lines and the opportunities for error, I created a tool where the SAME PERSON performs the ETL design, the ETL coding, the data modelling AND is the source system subject matter expert.
AND I created the tool in a tool that this person would not have to learn. It’s called EXCEL.
So, in order to address the most fundamental issue of all in ETL development, the fact that too many cooks are in the kitchen, I invented a tool that allowed ONE COOK to be in the kitchen building the data warehouse so that it would have what Brooks described as “conceptual integrity”.
Further, that one person is able to deliver the largest and most complex of data warehouses in less time with more reliability than ANY NUMBER of people.
It is the old saying. You can not get 9 women to produce a baby in one month.
It does not matter HOW BIG or HOW COMPLEX the data warehouse data model is.
With SeETL, ONE MAN, ONE MIND, can do ALL the ETL development and ALL the data model development inside Excel as the tool and NEVER have to communicate to someone else what it is he is doing until he is finished.
Now. How does that compare with any of the ETL tools or these newly promoted “Data Warehouse Automation Tools”?
It doesn’t. The ETL/DWA tools require, by design, a “programmer” to run the tools. Now, that “programmer” might be someone who can also design the ETL on the fly using the tools. However, I would bet that it is not the same person who is designing the data warehouse data models or a source system subject matter expert.
So. Right at the beginning. ETL/DWA tools MUST be far less productive and far more expensive to use by mere virtue of the fact that more than one person must be involved.
I02 – The End User Target Data Model Must Be Dimensional
Back in 1991 when I first saw a dimensional data model implemented on a relational database I was quite in awe. It was so simple and so elegant as to simply “make sense”.
I am talking about the Metaphor Computer Systems demonstration database that shipped with the product. I implemented the Data Interpretation System for an IBM customer back in 1991 and that was my foray into Data Warehousing.
I studied this little database and went looking for documentation on how to build it…there was none.
In 1993 I was over at Metaphor’s head office and I asked if I could be trained on designing and building these dimensional databases, and I was told no. That information was considered IBM Confidential; it was held within the Professional Services Team and would not be divulged to me even though I was an IBM employee.
So in 1994-95 we embarked on our first dimensional data warehouse and we hired one of the PS guys to come out and show us how it was done. That man became my good friend and mentor. I have only good things to say about him as he is a scholar and a gentleman to this day.
I was so completely sold on the idea of dimensional models that I bought and gave away more than 100 copies of Ralph’s book when it came out, and gained agreement from Ralph to use his book’s data models in demonstrations to our prospects in Australia.
I was commonly called “Mr. Star Schema” in Australia and I was the man who brought the whole idea of dimensional modelling into the mainstream in Australia. When I was first talking about these things I was actually kicked out of presentations for talking about such a ridiculous idea, because “everyone knows” that 3NF is the only way to develop relational databases.
The point is that argument is over now. It has been for 20 years.
If you want your users to use your data warehouse in any sort of self-service mode then you have to present a dimensional model. Further, many of the tools, like MicroStrategy, will not even run on a model that is not dimensional; that is how strong the need for dimensional models is.
Operational systems are not dimensional models.
The end user layer for good BI solutions is dimensional.
Period.
There is a mapping exercise in the middle.
Period.
It takes more than a “few points and clicks with a mouse” to build this mapping system.
All source system keys must be translated into meaningless integer keys (surrogate keys) and those meaningless integer keys need to be maintained.
Whereas the source system maintains point-in-time data, the data warehouse contains historical data, so the re-use of source system keys over time has to be considered and coded for.
The 3NF source data for dimensional data has to be joined and flattened out to dimension tables to get the needed performance. This takes considerable skill.
The 3NF source data for dimensional data has to have its hierarchies defined and navigation paths defined. This takes considerable skill.
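To make that concrete, here is a minimal sketch in plain SQL of what the flattening looks like. All of the table and column names (src_customer, src_customer_segment, src_region, dim_customer) are made up for this example only; they do not come from any particular source system or from SeETL.

```sql
-- Illustrative only: flatten three 3NF source tables into one denormalised
-- customer dimension so the BI tool only ever has to join one table.
-- (The integer surrogate key on dim_customer is assumed to be an identity/sequence column.)
insert into dim_customer
    (customer_src_key, customer_name, segment_name, region_name, country_name)
select
    c.customer_id,        -- source system key, kept for later lookups
    c.customer_name,
    s.segment_name,       -- pulled up from the segment table
    r.region_name,        -- pulled up from the region table
    r.country_name
from src_customer         c
join src_customer_segment s on s.segment_id = c.segment_id
join src_region           r on r.region_id  = c.region_id;
```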
The transactions that are coming through in to the data warehouse have to be properly connected with all these dimension tables. This means that source system keys have to be placed on the transaction records and looked up in the newly created dimension records to map the string keys to the integer keys.
Further, it is not uncommon for a telco to have 100 million call detail records a day, so the performance of this process has to be MASSIVELY FAST.
This is not achieved by “a few points and clicks of a mouse”.
There are very specific requirements for how to best make these joins happen with extreme speed. Today’s fast databases are doing a lot for this specific issue.
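Here is a minimal sketch of that lookup step in plain SQL, again with invented table and column names (stg_sales_txn, dim_customer, dim_product, fact_sales). It shows the idea, not the tuned, high-volume version.

```sql
-- Illustrative only: attach the meaningless integer keys to incoming transactions
-- by looking up the current dimension rows on their source system keys.
insert into fact_sales
    (date_key, customer_key, product_key, quantity, amount)
select
    dd.date_key,
    cu.customer_key,       -- integer surrogate key from the customer dimension
    pr.product_key,        -- integer surrogate key from the product dimension
    st.quantity,
    st.amount
from stg_sales_txn st
join dim_date     dd on dd.calendar_date    = st.txn_date
join dim_customer cu on cu.customer_src_key = st.customer_id
                    and cu.current_flag     = 1   -- only the current version of the customer
join dim_product  pr on pr.product_src_key  = st.product_id
                    and pr.current_flag     = 1;
```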
Every company that deals with mass numbers of customers, which describes most of the companies that benefit the most from BI, NEEDS to keep a full history of the profiles of its customers as they change over time.
The way in which a PERFECT HISTORY of customers and their profiles can be maintained over time is non-trivial and most certainly cannot be done “in a few clicks”.
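One common way to keep that history is the type 2 slowly changing dimension approach: close off the old row and insert a new current row whenever a tracked attribute changes. Below is a minimal, illustrative sketch with invented table and column names; it is not the SeETL implementation.

```sql
-- Illustrative only.
-- Step 1: close off the current row for any customer whose tracked attributes changed.
update dim_customer
set    current_flag = 0,
       effective_to = current_date
where  current_flag = 1
and    customer_src_key in
       (select s.customer_id
        from   stg_customer s
        join   dim_customer d
          on   d.customer_src_key = s.customer_id
         and   d.current_flag     = 1
        where  d.segment_name <> s.segment_name
            or d.region_name  <> s.region_name);

-- Step 2: insert a new current row for changed and brand new customers alike.
-- (The integer surrogate key is assumed to be an identity/sequence column.)
insert into dim_customer
    (customer_src_key, customer_name, segment_name, region_name,
     effective_from, effective_to, current_flag)
select s.customer_id, s.customer_name, s.segment_name, s.region_name,
       current_date, date '9999-12-31', 1
from   stg_customer s
where  not exists
       (select 1
        from   dim_customer d
        where  d.customer_src_key = s.customer_id
        and    d.current_flag     = 1);
```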
Even worse?
You can expect that data from many operational systems will be fed into the data warehouse, that the same entities (for example customers) are present in these different systems, and that cross reference tables will need to be built in order to map the same physical entities in the real world together in the data.
Building these cross reference tables, so that the same customer (person or company) can be uniquely identified in the data warehouse when they have different keys in the different source systems that maintain data about them, is not a “point and click” exercise. Period.
It is damned hard work to try and figure out who are really the same people, what are really the same addresses, and to create unique entities for parties and addresses and link them all properly so they are an accurate representation of what is happening in the real world. Damned hard work.
My customers and I have been working on that very point for decades now.
We are very good at it now and we can tell you it is NOT easy.
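To show the shape of the thing, here is a minimal sketch of a customer cross reference table in plain SQL. The table and column names are invented for illustration; the real matching rules (name, address, date of birth, fuzzy matching and so on) are a whole subject in themselves.

```sql
-- Illustrative only: one row per source system key, all pointing at the single
-- warehouse-wide party key for that real person or company.
create table xref_customer
(
    dwh_party_key      integer      not null,   -- one key per real-world party
    source_system_code varchar(20)  not null,   -- e.g. 'BILLING', 'CRM', 'PROVISIONING'
    source_system_key  varchar(100) not null,   -- the key as that system knows it
    match_rule_code    varchar(20)  not null,   -- which matching rule linked this key
    batch_number       integer      not null,   -- which batch created the link
    primary key (source_system_code, source_system_key)
);

-- Any load can then resolve a source key to the single party key.
select x.dwh_party_key
from   xref_customer x
where  x.source_system_code = 'CRM'
and    x.source_system_key  = '0012345';
```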
I03 – Bad Data Must Be Fixed – Codes Must Be Decoded
One of the easiest ways to determine if a data warehouse architect knows what the hell he is talking about is to ask him what his standard solution is for bad data entering the staging area.
If he says that he has no standard solution for that problem because data is always ok and always flows to the data warehouse?
DO NOT HIRE THAT MAN!
HE HAS NO IDEA WHAT HE IS TALKING ABOUT!
Data coming from operational systems is notoriously bad. I have even seen email addresses in telephone number fields, that is how bad data can be!
In general, the older the source system you are getting data from the worse condition you can expect that data to be in.
You MUST have a strategy for holding up bad data in the staging area until it can be addressed in some sensible way. One of the BEST ways to truly mess up your data warehouse is to not have mechanisms for stopping bad data getting into it.
Bad data MUST be fixed and it MUST be held in the staging area before making it into the data warehouse. Rules have to be defined and design patterns have to be implemented.
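As a minimal sketch of what “holding up bad data” looks like in practice, here is one rule expressed in plain SQL, assuming a staging table that carries a row valid indicator and an error code column (all names invented for this example):

```sql
-- Illustrative only: flag staging rows with an impossible or missing date of birth
-- so they are held back from the warehouse load until someone fixes them.
update stg_customer
set    row_valid_ind = 0,
       error_code    = 'DOB_OUT_OF_RANGE'
where  date_of_birth is null
   or  date_of_birth < date '1900-01-01'
   or  date_of_birth > current_date;

-- The load to the warehouse only ever picks up rows that passed every rule.
select *
from   stg_customer
where  row_valid_ind = 1;
```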
Further, we all know that the geniuses who develop operational systems just love to use their own versions of codes for things and then these codes evolve over time.
Just take the simplest of all pieces of data about customers.
Gender.
In the good old 90s there used to only be four values for gender.
1 – Male
2 – Female
3 – NA (for a company)
4 – Unknown (as in the system did not record it)
Of course, the geniuses who build operational systems like to code these in all sorts of imaginative ways to prove their genius, not to make maintenance and support easier.
So we ETL guys had to look at each code field, figure out the ranges of values, develop new consistent codes that made sense to real people in the business, develop the mappings for them, and code those mappings into the ETL systems.
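Here is a minimal sketch of that kind of code mapping in plain SQL. The table, the codes and the source systems are all invented for illustration; the point is that every code field needs a mapping like this, maintained somewhere.

```sql
-- Illustrative only: map each source system's pet gender codes
-- to one consistent set of warehouse codes.
create table map_gender_code
(
    source_system_code varchar(20) not null,
    source_gender_code varchar(10) not null,
    dwh_gender_code    varchar(10) not null,   -- 'M', 'F', 'NA', 'UNK'
    primary key (source_system_code, source_gender_code)
);

insert into map_gender_code values ('BILLING', '1',  'M');
insert into map_gender_code values ('BILLING', '2',  'F');
insert into map_gender_code values ('CRM',     'ML', 'M');
insert into map_gender_code values ('CRM',     'FM', 'F');
insert into map_gender_code values ('CRM',     'CO', 'NA');

-- Applied in the staging-to-warehouse step; anything unmapped falls back to 'UNK'.
select s.customer_id,
       coalesce(m.dwh_gender_code, 'UNK') as gender_code
from   stg_customer s
left join map_gender_code m
       on  m.source_system_code = 'CRM'
       and m.source_gender_code = s.gender_code;
```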
Nowadays? I think at last count Facebook was pretending there are 52 genders.
Spare me.
Apparently kids are so confused today they do not know if they are male or female. I can’t see things working out so well for them if they can not figure their sex out by the time they are in college or university.
And everyone who has done ETL for any length of time knows there are LOTS of codes used in operational systems that all have to be mapped like this.
Not a “few points and clicks with a mouse” exercise.
I04 – Naming Standards Have to Be Invented and Mapped
Every genius who gets to build an operational system just LOVES to stamp his own character on the system by using his own pet unique naming standards for fields and tables.
Just go and have a look at the naming standards for tables and fields in the wide variety of ERPs that are out there, let alone the individual custom-built systems that many companies are still running.
You can find as many naming standards as there are stars in the sky. Well, maybe it’s not THAT bad, but you certainly get my point, right?
The target data warehouse, because it is likely to outlast any company ERP, must be developed with an ERP-neutral data model and an ERP-neutral naming standard.
One of the biggest mistakes I see time and time again is that the data warehouse is built using the same naming standard as the ERP and is so dependent on the ERP data model that if the ERP is changed over then the data warehouse has to be thrown out and a new one started.
This is a very common mistake and one that Bill warned about as early as 1993.
As early as 1993 Bill was saying:
“your data warehouse will last longer than your operational systems so you must make sure you use a subject oriented data model that is operational system independent and you must make sure you do not tie yourself to your operational system naming standards in the data warehouse.”
Bill was saying that in his books and in his presentations as early as 1993.
That’s 24 years ago now.
Yet, STILL people build data warehouses where the names of the source system fields flow through to the data warehouse because “it is just easier like that and heck, I will not be around when this thing needs to be replaced.”
One of the things that needs to happen in every project is that the names of the source system fields have to be cleaned up so that they are valid field names in an ANSI-compliant SQL database, without the brackets or other delimiters that are commonly needed for special characters in field names such as blanks, percentage signs or hyphens. This is usually done for the staging area.
Then, once the staging area has a set of field names that are sensible the names in the target data warehouse should reflect industry standard names that are understandable by normal people and should NOT be the operational system name transferred through to the data warehouse.
If you really MUST use similar names in the data warehouse then at LEAST make the underlying field name in the table meaningless and put views over the top. That way you can change the name of the field in the view at a later date, and you can put multiple views over the top of the underlying tables so as to release versions of the data warehouse when the ERP change over happens.
BECAUSE THE ERP CHANGE OVER WILL HAPPEN!
And you as the Data Warehouse Architect had better plan for that to happen!
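Here is a minimal sketch of the view approach just described, with invented table, column and view names:

```sql
-- Illustrative only: the physical column names stay neutral (even meaningless),
-- and a view presents the business-friendly, ERP-independent names to the users.
create table dw_cust_dim
(
    cust_key integer,
    col_001  varchar(100),
    col_002  varchar(50),
    col_003  date
);

create view customer_dim_v1 as
select cust_key as customer_key,
       col_001  as customer_name,
       col_002  as customer_segment,
       col_003  as customer_start_date
from   dw_cust_dim;

-- When the ERP changes over, a customer_dim_v2 view can present the same underlying
-- data under whatever names the new generation of reports needs, released in parallel.
```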
Again…if you have 20,000 fields in your ERP (and many have far more) it is not a point-and-click exercise to create a map to change all their names to a more industry-standard naming standard.
Anyone who tells you it is a simple matter is lying to you.
I05 – It is FAR Better To Use An Industry Standard Data Model
Let me tell you a story that I hope is very instructive.
Back in 2006 Sean Kelly and I were invited to present our Telco Data Models and ETL to a company called NTL in the UK. This company has since been purchased by Virgin Media.
This was a complete re-write of a prior generation of Telco Data Models, into which Sean and I poured everything we had learned over the prior 15 years, both in content and in mechanisms for design and maintenance. By which you should understand we refused to use the industry standard data modelling tools, because they were an exercise in “click to death”.
Indeed, those early versions of the new models were nothing more than views in a text document that we were selling for a very handsome price!
As was our practice we proposed that we would provide NTL with a truck load of information about our models under NDA. Then we would go to NTL for one full day and we would present for the full day from the high level business aspect all the way down to the nuts and bolts.
The business managers could stay as long as they liked but usually they just stayed for the morning sessions as the afternoon sessions were for the technical guys to learn enough to vote on their decision.
At the end of the day, we promised, the people with a vote would have enough information to cast their vote for which way they wanted to go. Sure, we would answer follow up questions, but we made sure that on the day enough information was provided for a well informed person to cast their vote.
In this case we were up against Accenture.
So, a small specialist consulting firm versus one of the worlds largest consulting firms!
The vote of the people who had a vote was 6 – 6 and in the end the decision was to go with Accenture. We were not told a reason but one presumes that it was the “safer choice”.
Fast forward to the Netezza Users Conference in late 2009 where Sean Kelly and Brian Ganly were speaking to a telco break out group about our Carphone Warehouse Project.
Brian Ganly gave the high level numbers to the audience. They were:
- 3 source systems fully mapped, including Billing, CRM and provisioning.
- 4,000 fields mapped and placed into the dimensional data warehouse.
- Total project time from day one start-up to in production: 8 months (including Christmas and Easter breaks).
- 55 fact tables, 75 dimension tables, 800 key fields in the data warehouse.
- The project hit the date proposed in the initial project discussions a year out.
- The project came in GBP150,000 under budget on services.
He summarised the project by saying that on a field for field comparison with their previous best efforts the metrics showed the project was approximately 5 times faster.
One of the audience members came up to Sean Kelly and asked him if these numbers could possibly be true, had Brian misled the audience? Sean assured him that the numbers were true and asked the audience member where he was from.
Sure enough, he was working on the NTL project with Accenture, and they had STILL not gotten to the same place Carphone had reached, despite the fact they had started 2 years earlier.
Sean explained to the good gentleman that we had, indeed, proposed our solution to NTL and had lost the vote despite our best efforts. The good gentleman told Sean that, given what he had just heard and what he had seen on the NTL project, it would seem the decision had not been such a great idea.
Of course, Sean agreed and mentioned that, in hindsight, people who chose not to buy from us often felt that way; however, we can only do our best and present our solution as well as we can. It is up to the company to buy or not buy from us. Sean mentioned that big companies frequently go for the Accentures of the world because they believe it to be the safer option at the time.
Sean was always very careful not to criticise our competitors no matter how deserving they might be because it is often seen as “negative”. Sadly our competitors had no such qualms criticising us!
The moral of that story is this.
In 2017? A company would have to be very, very ignorant of what is available on the marketplace to allow a project to go ahead to build a customised unique data model for a data warehouse. There may be a few companies out there who could make a case for doing so.
But if a company is working in an industry where they are selling products or services to a large number of individuals, especially if they know who the individual is, there can be no valid return on investment case made for building a data model from scratch. Especially when we have BI4ALL which can be purchased on a monthly license charge now.
Other companies have data models as well.
We stand by BI4ALL.
We believe we have the most cost effective data model available today in the areas of Telco, Retail, Web and Media. If anyone else thinks otherwise? We would love to see your arguments! LOL!
So, back to Data Warehouse Automation and ETL as a result of this story.
Given that most companies WILL have an ERP or major operational system change over during the life of the data warehouse they must make a decision and it should not be a decision by default. It should be a considered decision.
“Should we tie our data warehouse so close to our ERP such that if we change over ERPs we have to redo the data warehouse?”
If the answer is yes then the ERP change over will include a data warehouse change over. Many people say yes to this question because they cannot even imagine building a data warehouse with a data model and a naming standard that are independent of the existing ERP.
If the answer is no then included in the cost of the development of the data warehouse is the cost of using an industry standard ERP neutral data model that has a different naming standard to the ERP.
Of course, my point is that if the answer is the latter, no, and you will build a data warehouse that is independent of the ERP or major operational systems, then this is not “a few clicks with a mouse” to create the ETL via some mystical and magical “Data Warehouse Automation Tool”.
I06 – Taking Data Out of the RDBMS to Manipulate It
Of course, in the 90s and 00s ETL tools read data out of the RDBMS, put it into files and memory, and manipulated the data there, because it was about 100 times faster than using the database.
As machines sped up, the difference between manipulating data in files and manipulating data in tables became smaller. The gap was still large, but the time to take the data out of the database and put it back into the database was taking, proportionally, more and more of the ETL processing time.
With the advent of Netezza Brian Ganly managed to persuade me, in 2009, to see if we could perform the attribution of call detail records inside the Netezza database.
To say I was sceptical is to be very polite. We picked a set of 80M CDRs and we had 25 lookups to join. I was totally convinced this was not possible on any database manager. Something would go bang! Right?
To my astonishment this worked.
To my even greater astonishment it worked in 20 minutes!
As I sat there in early 2009 looking at the prompt telling me the job had finished I realised I had seen the future of ETL and that future was SQL.
If a DEVELOPMENT machine could perform the attribution processing of 80 million call detail records, the most CPU-intensive portion of ETL processing, in 20 minutes, then there was very little a production machine could not do. And as a side note, on our new Netezza production machine we actually did a join of 2 BILLION CDRs to 25 tables and it worked.
AMAZING!
It was clear to me, in 2009, that any ETL style tool that relied on taking data out of the database, manipulating it, and putting it back in, would go the way of the dodo.
ETL was going into the database, the new language of ETL would be SQL, and all those high-priced ETL tools would become SQL generators and schedulers.
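To give a feel for what that looks like, here is a heavily cut-down, illustrative sketch of set-based CDR attribution done entirely inside the database. All names are invented; a real job would have far more lookups and would use outer joins with default dimension rows for anything that fails to match.

```sql
-- Illustrative only: the attribution never leaves the database.
insert into fact_call_detail
    (call_date_key, customer_key, handset_key, tariff_key, cell_key,
     call_duration_secs, call_charge_amount)
select
    dd.date_key,
    cu.customer_key,
    hs.handset_key,
    tf.tariff_key,
    ce.cell_key,
    cdr.call_duration_secs,
    cdr.call_charge_amount
from stg_cdr      cdr
join dim_date     dd on dd.calendar_date    = cdr.call_date
join dim_customer cu on cu.customer_src_key = cdr.msisdn      and cu.current_flag = 1
join dim_handset  hs on hs.handset_src_key  = cdr.imei        and hs.current_flag = 1
join dim_tariff   tf on tf.tariff_src_key   = cdr.tariff_code and tf.current_flag = 1
join dim_cell     ce on ce.cell_src_key     = cdr.cell_id     and ce.current_flag = 1;
```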
Of course, we too had a C++ ETL engine, and we decided to stabilise it in 2009 and move ahead with tools that generated SQL. We still use portions of our C++ ETL engine from time to time because we have functionality in it that the SQL version does not have. However, it’s not very much any more.
So now, in 2017, taking data out of the RDBMS, placing it in files to manipulate it, and then putting it back into the database is a mistake that is easily corrected.
I can’t see too much reason for doing that any more. Of course there will be exceptions, and I would like to see people write about them.
I07 – Audit and Control
It never ceases to amaze me that companies allow ETL systems to go into production without sufficient auditability and controls over data on which they may very well make major decisions.
As someone who came out of Billing Systems and who reviewed systems for Audit and Control certification, I know and understand the importance of Audit and Control. If you do not know why audit and control is important, just google “Barings Bank”. A 200-year-old bank brought down by one rogue trader because of the lack of audit and control.
In my 26 years in BI I have consistently seen premeditated theft by employees that would have been caught with even the most basic of audit and control features.
I have seen accidental damage done by people that would have been caught by even the simplest audit and control processing.
And I have seen systems where, when damage is done, however it is done, the audit and control capability is so poor that databases have to be restored rather than fixed.
Audit and control features have to be added, and they have to cover the following as a bare minimum:
- Batch Numbers for processing
- Batch date independent of system date.
- File Cycle Numbers to tell you how many times an input file has been processed.
- Record counts of rows/records to be processed.
- Each row to contain batch number, file cycle number, processing timestamp, and at least three flags: a row valid indicator, a row deleted from source system indicator, and a row sent to EDW indicator in staging. If you have an ODS, another flag for row sent to ODS.
This is the absolute MINIMUM you need. You might also have hash totals or balancing totals in reports to make sure no rows get lost in processing and no rows are duplicated.
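As a minimal sketch of what those bare-minimum structures can look like, here is one possible shape in plain SQL. All the table and column names are invented for illustration.

```sql
-- Illustrative only: a batch control table, a file cycle table, and the audit
-- columns carried on every staging row.
create table ctl_batch
(
    batch_number integer     not null primary key,
    batch_date   date        not null,    -- business date, independent of the system date
    batch_status varchar(20) not null,    -- e.g. 'STARTED', 'COMPLETE', 'FAILED'
    started_at   timestamp   not null,
    completed_at timestamp
);

create table ctl_file_cycle
(
    file_name         varchar(200) not null,
    file_cycle_number integer      not null,   -- how many times this file has been processed
    batch_number      integer      not null,
    rows_read         integer      not null,   -- record counts for balancing
    rows_loaded       integer      not null,
    primary key (file_name, file_cycle_number)
);

create table stg_customer_example
(
    customer_id         varchar(50),
    -- ... the business columns go here ...
    batch_number        integer   not null,
    file_cycle_number   integer   not null,
    processing_ts       timestamp not null,
    row_valid_ind       smallint  not null,   -- passed the edit rules
    row_deleted_ind     smallint  not null,   -- deleted on the source system
    row_sent_to_edw_ind smallint  not null    -- has been moved on to the warehouse
);
```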
All these audit and control functions have to be written into any tool that is going to be used.
This is not “a few points and clicks with a mouse”.
I08 – Security of ETL Code Against Malicious Attack.
For some reason companies place an unrealistic expectation of honesty on the people who are allowed access to production systems. I have never understood this because in any population of people you will find a few dishonest ones who would rather steal than work for their money.
It’s normal.
So it has to be guarded against. My days in IBM taught me many things about how to secure systems that process representations of money against collusion and fraud. After all, an IBM Billing System billed hundreds of millions of dollars for a small country and billions of dollars for a large country.
When you are making USD40,000 a year and working on a system billing BILLIONS the temptation to find a way to steal is very, very, high.
ETL is no different. If those who would rather steal their money than work for it can find a way to do so, you can be pretty sure they will try it. It is a question of risk vs reward. If there is little perceived risk and significant reward you can be sure some man will try his luck!
So one feature that is missing from pretty much every ETL tool I have seen, and I would imagine also missing from these DWA tools, is security of ETL code against malicious attack.
Now, how does one go about securing SQL code against a rogue DBA who is intent on stealing?
Simple. You make it impossible for him to cover his tracks. Since he knows he cannot cover his tracks, he knows he is very likely to be identified and caught. Higher risk means fewer men will try to steal.
So how do we approach this problem? We have a unique key for the SeETL tool that has the Windows authenticated user as part of the input used to generate the key. So the SQL generation can only be done by people with keys.
Sure, the output can then be edited if the file system is not secure.
However, when the SQL statement is put into production it can only be put there by an SQL user who has update access to the SQL Statements Table in our dictionary. This user can be denied delete access to the table. The automated process does not allow the user to delete SQL statements. The old statement has its “current flag” altered from 1 to 0, and the new SQL statement from the file system is inserted into the SQL Statements table with its current flag set to 1.
This means that, in order to maliciously alter an SQL statement, the DBA loading the statement, who should not be a member of the BI team but a support DBA, would have to edit the SQL statement directly inside the SQL Statements Table, because the tools he has available do not allow him to edit SQL statements. They only allow him to insert them.
Since the SQL statement inside the SQL Statements table has all newlines removed, it is pretty hard to read or edit, which also helps.
Lastly, when an SQL statement is executed it is logged to a log table that the support DBA does not have update access to.
So. If the support DBA decides to try his hand at updating an SQL statement in production, in place, inside the SQL Statements table? That statement will be logged where he cannot change it. When he then goes back and corrects the statement to be what it should be, he is not able to remove the log record.
If he is the ONLY person who has the ability to alter the SQL Statements table, which he should be, then the fact that the statement that was logged is now different to the statement that is in the SQL Statements table points a very long finger directly at the support DBA as having maliciously altered the statement.
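Here is a rough, illustrative sketch of the kind of structures and permissions that make this work. This is not the actual SeETL dictionary; the table, column and role names are invented to show the idea.

```sql
-- Illustrative only: versioned statements plus an execution log the loader cannot touch.
create table sql_statements
(
    statement_name varchar(100)   not null,
    statement_seq  integer        not null,
    current_flag   smallint       not null,   -- 1 = the version that will be executed
    statement_text varchar(32000) not null,   -- stored with newlines removed
    loaded_by      varchar(100)   not null,
    loaded_at      timestamp      not null,
    primary key (statement_name, statement_seq)
);

create table sql_execution_log
(
    statement_name varchar(100)   not null,
    executed_at    timestamp      not null,
    executed_by    varchar(100)   not null,
    statement_text varchar(32000) not null    -- exactly what was executed
);

-- The support DBA can insert and update statements but is never granted delete,
-- and is granted no rights at all on the execution log.
grant insert, update on sql_statements to support_dba;
grant select         on sql_statements to support_dba;
```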
In our projects we make sure that the people who have access to the production SQL Statements Table to update statements know very well that there is a log table they don’t have access to.
The message is clear to anyone who is thinking of malicious intent for the ETL system.
“We have an audit trail that you can’t change so your chances of being caught are very high.”
All the ETL tools I have seen have very little security around the code that is executed in production. In general many people can change it. There is little versioning. There is little logging. There is little clear attempt to make sure that those few dishonest people with malicious intent are discouraged from attempting any stealing.
These things all take time to think about and to implement.
I would actually be interested in blog posts talking about what features are in place in other tools to help discourage attempts at stealing via manipulation of code in ETL systems.
If you got this far?
Thanks for listening or reading!
That’s a wrap for this blog post.
Thanks for dropping by!
I really appreciate it!
Until next time!
Peter