Note: You can listen to this blog post on the video, or read it here.
Hello and Welcome.
I am Esther.
I am Peter's AI assistant, created to make voice overs.
I will simply read Peter's blog posts, so that you have the choice of reading the blog post or listening to my voice.
Hello and welcome, gentlemen.
In this blog post I am going to talk a little about the failure rates of what can loosely be called data projects.
I am prompted to do this because Mike over at the Data Janitor channel sent out a newsletter with some claims about data science projects.
His newsletter quoted a Forbes article, in which we find the following.
Quote.
In two thousand and sixteen Gartner estimated that sixty percent of data science projects were failing. In two thousand and seventeen analyst Nick Heudecker said it was likely closer to eighty five percent.
End quote.
Mike says that in the machine learning area only four percent of data science projects end up with a published usable machine learning model.
I have not heard this four percent number before.
And even I think that is really bad.
So.
A good question to ask is what the heck is going on?
And perhaps an even better question to ask is this one.
Is this new?
Or have data style projects had high failure rates historically?
Well?
I have been in the data analysis space for thirty four years.
If I wanted to be a bit flexible about that?
In nineteen eighty nine I wrote an SQL language parser that could read SQL-like queries, execute them against an IMS database, and send the resulting data to a file.
Yes. You heard me correctly.
I wrote an SQL language parser that could read IMS databases using the database definitions for IMS in nineteen eighty nine.
It was a very interesting little program to make a point on a project I was the system architect for.
That point was that we should not convert the system from IMS to DB2 just to allow querying.
So if you wanted to count that little exercise I have been in the data query area for thirty six years.
And not even I have seen it all.
I can tell you, for certain, that the failure rates of all sorts of data query style projects have been above fifty percent since nineteen ninety one.
Indeed, the very first one I did that was a Metaphor implementation was something of a disaster in the first ninety days as well.
We created a small database that was queryable, yes.
But each query took an hour and cost four hundred dollars to execute, because CPU time was billed on a charge back basis.
The customer said that they loved what we did but were not willing to pay the price per query.
So, I spent the next three months independently inventing the idea of multi level summary tables for query access.
After those three months from July to September nineteen ninety one, most queries were completed in under one CPU second.
A good ninety percent were completed in under ten CPU seconds.
And less than one percent of the queries we had to run took over a minute of CPU time.
With zero queries running for the likes of an hour as they had done just three months earlier.
When I had experienced DB2 people in for demonstrations, they could not believe what they were seeing.
I could present national summary level data in less than a second.
Then I could drill down on states, on cities, on postcodes.
All with sub second response times.
I could drill down on products.
I could drill down on demographics.
Many of the experienced DB2 people who saw these demonstrations likened them to magic.
These were people from inside IBM who handled our other large clients.
I was asking them if this seemed useful to them for their accounts.
The excitement was overwhelming from my peers.
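For readers who have never seen them, the multi level summary table idea can be sketched in a few lines. This is a minimal illustration in Python with sqlite3; the table and column names are my own invention, not the original Metaphor schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Detail-level fact table: one row per sale (hypothetical schema).
cur.execute("CREATE TABLE sales (state TEXT, city TEXT, product TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("NSW", "Sydney",    "A", 100.0),
    ("NSW", "Sydney",    "B",  50.0),
    ("NSW", "Newcastle", "A",  25.0),
    ("VIC", "Melbourne", "A",  80.0),
    ("VIC", "Melbourne", "B",  40.0),
])

# Pre-computed summary levels, refreshed in batch at load time.
# Queries read these tiny tables instead of scanning the detail rows.
cur.execute("""CREATE TABLE sales_by_state AS
               SELECT state, SUM(amount) AS amount FROM sales GROUP BY state""")
cur.execute("""CREATE TABLE sales_national AS
               SELECT SUM(amount) AS amount FROM sales""")

# National level answers instantly from a one-row table...
national = cur.execute("SELECT amount FROM sales_national").fetchone()[0]
# ...then drill down to states from the state-level summary.
by_state = dict(cur.execute("SELECT state, amount FROM sales_by_state"))

print(national)         # 295.0
print(by_state["NSW"])  # 175.0
```

The drill down simply walks from the one-row national table to the state table, and on down, each level being a small pre-aggregated table rather than a scan of the detail rows.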
Then in late September we handed this semi stable prototype over to the customer's business people.
They were allowed to use it as they wished until the end of the year.
If they liked it they could buy it.
If they didn’t like it we would turn it off.
Please remember, this was a customer who paid IBM ten million Australian dollars a year, so these sorts of try-and-buy product trials were quite common.
We were selling the Metaphor Data Interpretation System.
This is the system that Ralph Kimball was the primary designer for.
Ralph Kimball was a co-founder of Metaphor Computer Systems in nineteen eighty two.
This is how I became friends with Ralph.
Even to the extent that he asked me to review his first book and to write a guest article for his column in DBMS magazine.
At the end of the year the customer invited me to a very nice restaurant for a Christmas lunch to say thank you for all my hard work.
He told me they had already made back the three hundred thousand dollars that we were charging them for the system.
He told me they made quite a bit more and that is why the lunch was very nice and very expensive.
He was not allowed to tell me what they had done as part of normal IBM business practices.
Our customers were told never to reveal to us any information that they did not want public, so that IBM employees could not be accused of leaking information to competitors.
So I was left wondering what they had done.
But one thing I knew was this.
The preparation of the data for querying was a key component.
In June nineteen ninety four I resigned from IBM and went and worked for this life insurance company.
As a single person company I was allowed to sign a non disclosure agreement and, of course, I would honour it.
So in July nineteen ninety four I found out one of the other key elements of the success of this company.
It was described to me in words close to the following.
Quote.
The way I use the system is like this.
I have an idea, a theory, a hypothesis.
Some idea of how we might improve the profitability of a segment of our business.
I don’t know if it’s good or bad or otherwise.
I go to the system and I start asking it questions to prove or disprove my theory.
On average about three ideas are bad, three ideas are about what I expected, and three ideas are better than expected.
But about one time in ten the idea is far better than I expected.
About one time in ten I will turn up a diamond of an idea.
If I had to liken it to anything I would liken it to diamond mining.
You know there are diamonds in the mine somewhere.
You don’t know where.
So you just have to keep digging.
And it’s not like it’s immediately obvious that something is a diamond.
It’s more like each question you answer generates two or three more questions.
And you answer those questions and follow the leads and eventually, one time in ten or so, you come up with a diamond.
End quote.
I asked him what he considered a diamond to be worth.
He said that they had set that level to be three hundred thousand Australian dollars because that is what the system cost them in the first place.
Just to add to this conversation.
Right at the time I joined they had closed the biggest marketing campaign ever.
That campaign came out of an idea by this client.
And once they had the results in by the end of August, they had added four hundred and forty million dollars of new funds under management from that one campaign.
This translates into four point four million dollars of new profit annually.
So let me repeat that, in case you don't get it or can't believe it.
The customer had used the Metaphor Data Interpretation System to analyse data and come up with an idea for a campaign.
The campaign itself, and the changes needed, were budgeted at one million dollars in nineteen ninety three.
The predicted return on the investment was two point two million dollars annually.
The actual return on investment once the dust settled was four point four million dollars annually.
It was the largest and most successful marketing campaign in the company's one hundred years of history.
And it was only possible because of the Metaphor Data Interpretation System and the database I built.
Of course, the guy who did the analysis was a genius as well.
In fact, to point out how stunning this was?
No other company in Australia ran a similar campaign at a similar time because no other company could figure out the business opportunity.
The second half of the four point four million dollars in profit came from funds we got from other companies that we didn’t even know about.
So, by nineteen ninety four I knew two things that were absolutely necessary to make money out of data style projects.
I held the theory that it would be very hard, to impossible, to make money out of data projects without these two things.
One.
The data had to be in a multi level style data structure so that drill down from very high levels of summarised information down to detailed levels of information was fast and not excessively expensive.
Remember, this is nineteen ninety four we are talking about. Long before the days of columnar databases.
Even today, multi level data storage encourages more questions to be asked, because the cost of asking a question is only a few cents.
There is no large bill that you get for asking lots of questions.
A lot of people comment about the large bills they get from cloud database vendors that were unexpected.
A lot of people comment on how you might want to limit the size of queries and not just let anyone ask any question they like on cloud computers.
If you look back to nineteen ninety four we were encouraging business users to ask any and all questions they wanted to ask and the cost was pennies per question.
The same applies today.
The difference in query cost between data that is summarised and stored in advance versus data summarised at execution time is still significant.
It depends on a lot of things.
But it is still significant.
Two.
And this is more controversial today.
If you want your data project to make your company more profit?
Then the people who should have access to the data, and who should be analysing it, are the five smartest business people in your business.
Not your IT people.
Now I know that is really controversial because IT people like to think that they are as smart as business people, and maybe they are.
What business people have that IT people do not have is this.
They understand the business.
IT people, on a very fundamental level, have very little idea of how a multi billion dollar business really works.
IT people find that highly insulting and highly controversial.
It just happens to be true.
Ever since I saw this campaign that brought in four hundred and forty million dollars of new funds under management into the Mutual Life Company of Australia in nineteen ninety four?
And it was explained to me what was done to make that come about?
I have been a believer in the idea that data is prepared by the IT people for query by the business people.
I have promoted that division of labour ever since.
And almost no company has ever taken my advice on that one.
In many of my projects I ended up being the man who performed the role of the business person.
And often I ended up doing it with SQL and Excel because the customer bought Business Objects because it looked pretty.
I have lost count of the customers who bought Business Objects to create pretty reports while I did the actual, real, valuable data analysis in SQL and Excel.
Those of you who know me know that I like to present examples where I can.
So here is a really good example I can talk about.
I did a project for the richest man in Australia in two thousand.
His company published magazines and had thirty seven percent of the magazine market in Australia.
Their share had been falling like a stone for five years and the company was expected to fail over the coming years.
We were the fourth team to be given the job of building a data warehouse and trying to save the company.
We had lost all three prior proposals to cheaper solutions.
Those cheaper solutions had proven very expensive in lost revenues and lost profits.
I designed and built the data warehouse.
To cut the story short the Chief Information Officer was going to hire one of the university professors to come in and run all sorts of complex statistical analysis to see what he could find.
The man I left in charge ran the analysis that one of their senior managers had asked for.
This required nothing more complicated than junior high math.
What they wanted to test was changing the forecasting algorithm, which decides how many magazines to deliver to each of the seven thousand outlets, from being based on sales of all magazines to being based on a manual selection of similar magazine sales.
In our very first meeting to talk about this business this lady asked me if I understood the magazine business.
I said that I never read their magazines but my wife did.
So I openly invited her to “please treat me like a complete idiot who knows nothing about your business” and everyone laughed at me.
She explained that they segment Australian women into many different groups, and each week the magazine's content, and therefore its paid advertising, changes to be more appropriate to the segment of women chosen for that week.
They did issues for older women who were monarchists.
They did issues for younger women who were stay at home mums with babies.
They did issues for younger women who were professional women with no children.
They did issues for Chinese women, for Greek women, for Italian women.
The lady said that they do not sell the same magazine each week.
They sell a different magazine each week to a different target audience and this was their formula for maintaining interest.
She literally said in the very first meeting.
We have long had the theory that because women from the different segments buy their magazines from different outlets we should forecast sales based on similar magazine sales.
But you see, our computer system cannot do this.
So we must do our forecasts based on total sales of all magazines, not by comparison to a manual selection of magazines that are similar in content to the magazine we are selling this week.
She said this in our very first meeting.
Me, having no clue if this was important or not, nodded and said that sounded interesting.
We took notes to test that case when we had the data warehouse built in three months time or so.
It turned out that the lady was spot on.
It turned out that when we selected similar prior editions on which to base the sales forecasts that calculated the shipping quantities to the seven thousand outlets, we got much better results.
And when I say much better, I mean we increased market share from thirty seven percent to forty two percent in two years and doubled the profit of the company.
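The change she described needs nothing beyond junior high arithmetic. Here is a hedged sketch of the idea in Python; the function, the outlet history, and the numbers are all illustrative assumptions, not the company's actual forecasting algorithm:

```python
def allocate(print_run, outlet_sales, similar_editions):
    """Split a print run across outlets in proportion to each outlet's
    historical share of sales for the chosen 'similar' editions.

    outlet_sales: {outlet: {edition: units_sold}}  (hypothetical shape)
    similar_editions: prior editions judged similar to this week's magazine.
    """
    base = {
        outlet: sum(sales.get(e, 0) for e in similar_editions)
        for outlet, sales in outlet_sales.items()
    }
    total = sum(base.values())
    return {outlet: round(print_run * b / total) for outlet, b in base.items()}

# Invented history: outlet A mostly sells 'mums' editions, outlet B 'professional'.
history = {
    "A": {"mums1": 90, "mums2": 110, "prof1": 10},
    "B": {"mums1": 10, "mums2": 10, "prof1": 120},
}

# Forecast for a new 'mums' edition, based only on similar editions...
similar = allocate(1000, history, ["mums1", "mums2"])
# ...versus the old method, based on total sales of all editions.
naive = allocate(1000, history, ["mums1", "mums2", "prof1"])

print(similar)  # {'A': 909, 'B': 91}
print(naive)    # {'A': 600, 'B': 400}
```

Basing the split on total sales smears copies across outlets whose customers never buy that kind of edition; the manual selection sends them where that audience actually shops.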
The richest man in Australia then mandated all his companies have a similar data warehouse and bought three copies of my ETL software to do it. And that was very nice of him.
But the point of the story is this.
The Chief Information Officer was planning on getting one of the top mathematics professors in the country to come in and see what he could find.
And we used junior school level math and doubled the profit of the company.
I have a feeling that a lot of the so called Machine Learning and Data Science being conducted today is conducted at the level of an advanced mathematics degree, or even at PhD level.
When what is needed is a good understanding of the business and junior high school level mathematics.
I can say, without a word of a lie, that in my entire thirty four years I have used first year university level mathematics in only one forecasting algorithm for a retailer.
In my experience it has not been the high level mathematics that makes the difference.
It has been understanding the business and how it works that makes the difference.
Now before all the actuaries shriek at me?
Yes. I lived in life insurance for five years at the Mutual Life Company in Australia.
I do know what actuaries do. And I do know how smart they are.
In some cases actuaries have used the data warehouses I have built.
They have turned up some very amazing results.
There was one time an actuary in a bank came to me and she told me that my data in my data warehouse must be wrong.
This was because the data warehouse said they were operating below the legal level of liquidity. Meaning they had too little in reserves covering their loans, and that they were below the legal level in Australia.
She said this could not possibly be true, because they had a reporting system that showed the liquidity levels, among many other things, and that system showed they were well above the legal limit.
You guessed it.
When we went through and confirmed her calculations the bank was, indeed, operating below the legislated lower limit for liquidity ratios.
The reporting system was shut down and the project manager was fired.
Better to have no data at all than to have bad data.
In that case the issue with operating below the legislated liquidity ratio was only found because one of the smartest business people in the company was given direct access to the underlying data warehouse.
If you asked me, as an IT person, would I care to calculate the liquidity ratio for a bank across its portfolio of products?
I would not know where to start.
But an actuary not only knows where to start. An actuary can actually do that without any help from an IT person.
They just need access to the data and the right tools to do it with.
And I can assure you the right tool is not Excel or Power BI.
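The arithmetic of a liquidity check is simple once the data is trustworthy; what the actuary brought was knowing which numbers to aggregate and what they mean. A toy sketch in Python, where the ratio definition, the product figures, and the ten percent minimum are all my own illustrative assumptions, not the Australian regulation:

```python
# Hypothetical per-product positions; in the real case these numbers
# came out of the data warehouse, fed by many operational systems.
portfolio = [
    {"product": "home_loans",   "liquid": 2_000_000, "liabilities": 60_000_000},
    {"product": "credit_cards", "liquid": 3_000_000, "liabilities": 15_000_000},
    {"product": "deposits",     "liquid": 3_000_000, "liabilities": 25_000_000},
]

LEGAL_MINIMUM = 0.10  # illustrative threshold, not the real legislated figure

liquid = sum(p["liquid"] for p in portfolio)            # 8,000,000
liabilities = sum(p["liabilities"] for p in portfolio)  # 100,000,000
ratio = liquid / liabilities

print(f"liquidity ratio: {ratio:.1%}")  # liquidity ratio: 8.0%
print(ratio >= LEGAL_MINIMUM)           # False -> below the legal floor
```

The hard part is everything above the last three lines: knowing which products, which balances, and which definitions belong in the sum. That is business knowledge, not IT knowledge.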
From Mike's newsletter I would like to quote the following section.
Quote.
Data in the real-world isn’t data in academia. It’s not pre-cleansed or fabricated for toy modeling problems. It’s a horrific mess.
It’s stored at a hundred places all over sundry drives. It’s stored in relational databases, from different vendors on disparate systems. Worse still, a lot of it was stored in Excel spreadsheets or inside this horrible product from Microsoft called Access.
I thought to myself, I suck at cleaning and massaging data and I’ve been doing it for a decade. How the hell is someone who has never worked with it going to manage? I didn’t know at the time but this would be one of the core reasons for the role’s collapse.
End quote.
Mike has gotten this just about right.
Data in billion dollar companies is very messy and very complex.
There is a lot of missing data.
There is a lot of bad data.
There are many operational systems.
They were all built by different people at different times.
None of the developers talked to each other.
In a billion dollar plus company the proper development of the data model and ETL system is difficult.
That is the technology area I have lived in for most of the last thirty four years.
The idea that you can give an academic or technical person access to raw data from a wide array of systems in even a modestly sized company and get the right answers out of any data analysis is something of a joke.
You will get answers. The chances they will be right are very low.
It takes people who know what they are doing to build these ETL systems.
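To make the point concrete, here is a toy conforming step of the kind these ETL systems are full of. The source layouts, codes, and field names are invented for illustration, but the problem, two systems describing the same customer in incompatible ways, is exactly the mess Mike describes:

```python
from datetime import datetime

# Two hypothetical source systems that never agreed on formats or codes.
system_a = [{"cust": "C001", "gender": "M",    "joined": "04/07/1994"}]
system_b = [{"customer_id": "1", "sex": "male", "start_date": "1994-07-04"}]

GENDER_MAP = {"M": "male", "F": "female", "male": "male", "female": "female"}

def conform_a(row):
    # System A: padded keys, single-letter codes, day/month/year dates.
    return {
        "customer_key": row["cust"],
        "gender": GENDER_MAP.get(row["gender"], "unknown"),
        "joined": datetime.strptime(row["joined"], "%d/%m/%Y").date().isoformat(),
    }

def conform_b(row):
    # System B: bare integer keys, words for codes, ISO dates already.
    return {
        "customer_key": f"C{int(row['customer_id']):03d}",
        "gender": GENDER_MAP.get(row["sex"], "unknown"),
        "joined": row["start_date"],
    }

conformed = [conform_a(r) for r in system_a] + [conform_b(r) for r in system_b]
print(conformed[0] == conformed[1])  # True -> the same customer, finally recognisable
```

Multiply this by hundreds of fields, dozens of systems, and decades of undocumented conventions, and you have the real job of an ETL system.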
My now free ETL software is the most productive ETL software available today.
I know this because that is exactly what I wrote it to do, initially, in nineteen ninety five.
I have updated it over the last thirty years to keep it that way.
There are many other things that can happen in a data project that will cause failure of the data project.
These two are just the most important two critical success factors.
We have published an entire methodology for the development of a data warehouse project along with our data models and ETL software.
But at the end of the day?
Every customer of mine who has taken my advice over the last thirty four years has made a lot more profit.
Quite a few of my customers who chose not to take my advice had failed projects.
And a very large percentage of projects that I was not involved in failed as well.
Personally I would like to see more projects be successful in our industry segment.
The problem is that the vast majority of people in my industry segment do not want to take my advice.
So I think we will continue to see more failures.
And with that?
I hope you found this blog post interesting and informative.
Thank you very much for your time and attention.
I really appreciate that.
Best Regards.
Esther.
Peter's AI Assistant.









