Hello and welcome back to our latest post.
Since we made SeETL open source and free, we have already picked up three new customers who studied the code and realized that SeETL is a better way to build ETL systems than their previous methods.
It has been great to pick up new customers for SeETL.
Are you looking for a faster, better, cheaper, more reliable ETL system?
Download the SeETL code and demo databases and take a look for yourself!
I am also pleased to be able to let you know that my new customers are now trained in SeETL and are available to build (or convert) and support your ETL systems for the long term.
If you migrate to SeETL from Informatica and DataStage you should expect a reduction in annual ETL costs of around 30-50% in total. With that sort of savings on the table, it is well worth looking into SeETL as the next ETL tool you use.
On to the topic for this post!
This post is more technical than most of our posts because it marks a change in how we recommend implementing SeETL. In the past we recommended running two SQL Servers and placing the control tables in the data warehouse. We are now changing that recommendation, and I wanted to put up a post to explain why we have changed our position on something we have held steady on for 16 years.
In the past, remembering that C++ SeETL is now 16+ years old, we designed SeETL C++ to be executed on the target production data warehouse machine because that allowed the workload for the staging processing to be performed on a second machine.
Back in the early 00s many customers, especially SQL Server customers, wanted to place their staging area on one SQL Server machine and the data warehouse on another SQL Server machine. There were a number of reasons why this was desirable, not least among them that licensing SQL Server for the staging area required only a single Client Access License (CAL).
This worked very nicely: the scheduler would execute on the data warehouse machine and send requests to read data from the staging area views over on the staging machine.
Back in the early 2000s no one thought that databases would be likely to be able to perform ETL processing any time soon.
Informatica and DataStage were the subjects of massive development investment, on the assumption that SQL would not be replacing their functionality any time soon.
But now it is 2019. We recently did an upgrade for a client using just SQL Server Standard Edition, and what we found on the new production machine was quite stunning. The prior machine was just under three years old when we performed the upgrade, and we saw a 4x improvement in disk subsystem performance on the new hardware.
We were expecting some improvement in disk performance, but 4x was simply stunning. With that sort of disk performance, SQL Server, even at the Standard Edition level, has become a workable data warehouse platform for a reasonably sized data warehouse.
As a part of experimenting with this new machine over the last six months, and having recently worked out how to build multi-level summaries using SQL rather than the SeETL C++ code, we have changed our position on the placement of the staging area and control tables.
We are now altering our standard recommendation from having two SQL Servers, one for the staging area and one for the data warehouse, to having both the staging area and the data warehouse on the one SQL Server.
Of course, in our deployments on Netezza this has always been the case, so the recommendation is not so “radical”.
On Oracle we have varied the implementation. Often on large Sun machines we used one physical machine but ran two virtual machines: one for staging and one for the data warehouse.
From today forward, our standard recommendation on SQL Server and Oracle will be to place the control tables in the staging area, and to place both the staging area and the data warehouse on the one physical machine, so that all ETL processing can happen on that machine.
We are still recommending the older SeETL C++ programs for dimension table processing, because these programs handle multi-level aggregation much more simply than the SQL generation version does.
However, now that it is very simple to take SeETL-generated SQL for fact tables and implement multi-level summaries, we are recommending SQL as the code to use for processing multi-level fact tables.
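To illustrate the idea (this is not SeETL's generated code; the table and column names here are hypothetical), a multi-level fact summary is just a set of plain INSERT...SELECT...GROUP BY statements, one per summary level, all run inside the database. This sketch uses SQLite via Python purely so the SQL is runnable; on SQL Server or Oracle the same statements would be submitted directly:

```python
import sqlite3

# Hypothetical schema: a detail fact table summarised at two levels.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE fact_sales_detail (
        day_key INTEGER, month_key INTEGER,
        store_key INTEGER, region_key INTEGER,
        sales_amount REAL
    )
""")
rows = [
    (1, 1, 10, 100, 50.0),
    (1, 1, 11, 100, 25.0),
    (2, 1, 10, 100, 10.0),
]
con.executemany("INSERT INTO fact_sales_detail VALUES (?,?,?,?,?)", rows)

# Each summary level is a simple aggregation of the detail level.
con.executescript("""
    CREATE TABLE fact_sales_day_store AS
        SELECT day_key, store_key, SUM(sales_amount) AS sales_amount
        FROM fact_sales_detail
        GROUP BY day_key, store_key;

    CREATE TABLE fact_sales_month_region AS
        SELECT month_key, region_key, SUM(sales_amount) AS sales_amount
        FROM fact_sales_detail
        GROUP BY month_key, region_key;
""")

total = con.execute(
    "SELECT sales_amount FROM fact_sales_month_region"
).fetchone()[0]
print(total)  # 85.0
```

Because every level is derived by SQL inside the database, no rows ever leave the server to be aggregated by an external ETL engine.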
If you wish to treat the control tables as living in the target data warehouse, which the C++ programs insist on, then we recommend creating views that point to the control tables in the staging area.
In this way it will appear to the Data Warehouse Architect as if the control tables exist in both the staging area and the data warehouse when, in fact, they reside only in the staging area.
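A minimal sketch of that view technique, assuming hypothetical names (the control table `stg_ctl_scheduler` and the `stg_`/`dw_` prefixes stand in for the separate staging and warehouse databases on the one server; on SQL Server the real view would use a cross-database name). SQLite via Python is used here only to make the SQL runnable:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- The control table physically lives in staging.
    CREATE TABLE stg_ctl_scheduler (
        job_name TEXT PRIMARY KEY,
        status   TEXT
    );
    -- The warehouse-facing "control table" is just a view on staging.
    CREATE VIEW dw_ctl_scheduler AS
        SELECT job_name, status FROM stg_ctl_scheduler;
""")

# A program that expects the warehouse-side name sees the staging rows.
con.execute(
    "INSERT INTO stg_ctl_scheduler VALUES ('load_fact_sales', 'OK')"
)
status = con.execute(
    "SELECT status FROM dw_ctl_scheduler WHERE job_name = 'load_fact_sales'"
).fetchone()[0]
print(status)  # OK
```

There is one physical copy of each control table, so there is nothing to keep in sync, while code that addresses the warehouse name continues to work unchanged.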
With the performance improvements we are seeing in hardware and disk technologies the future of ETL is clearly SQL.
The time is up for Informatica and DataStage.
As much as these were great products in their day, if you are still paying significant licensing fees and paying developers to code Informatica and DataStage? You are spending much more money than you need to spend.
Using SeETL C++ for multi-level dimension tables and SQL for multi-level fact tables, you can remove all of your Informatica and DataStage processing (and licenses) and run a leaner, cheaper, faster, better, more reliable ETL system into the future.
We have another post coming out about that subject shortly.
For now? This post is to announce that we are changing our standard advice to placing the control tables in staging and using SQL for multi-level fact tables.
The world of ETL is getting faster, better, cheaper and more reliable every day as we put more and more processing into the database!
Best Regards
Peter