Hello and welcome back to our latest post.
In this blog post we are going to talk about a new feature implemented recently for one of our customers who has reseller rights to our software.
This means they have the source code and they have their own branded version of our software that they present as their software to their clients. We added this feature to their version of the software. It has not been added to SeETL yet.
So, if you download the demonstration database on the button below you will find a fully worked example of how SQL ETL is loaded in to the SQL Statements Table and executed from there using CTLU023.
If you want to follow along with this blog post it might be useful to download the example data warehouse so you can see exactly that I am talking about.
Many years ago one of my customers asked me to build a scheduler to run the ETL subsystems we were developing. I did my best to explain that there were many other schedulers out there and it might not be the best use of my time to create something where there were many already existing inexpensive products.
But my customer offered to pay me and that’s what business is about, right?
So I built the very first version of the scheduler all the way back in 2003.
Almost surprisingly, it has been the backbone of our implementations around the world.
The scheduler is very reliable.
It just runs and runs and runs like the energiser bunny.
The only two problems we have encountered are:
1. If it is installed on a different machine to the database the schedule is in and there is a network error we were never able to get it to automatically return to it’s desired state efficiently. The code to do that was quite significant and we decided it was too much overhead for not enough benefit.
2. If the machine it is installed on crashes so hard the scheduler did not have enough time to record the fact it was going down. In this case the schedule is left in a “waiting” state and the operator has to restart the batch properly.
Given the reliability of data warehousing machines now? We very, very rarely have crashes mid processing that are so severe the scheduler does not have time to issue an abend message.
When we moved to SQL generation for ETL the number of steps in a schedule went up because there were at least 3 pieces of SQL for each dimension table mapping as opposed to just one step previously.
When we moved in to the area of delta detection using SQL the number of steps exploded.
The delta detection generation creates 7 SQL files per table having delta detection applied.
Then when we moved in to ERP BI Development, where the number of tables exploded, the number of SQL Statements to perform the delta detection processing exploded yet again.
It has now gotten to the point where it is not at all unusual to have more than 10,000 separate sql files just to perform the delta detection processing for an ERP.
Back when we were deciding how to process SQL statements inside the scheduler we created CTLU023 and the SQL Statements Table in the dictionary. The idea was that all SQL that was going to be a part of the schedule would be placed in to the SQL Statements Table via an insert. The old statement would be retained and archived but it would not be deleted.
In this way the person who was loading the ETL SQL Statements could be given update access to the SQL Statements Table, not delete access, and it would become impossible for statements to be deleted. In many databases it is also possible to log the actual update to the SQL Statements table and who made that change in such a way that the person can not remove the log.
This storing of each version of an SQL Statement that can be made available to the ETL Subsystem was meant to discourage malicious attempts at fraud or manipulation of data. We also store the first 8,000 characters of each statement executed plus the date and time the statement was loaded in to the SQL Statements Table. So the logging and auditability of the SQL Statements is excellent in my opinion.
We really liked this approach as it meant that the likelyhood of malicious attempts to modify the SQL Statements for personal gain could be all but eliminated if the customer so desired.
Of course, when you have more than 10,000 files with SQL statements in them and they have to be loaded in to the SQL Statements Table before they can be executed? That’s is an extra step in the maintenance process.
So our reseller asked us if we could give him a version of CTLU023 that would execute the statements from the file system rather than the SQL Statements Table. Our reseller, of course, taking responsibility for any malicious manipulation of such SQL Statements.
In many smaller companies that use SeETL the support staff have direct update access to the data warehouse database. So the SQL Statements Table does not offer any real protection from malicious fraud because the support staff have update access to the tables themselves.
So for our customer we created a program which we will call CTLU026 in the SeETL version. We also updated the SeETL Run Time VB application to be able to set a processing directive to mark SQL Code as to be executed from the file system.
The customer performs the same “load sql” processing using the SQL Statement Load workbook one time only for each statement. The SQL Statement Table records only the first line of the SQL Statement which is a directive to execute the statement from the file system.
So this new program reads the SQL Statements Table, reads the directive that it is to execute the statements from the file system, and then goes to the file system to read the SQL statements to execute them.
This, of course, has the obvious benefit that when there is an error in the SQL the support staff can go to the individual file, test, correct the error, and then restart the scheduler. No other manual alteration is needed to adjust the schedule for the processing to complete. Most notably, the corrected statement does not have to be reloaded in to the SQL Statements Table.
This means that as columns are added and SQL changes there is no step that can be “forgotten” to reload the updated SQL Statement to the SQL Statements Table.
It also means that if someone accidentally damages an SQL Statement File then that might fail in the ETL processing in the next cycle of ETL.
You can not have your cake and eat it too!
Our customer has been running this feature for a few weeks now and his feedback is that he likes it. He prefers to be running SQL statements from files.
One of the added benefits is this.
A single SQL statement being sent in to SQL Server (and most databases) via ODBC is limited to 64K. So the SQL Statements Table had 8 chunks of 8K for a total of 64K for as many SQL Statements as desired. This could be one statement or it could be 100 statements. The total must not pass 64K when using CTLU023.
With the version the file itself can be of any size, including much larger than 64K. The limitation is that a single statement can not be larger than 64K.
This means that areas in the ETL processing where you want to run a LOT of statements but do not want to split them in to different files can now be supported.
The most normal one being the processing to turn off the row sent to EDW indicator.
This requires one statement per table and when you have 1,800+ staging area tables that is a lot of statements.
These statements can now be placed in one file and processed via CTLU026 (or equivalent) and not need many files loaded in to the SQL Statements Table.
As I said at the top. At the time of writing this feature is not added to SeETL.
It will be added when customers need it.
I hope this blog post was informative for you.
What is your opinion?
Do you think it is a good idea to be able to execute files containing SQL statements from the file system with pretty much unlimited statements in one file? Or should they be loaded in to the SQL Statements Table for extra auditability?
Feel free to share your opinion in the comments section!
Lastly, since I am no longer welcome on linkedin?
Please share this blog post via your linkedin and twitter if you have one!
Your sharing of my content would be greatly appreciated!
I want to get lots and lots of people using SeETL now that it is free and open source.
Best Regards
Peter