I decided to revisit my framework to see if I could improve its performance and increase the number of scenarios it can be used in. It turns out I was able to do both. In this post I will outline the changes I've made and walk through those new scenarios.
Version 3 of the framework relies less on the components provided by SSIS and more on actual C# code. The new framework is built around the Script Task, .NET 3.5, and the System.IO and System.Threading libraries.
The core of the framework is parallel distribution of work. To do this we enlist the help of C# threads. In simple terms, a thread is a way to assign a piece of work to one of the processors your program runs on. SSIS does this behind the scenes by assigning threads to the various components in your package. If you have ever created two or more components and executed the package on a system with multiple processors, you have seen several of those components turn yellow and become active at the same time. The number of components active simultaneously depends on a few SSIS settings and the number of logical processors on your machine. We are going to bypass those settings and limitations by creating our own pool of threads. This new design was put to the test by IMPLAN.
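To make the pattern concrete, here is a minimal sketch of the approach, not the production framework itself: a fixed pool of worker threads drains a shared queue of Access database files. The source folder, the thread count, and the LoadDatabase method are illustrative assumptions.

```csharp
// Minimal sketch: fan a shared queue of source files out to a fixed pool
// of worker threads. The folder path, thread count, and LoadDatabase are
// placeholders, not part of the actual framework.
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

public class ParallelLoaderSketch
{
    private static readonly Queue<string> workQueue = new Queue<string>();
    private static readonly object queueLock = new object();

    public static void Main()
    {
        // Queue every Access database found in a (hypothetical) source folder.
        foreach (string path in Directory.GetFiles(@"C:\SourceData", "*.mdb"))
            workQueue.Enqueue(path);

        // Create our own pool of threads instead of relying on the SSIS engine.
        int threadCount = Environment.ProcessorCount;
        var workers = new List<Thread>();
        for (int i = 0; i < threadCount; i++)
        {
            var t = new Thread(Worker);
            t.Start();
            workers.Add(t);
        }

        // Wait for every worker to finish draining the queue.
        foreach (Thread t in workers)
            t.Join();
    }

    private static void Worker()
    {
        while (true)
        {
            string databasePath;
            lock (queueLock)
            {
                if (workQueue.Count == 0)
                    return; // nothing left to do
                databasePath = workQueue.Dequeue();
            }
            LoadDatabase(databasePath);
        }
    }

    private static void LoadDatabase(string databasePath)
    {
        // Placeholder: read each table from this database and
        // bulk-insert the rows into the destination SQL Server.
    }
}
```

Sizing the pool to Environment.ProcessorCount is only a starting point; in practice the right number depends on how quickly the source can be read and the destination can ingest.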
IMPLAN
IMPLAN needs to import data stored in Microsoft Access databases. To further complicate the process, each database contains 162 different tables.
One year's worth of data amounts to 3,757 Microsoft Access databases, each containing 162 tables for a total of 3.1 billion records. At the time I was designing this, they had a backlog of 12 years of data, with the caveat that any year could be reloaded at any time.
Current Process
Ingesting even a single year took between 72 and 96 hours to complete, and the process failed often.
ETL Solution
SQL Server and SSIS ran on separate machines with identical configurations:
- 16GB memory
- 64-bit Windows operating system
- Intel Xeon CPU E5-2690 v2 @ 3GHz (6 processors)
Results
The custom solution can run one year's worth of data (3.1 billion records) in approximately 11 hours, averaging 79,000 records/second. The process runs to completion without error and can be rerun whenever needed.
To note: the 79,000 records/second rate appears to be the maximum the SQL Server configuration could handle. The framework itself is limited by only three factors:
- The maximum amount of data the destination server can ingest.
- The maximum rate at which the source data can be read.
- The maximum number of threads the framework can run on the SSIS machine (one way to cap this is sketched below).
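To illustrate that last factor, here is a hypothetical sketch of one way to cap the number of simultaneous loads so neither the SSIS machine nor the destination server is overwhelmed. MaxConcurrentLoads and BulkLoadTable are assumed names used for illustration, not part of the framework.

```csharp
// Hypothetical sketch: bound the number of simultaneous bulk loads with a
// semaphore. MaxConcurrentLoads and BulkLoadTable are illustrative names.
using System.Threading;

public static class ThrottledLoader
{
    private const int MaxConcurrentLoads = 8; // tune to the destination's capacity
    private static readonly Semaphore gate =
        new Semaphore(MaxConcurrentLoads, MaxConcurrentLoads);

    public static void LoadTable(string tableName)
    {
        gate.WaitOne();            // block if too many loads are already running
        try
        {
            BulkLoadTable(tableName);
        }
        finally
        {
            gate.Release();        // free the slot for the next waiting thread
        }
    }

    private static void BulkLoadTable(string tableName)
    {
        // Placeholder: stream the table's rows into SQL Server.
    }
}
```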
If you think this framework would be of benefit to your company, please feel free to send me an email.