I had a very interesting bug that had me running in circles for over a week. It turns out the bug was in my code, due to a misunderstanding of how SSIS processes the Data Flow Task. What I found in that time is that SSIS not only evaluates, but actually 'runs' the code on every path, even if there are no qualifying records on that path. Let me break this down for you.
I have a Data Flow task that has 4 possible paths that are mutually exclusive. This means that at any given time only one of the four paths is valid. The Data Flow is shown below:
I have highlighted in red the Script Destination components that were my unwitting accomplices. The Data Flow is one of several clones, and it makes two determinations that combine into the 4 paths. First, is the document I am about to process a statement or an invoice? That pushes it down the left path (invoice) or the right path (statement). Second, does that document already exist? If the document exists, count it and throw it away. If it doesn't exist, count it and insert it into the table.
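Boiled down, the routing amounts to the sketch below. This is purely an illustration in plain C#; the real package makes these splits with Data Flow components, not hand-written code:

```csharp
using System;

class RoutingSketch
{
    // Illustration only: the two decisions that combine into the
    // four mutually exclusive paths.
    record Document(bool IsStatement, bool AlreadyExists);

    static string Route(Document doc) =>
        (doc.IsStatement, doc.AlreadyExists) switch
        {
            (false, false) => "invoice, new: count it and insert it",
            (false, true)  => "invoice, duplicate: count it and throw it away",
            (true,  false) => "statement, new: count it and insert it",
            (true,  true)  => "statement, duplicate: count it and throw it away",
        };

    static void Main() =>
        Console.WriteLine(Route(new Document(IsStatement: false, AlreadyExists: true)));
}
```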
When I run this package, and specifically this Data Flow task, all boxes run green, but only 1 of the paths has a row attached. Yes, this package works on a single input at a time per Data Flow task, with 8 tasks running simultaneously (email me if you are curious why I did this, I might blog it if enough people are interested). This package imports .pdf files. Now that you understand what I was seeing, let's open up the Script Destinations and see what they look like and why they contributed to the problem.
If you look at either side you will see one branch where the document is a duplicate (it already exists in the table) and one where it is new. Listed below is a snippet from the Script Destination on the path for a new document:
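(In sketch form rather than the verbatim script: the table and column names, dbo.Documents and DocHash, and the connection string are stand-ins, but the shape follows the description that comes after.)

```csharp
using System.Data.SqlClient;

public class ScriptMain : UserComponent   // SSIS-generated base class
{
    // Placeholder connection string; the real one comes from the package.
    private const string ConnString =
        @"Data Source=.;Initial Catalog=Docs;Integrated Security=SSPI;";

    private int webunique;   // local copy of the new document's id

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Insert the new document and read back the identity value.
        using (var conn = new SqlConnection(ConnString))
        using (var cmd = new SqlCommand(
            "INSERT INTO dbo.Documents (DocHash) VALUES (@hash); " +
            "SELECT CAST(SCOPE_IDENTITY() AS int);", conn))
        {
            cmd.Parameters.AddWithValue("@hash", Row.DocHash);
            conn.Open();
            webunique = (int)cmd.ExecuteScalar();
        }
    }

    public override void PostExecute()
    {
        base.PostExecute();
        // A Script Component can only write its ReadWriteVariables here,
        // so the handoff from the local variable to the package variable
        // happens in PostExecute.
        Variables.Engine0webunique = webunique;
    }
}
```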
The important part of this script is the insert. I insert a record into a table and get back an auto-increment number (Int32) as an id to further process the document. I actually have a unique 20-byte binary (a hash), but it seems the Web program they use to display the .pdfs I store cannot properly pull/store that unique identifier, so I had to create another unique ID, but I digress (always wanted to say that 🙂 ). That number is transferred from the local variable webunique to the package variable Variables.Engine0webunique. This works fine EXCEPT when I decided to do what most programmers do: assign a value on every path.
The code listed below is the corrected version, but it carries an annotation of what I actually had.
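(Again a sketch rather than the verbatim script; this is the PostExecute of the duplicate-document path's Script Destination.)

```csharp
public override void PostExecute()
{
    base.PostExecute();
    // WHAT I ACTUALLY HAD (the bug):
    //     Variables.Engine0webunique = 0;
    // PostExecute fires on this path even when zero rows flowed down it,
    // and the four paths finish in no guaranteed order, so that 0 could
    // overwrite the real id the insert path had already stored.
}
```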
When I had the line 'Variables.Engine0webunique = 0' in place, a problem arose, one that required me to think in parallel.
The problem was this: all paths in the Data Flow, 4 in mine, must run to completion (all green), even the paths that received no rows, and there is no way to guarantee the order in which they will complete. So the Variables.Engine0webunique value was being set effectively at random: it could end up as the actual unique id pulled back from the table insert, or as the 0 set by the 'document already exists' path. What do I do to solve this?
I needed a way to determine if I actually did get a value from the insert of a new document. I moved away from checking the actual value (is it > 0?) and added a flag. This flag would be set to true if I inserted a record and false if I did not. This enabled me to skip a step only needed on insert (for improved efficiency). Listed below is the outer framework code that uses this flag:
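(Sketched here as a Script Task snippet in the control flow; the flag's variable name, User::Engine0inserted, and the RunAncillaryProcedure helper are assumptions for illustration, not names from the actual package.)

```csharp
// Read the flag set by the Data Flow's insert path.
bool inserted = (bool)Dts.Variables["User::Engine0inserted"].Value;

if (inserted)
{
    // New document: run the ancillary step with the id handed back
    // by the insert path.
    int webunique = (int)Dts.Variables["User::Engine0webunique"].Value;
    RunAncillaryProcedure(webunique);   // hypothetical stand-in
}
// Existing document: skip the ancillary call entirely.
```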
As you can see, the framework checks the state of the flag (true/false) to determine on each run whether the document is new or existing, runs the ancillary procedure needed by new documents, and bypasses that call when the document already exists.
Moral of this post: when setting variables in a Data Flow Task, remember that all paths run, in an unknown order, whether or not they carry rows, so find a mechanism to know which paths actually had data on them.
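One way to build that mechanism (a sketch, using the same assumed Engine0inserted flag variable): have each Script Destination remember whether ProcessInputRow ever fired, and only touch package variables when it did.

```csharp
private bool sawRows;   // did any rows actually reach this path?

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    sawRows = true;     // ProcessInputRow only fires for real rows
    // ... per-row work ...
}

public override void PostExecute()
{
    base.PostExecute();
    if (sawRows)        // leave the package variables alone otherwise
    {
        Variables.Engine0inserted = true;
    }
}
```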