It’s been a while since I’ve used SSIS for anything of substantial complexity, and a lot longer than that since I went through the SSIS class that Brian Knight taught back when we were in the training business together. Recently I had a chance to take on a small project with some twists and it’s reminded me, again, of the difference between knowledge and experience, and how beginners often get trapped in the details of syntax in a way that obscures higher level thinking. It has also been a case study on the merits of tenacity (which sounds better than stubbornness) and the challenge of doing what you know versus venturing into the unknown.
The basics were to process a bunch of Excel files, most of which had more than one sheet, ranging from a few thousand rows up to 900k rows. Files could contain unused sheets, and each sheet was supposed to have – but not guaranteed of course – to have the same columns. From there I needed to validate several columns that would match to foreign keys, do a more complicated lookup based on multiple values, and a few other things, and then insert the rows, omitting duplicates and with a hierarchy to which row was kept if there was a dupe. What made it more interesting was that I had to return the Excel files with a results column added and add a result for every row, including the duplicates. Along the way we discovered a few new requirements, some easy, some of the head scratching type.
My goal was to try to do it 100% in SSIS. An easier approach (for me anyway) would be to use SSIS to import the data, run a series of queries to validate/tag the data, do the insert, and then export it all back to a new Excel file. Not much learning doing it the easy way, and it didn’t look too terrible in SSIS, so that made me stick to the 100% SSIS goal for a while, maybe too long!
With that long intro out of the way, some notes/thoughts:
- Excel imports as unicode, no easy way to change it short of a follow up cast, ugly
- OpenSchema to check sheet header seemed very slow on large wookbooks
- Union ALL gets ugly when you have 10 inputs and you change metadata. Union late in design if you can
- I started off using the full files so I wouldn’t have a design that failed on large files. That took too long. I switched to small files and yes, later, had a problem with big files.
- Suffered from lack of good test files with obvious problems
- Renewed my appreciation for how much easier to explain process in SSIS vs a pure code solution
- Didn’t see a better way, so tagged dupes using script and a hashet. SORT will remove them, I needed to keep the entire row
- My plan was to do a series of tests, at at each test take the bad rows and direct to a catch-all union, adding a result column and reason in flight. Worked ok.
- I tested first column of each sheet and if it matched, good enough. The few odd failures after that weren’t a big deal, and weren’t worth more effort right then
- Working with script is both great and horrible, waiting on VS to load each time is excruciating, 5, 10, 15 seconds each time. What the heck.
- Layout is a pain, as much as I see the value in the flowchart model.
- Lookup perf is very very good if you cache all
- I needed to add a column with a static value. I did it early because it was easy, should have done late so I didn’t have to carry it through the whole pipeline
- For the final export I had to convert back to unicode, was easier to do there because I could do the cast in the SELECT
I finally got stuck when the only way I could see to solve one problem was to use a SORT transform and it just stalled on the biggest pages. Running out of time, patience, I did some minor changes to the package to use most of the work, then land it it in a table near the end, do the final ops, and then pick it back up with a final data flow for the export.
The final lesson? Hard to say. It’s always a challenge to use the right tool for the job, yet know when you can substitute a chainsaw for the circular saw to speed things up, or vice versa. Learned a lot, think I could do it faster if I started over, and remembered that knowing when to stay the course/change continues to be the hardest thing to know. I know that there’s nothing like new challenges to drive learning.