Over the past few days I have received a number of emails asking how I import binary files with SSIS and how to improve throughput by running tasks in parallel. Articles scattered throughout this blog cover bits and pieces: how to use the import file task, how to use the Enhanced Threading Framework (ETF) I put together, how to use SHA-1 to find duplicates, and so on. That is all well and good for someone looking for a specific piece of the puzzle, but what about the person who wants the whole puzzle? I put together a sample SSIS package that has all of the components (including SQL scripts to build the database pieces):
- Script task to create a list of files to import (.pdf, but can easily be modified)
- Data Flow Task to import that list into a table
- Structures to show the Enhanced Threading Framework (2 engines in the example)
- Within each ETF Engine:
  - Call to the procedure that extracts a single file to be imported
  - Script Component that sources the file to be imported
  - Import File Component
  - Script Component that calculates the SHA-1 hash
  - Lookup to only push files that do not exist (based upon hash)
  - Data destination to push the file and hash into the table
The source can be downloaded here.