August 8, 2022 at 11:56 am
Hi All,
I have a requirement where we will receive multiple CSV files in a shared path, and those files need to be loaded into their corresponding tables.
For example, if there are three Employee files, such as employee_2022_08_08_11_11, employee_2022_08_07_12_00 and employee_2022_08_08_16_07, I need to process all three into my employee table, and likewise a Department file such as Department_2022_08_08_12_08 into the Department table in SQL Server.
Previously I had a reference table where the file names were stored as:
1, employee
2, Department
3, fact_salary
4, fact_revision
So we were processing the files by looking up their names in the reference table, but the requirement has changed: we now need to process whatever files are in the path, irrespective of their dates. After processing, each file is moved to another folder (this part is working fine). But how do we pick up multiple files and process them without altering the existing reference-table logic?
The SSIS package uses the BULK INSERT option to load the data from the shared path.
Is there an approach or any suggestions?
August 8, 2022 at 1:25 pm
I don't really understand what this reference table is doing.
To load the files, can't you have two ForEach Loops, with their collections set to each set of files (wildcard matching), so the Employee loop picks up all the employee-prefixed files and the Department loop picks up all the Department-prefixed files?
Then within each loop, add a dataflow task. Make the source component dynamic by adding an expression to the connection string property and pointing it at the variable that the Foreach loop populates with the current filename on each iteration of the loop.
The Destination of the dataflow will be your static employee or department tables.
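If you are keeping the BULK INSERT approach from the original post rather than a Flat File source, the same loop variable can drive an Execute SQL Task instead of a dataflow. A minimal sketch, assuming the ForEach loop maps the current file's full path to the task's parameter; the target table and the CSV options are placeholders:

```sql
-- Runs once per loop iteration from an Execute SQL Task.
-- ? is the task parameter mapped to the loop's current-file variable.
DECLARE @FilePath nvarchar(260) = ?;

-- BULK INSERT will not accept a variable for the file name,
-- so the statement has to be built and executed dynamically.
DECLARE @sql nvarchar(max) = N'
    BULK INSERT dbo.employee            -- placeholder target table
    FROM ''' + @FilePath + N'''
    WITH (FIRSTROW = 2,                 -- skip the header row
          FIELDTERMINATOR = '','',
          ROWTERMINATOR = ''\n'');';

EXEC sys.sp_executesql @sql;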
August 8, 2022 at 4:47 pm
The reference table is where the actual file names were stored. Initially there was only one file for each table, so we kept a reference table and ensured that each input file would arrive with that exact name, but that has since changed.
August 8, 2022 at 5:45 pm
I guess I don't understand the issue here. Just load up all the file names for, say, the employee table and process them in date order. Then move the processed files out to the "processed" folder, either en masse when the whole run is done or as each file is processed.
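One nice property of the naming convention is that the yyyy_mm_dd_hh_mm stamp sorts correctly as plain text, so "date order" is just name order. A sketch of building the ordered list in T-SQL, assuming xp_dirtree is available, the service account can read the share, and using a placeholder path:

```sql
-- Pull the file list from the shared path into a temp table.
IF OBJECT_ID('tempdb..#Files') IS NOT NULL DROP TABLE #Files;
CREATE TABLE #Files (FileName nvarchar(260), Depth int, IsFile bit);

INSERT INTO #Files (FileName, Depth, IsFile)
EXEC master.sys.xp_dirtree '\\server\share\incoming', 1, 1;  -- placeholder path

-- The embedded yyyy_mm_dd_hh_mm stamp is zero-padded, so plain
-- text ordering is chronological: oldest file first.
SELECT FileName
FROM #Files
WHERE IsFile = 1
  AND FileName LIKE 'employee%'
ORDER BY FileName;
```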
--Jeff Moden
Change is inevitable... Change for the better is not.
August 8, 2022 at 10:32 pm
I see a further problem: the files have timestamps embedded in their names. Processing them in 'any order' could be problematic if there are actions/updates for the same employee(s) across files.
For example, let's say employee A was updated in all three files and the correct value for that employee is in the latest file. If the loop processes the latest file first and the earliest file last, then the data in the table ends up incorrect.
Processing that correctly is going to need more than just an SSIS foreach loop.
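If guaranteeing processing order inside the package is awkward, another way is to make the order irrelevant: stage every row together with its source file name, then keep only the latest row per employee before updating the real table. A sketch, with hypothetical staging table and column names:

```sql
-- Assumes the package loads every file into dbo.employee_staging and
-- stamps each row with its source file in SourceFile (hypothetical names).
-- The timestamped file names sort as text, so DESC ranks the latest file first.
WITH ranked AS
(
    SELECT  s.*,
            ROW_NUMBER() OVER (PARTITION BY s.EmployeeID
                               ORDER BY s.SourceFile DESC) AS rn
    FROM dbo.employee_staging AS s
)
SELECT *    -- only the most recent version of each employee survives
FROM ranked
WHERE rn = 1;
```

The rn = 1 rows can then be merged into the employee table and the staging table truncated, regardless of the order in which the loop happened to read the files.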
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
August 9, 2022 at 1:10 pm
The reference table makes little sense to me because, as I understand it, each row represents a target table. You can't really feed that into a single dataflow, because each table will have a different schema/metadata.
So I'd expect to see hardcoded dataflows, one per target table. Each of those can sit inside its own ForEach loop that picks up only the files for that table, or you could use a single ForEach with a Script Task that checks each filename and routes the file to the relevant dataflow (a sketch of that filename check is below). That way you can add new files without changing the package, but if you add a new table, I'd expect the package to change to add a new data flow.
Or are you using a Script task as the data source and dynamically creating output paths based on the file name?
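If the existing reference table is staying, one way to reuse it without hardcoding file names is to match each incoming file back to its target table by prefix, since only the timestamp part varies. A sketch, with hypothetical names for the reference table and its columns:

```sql
DECLARE @FileName nvarchar(260) = N'employee_2022_08_08_11_11.csv';

-- dbo.FileReference stands in for the existing reference table,
-- with FilePrefix holding the stored name (employee, Department, ...).
SELECT r.FilePrefix
FROM dbo.FileReference AS r
WHERE @FileName LIKE r.FilePrefix + N'[_]%';  -- prefix, then the timestamp
```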