April 13, 2016 at 8:01 am
Hi all
I've got an SSIS package that loads 140+ files sequentially.
I want to make it process the files in parallel (for example, 4 at a time) by checking the file-name for certain characters.
All the files start with SR and the next letter is scattered throughout the alphabet, with one file starting SR18.
I've currently got a For Each loop processing the files and, on the Collection section, it's set to pick up any CSV file.
To give an example, I'd want one For Each loop to process any file starting SRA*.csv to SRF*.csv (I'd probably include the file starting SR18*.csv in this list).
I'd then want a second For Each loop processing any files starting SRG*.csv to SRM*.csv.
etc, etc.
Has anyone had any success in processing files in parallel in this manner and, if so, can someone please point me in the right direction?
I could, as a workaround, make 27 For Each loops to handle each file individually and run those in parallel, but that seems like overkill.
I'd appreciate any help anyone could give.
April 13, 2016 at 8:10 am
richardmgreen1 (4/13/2016)
Hi all. I've got an SSIS package that loads 140+ files sequentially.
I want to make it process the files in parallel (for example, 4 at a time) by checking the file-name for certain characters.
All the files start SR and then the next letter is scattered throughout the alphabet with one file starting SR18.
I've currently got a For Each loop processing the files and, on the Collection section, it's set to pick up any CSV file.
To give an example, I'd want one For Each loop to process any file starting SRA*.csv to SRF*.csv (I'd probably include the file starting SR18*.csv in this list).
I'd then want a second For Each loop processing any files starting SRG*.csv to SRM*.csv.
etc, etc.
Has anyone had any success in processing files in parallel in this manner and, if so, can someone please point me in the right direction?
I could, as a workaround, make 27 For Each loops to handle each file individually and run those in parallel but that seems overkill.
I'd appreciate any help anyone could give.
Does each file have its own target table? Trying to load more than one file into the same table at the same time will cause you locking issues.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 13, 2016 at 8:15 am
Hi Phil
Yes, each file goes into its own table and is further processed by SQL separately.
April 13, 2016 at 8:34 am
richardmgreen1 (4/13/2016)
Hi Phil. Yes, each file goes into its own table and is further processed by SQL separately.
That's a good start.
Do the source files and target tables have the same formats? I mean column names and data types ...
What I'm wondering here is: can you (potentially) parameterise the source and target details for all files, using multiple identical (parallel) data flow tasks?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 13, 2016 at 8:40 am
The files contain different data (and headers) and are loaded via a BCP routine.
The targets are also dealt with in a similar fashion (the target SQL table name starts with tbl and is completed by the filename without the extension).
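As a small illustration only (the actual naming code isn't shown in the thread, and the file name below is made up), deriving that target table name from a file name could look something like this:

-- Illustration only: build the target table name as 'tbl' + filename without its extension.
DECLARE @FileName    sysname = 'SRA_Extract.csv';   -- hypothetical example file name
DECLARE @TargetTable sysname =
    'tbl' + LEFT(@FileName, LEN(@FileName) - CHARINDEX('.', REVERSE(@FileName)));

SELECT @TargetTable AS TargetTable;   -- returns tblSRA_Extract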
April 13, 2016 at 9:14 am
richardmgreen1 (4/13/2016)
The files contain different data (and headers) and are loaded via a BCP routine. The targets are also dealt with in a similar fashion (the target SQL table name starts with tbl and is completed by the filename without the extension).
As you did not mention BCP earlier, I assumed that the file imports were via SSIS data flows. Or perhaps that is what you are considering doing.
As the file formats are different, you will need separate data flows (at a minimum, one per file format).
These data flows can all run in parallel, once you have gone through the tedium of setting them up.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 13, 2016 at 6:08 pm
140+ files... each file goes to a separate table... etc...
It sounds more like a version of "poor man's replication" of a whole or nearly whole database. I'd recommend talking with the folks at the source of the data to see if you could get a backup file to restore from, instead of the pain you're currently going through.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 14, 2016 at 3:32 am
Hi Jeff
If only......
We get the CSV files as a data extract (I haven't got a clue what format the back end of the application runs on).
This is also a national system with information from various sources that are collated by the vendors. We are only allowed to see our data and that's what they send us.
April 15, 2016 at 1:54 am
Probably the best solution would be to put all 140+ file names into a queue table and then have 4 separate For Each Loops, each taking one record at a time from the queue (with appropriate controls to ensure each file is processed exactly once, etc.).
A simpler version would be to put all 140+ file names into a table and then have 4 separate For Each Loops which select rows from this list based on different WHERE clauses. The downside to this is that you may end up with unbalanced loops, with one loop processing lots of large files and another processing lots of small files.
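Not from the original post, but as a rough sketch of the "queue table with appropriate controls" idea: each of the 4 loops could claim one file at a time with an UPDATE ... OUTPUT query using READPAST, so two loops never pick up the same file. All table and column names here (dbo.FileQueue, FileName, Status, ClaimedBy) are invented for the example.

-- Hypothetical queue table; names are illustrative only.
CREATE TABLE dbo.FileQueue
(
    FileQueueID int IDENTITY(1,1) PRIMARY KEY,
    FileName    varchar(260) NOT NULL,
    Status      varchar(20)  NOT NULL DEFAULT 'Pending',   -- Pending / Processing / Done
    ClaimedBy   int          NULL                           -- which loop claimed the row
);

-- "Claim the next file" query, run repeatedly by each loop until no row comes back.
-- READPAST skips rows currently locked by the other loops, so a file is only claimed once.
DECLARE @thread int = 1;   -- in the package this would be mapped from a per-loop parameter

UPDATE TOP (1) q
SET    Status    = 'Processing',
       ClaimedBy = @thread
OUTPUT inserted.FileName
FROM   dbo.FileQueue AS q WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE  q.Status = 'Pending';

Each loop would run this in an Execute SQL Task, process the returned file, mark it Done, and go round again until the query returns nothing.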
Jeremy
April 15, 2016 at 3:16 am
Jez-448386 (4/15/2016)
Probably the best solution would be to put all 140+ file names into a queue table and then have 4 separate For Each Loops, each taking one record at a time from the queue (with appropriate controls to ensure each file is processed exactly once, etc.). A simpler version would be to put all 140+ file names into a table and then have 4 separate For Each Loops which select rows from this list based on different WHERE clauses. The downside to this is that you may end up with unbalanced loops, with one loop processing lots of large files and another processing lots of small files.
Jeremy
How would you propose handling the fact that the file formats are different, doing it this way?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 15, 2016 at 7:57 am
The OP already has an SSIS package that loads 140+ files sequentially.
I doubt that there are 140+ separate tasks, so I guess that this has already been solved.
Jez
April 15, 2016 at 9:33 am
Jez-448386 (4/15/2016)
The OP already has an SSIS package that loads 140+ files sequentially. I doubt that there are 140+ separate tasks, so I guess that this has already been solved.
Jez
You mean this?
richardmgreen1 (4/13/2016)
The files contain different data (and headers) and are loaded via a BCP routine. The targets are also dealt with in a similar fashion (the target SQL table name starts with tbl and is completed by the filename without the extension).
OK, how does the BCP routine work with your idea?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 17, 2016 at 6:34 am
You can do it, although I would not use SSIS to control the execution of the loading. For that I would instead consider a C# or PowerShell script that retrieves the list of files to load and executes the loads by spawning processes in parallel; the load process itself could be an SSIS package or a BCP command. I have this working at a client with PowerShell and it is used extensively. The last process we built with this framework had 200+ files, each requiring a different SSIS package to process it, all driven through config files.
To do it in SSIS, I would consider this approach.
Set up a table on your staging DB to hold the filenames and a sort priority.
The sort priority would be used to determine the order the files are processed. If sizes vary a lot, and one day a file takes 5 minutes to load while on another day it takes 30 minutes, I would instead consider the time last loaded as the sort priority, but this is a minor detail.
The idea of the sort order is to spread the files that take the longest to process over all the threads, instead of having them eventually allocated to the same thread. With SSIS this is still going to be a problem, but it can be minimized.
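A possible shape for that control table (all names invented for the sketch, and the file names are just placeholders) would be along these lines:

-- Hypothetical control table for the staging DB; names are illustrative only.
CREATE TABLE dbo.FileLoadList
(
    filename varchar(260) NOT NULL PRIMARY KEY,
    sort     int          NOT NULL   -- sort priority: slowest/largest files get the lowest numbers
);

INSERT INTO dbo.FileLoadList (filename, sort)
VALUES ('SRA_Extract.csv', 1),   -- placeholder file names
       ('SRB_Extract.csv', 2),
       ('SR18_Extract.csv', 3);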
Within the package itself, set up 10 Execute SQL Tasks and a For Each Loop container attached to each of the SQL tasks.
Each SQL task would be set up to retrieve a list of filenames based on a thread ID that is different for each task,
so SQL task 1 would retrieve thread 1, SQL task 2 would retrieve thread 2, and so on.
The package would also have, defined as a parameter, the number of threads to run concurrently, so you can run anything from 1 to 10 threads based on your load pressure.
Sample SQL for each task:
select filename
from (
    select t.filename
         , row_number() over (order by t.sort) % @num_threads + 1 as thread_num
    from (
        values
          ('file1', 3)
        , ('file2', 1)
        , ('file3', 1)
        , ('file4', 1)
    ) t (filename, sort)
) t2
where t2.thread_num = @requested_thread
The SQL task would pass as parameters the desired number of threads to use (@num_threads) and which thread list to process (@requested_thread).
The output goes into an object variable per thread, which is then used by the loop container to process each filename retrieved.
The issue with doing this in pure SSIS is what to do if one file fails. Should you stop the remaining files from being processed? Normally you would still process the remaining files and then try to reprocess the failing one once the error is fixed. But this also means your SSIS package needs to allow for errors and not stop processing on the first error.
The SQL statement could also be used to retrieve more information about how to process the file - e.g. it could supply a package to execute or a bcp command to execute over the file.
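Extending the earlier sample along those lines (the extra load_command column and the bcp commands are only illustrative), the per-thread query could return whatever the loop needs to run each load:

-- Illustrative only: return a load command alongside each filename for the requested thread.
DECLARE @num_threads      int = 4;   -- mapped from a package parameter in practice
DECLARE @requested_thread int = 1;   -- mapped from a package parameter in practice

SELECT t2.filename, t2.load_command
FROM (
    SELECT t.filename
         , t.load_command
         , ROW_NUMBER() OVER (ORDER BY t.sort) % @num_threads + 1 AS thread_num
    FROM (
        VALUES
          ('file1.csv', 3, 'bcp StagingDB.dbo.tblfile1 in file1.csv -c -T -S myserver')
        , ('file2.csv', 1, 'bcp StagingDB.dbo.tblfile2 in file2.csv -c -T -S myserver')
        , ('file3.csv', 1, 'bcp StagingDB.dbo.tblfile3 in file3.csv -c -T -S myserver')
        , ('file4.csv', 2, 'bcp StagingDB.dbo.tblfile4 in file4.csv -c -T -S myserver')
    ) t (filename, sort, load_command)
) t2
WHERE t2.thread_num = @requested_thread;

The For Each Loop could then hand load_command to an Execute Process Task (or pick a child package to run) instead of hard-coding the load logic per file.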
April 17, 2016 at 8:27 am
richardmgreen1 (4/14/2016)
Hi Jeff. If only......
We get the CSV files as a data extract (I haven't got a clue what format the back end of the application runs on).
This is also a national system with information from various sources that are collated by the vendors. We are only allowed to see our data and that's what they send us.
Ah. Got it. Thanks for the feedback.
Something that you should be made aware of... your good efforts in this area may be totally in vain and may actually cause your loads to take longer. How is that possible?
The same thing can happen to you as when people try to improve performance of backups by splitting the backup across multiple files. If you can GUARANTEE that the files you're writing to are on separate spindles and, therefore, on separate read/write heads/arms, only then will you realize a performance gain. If not, you'll actually slow things down because the read/write heads/arms will have to move from one file to another, and such movement takes a relatively HUGE amount of time compared to writing just one file in a sequential manner.
Splitting your 140 files to parallel processes will only bring more CPUs into play and CPU time is NOT the bottleneck when it comes to writing to disk. Before you spend any more time trying for parallel loads, you need to find out if you have separate/parallel disk spindles that will come into play and it's best if you don't leave it up to the luck of the draw as to which spindles will be used on the SAN or whatever storage device you may have.
If you do want to take such a chance, then start out with an experiment. Create 4 large files that take at least a minute each to load. They can even be the same file data with different file names just to make life easier. Try the serial load and measure the duration. Then try the parallel load and see if you get any performance gain. Then, try it 3 or 4 more times and see if the performance gain is 1) large enough to be worth it and 2) is repeatable.
I could be wrong but I'd bet that the performance gain won't happen or will be small enough to not make it worth it unless you can guarantee writes to 4 separate spindles. Might not even be worth it then because there's a real chance that your ETL system isn't the only show in town and other processes will be using the same read/write heads/arms. You can't change physics.
And, yes... I'm absolutely suggesting that this test be executed on the production box using the same drive "letters" as would be used if you modified the current system to run in parallel. It's the only way that you'll know if the other traffic on the disks will interfere or not or... if the parallel system will "freeze" or otherwise interfere with normal traffic on the disks.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 18, 2016 at 4:14 am
@frederico - we did have a think about the load balancing option by using some sort of load table to pick up the next available file but couldn't quite figure it out. I'll see what I can do with your idea.
@jeff - I didn't even think about the disks/spindles issue. I'm not sure where on the SAN our drives sit, so that's another variable to put in the mix. Just a thought - if we could guarantee the use of SSDs on the SAN (I think we've got some), would that help?