April 13, 2016 at 8:01 am
Hi all
I've got an SSIS package that loads 140+ files sequentially.
I want to make it process the files in parallel (for example, 4 at a time) by checking the file-name for certain characters.
All the files start with SR and the next letter is scattered throughout the alphabet, with one file starting SR18.
I've currently got a For Each loop processing the files and, on the Collection section, it's set to pick up any CSV file.
To give an example, I'd want one For Each loop to process any file starting SRA*.csv to SRF*.csv (I'd probably include the file starting SR18*.csv in this list).
I'd then want a second For Each loop processing any files starting SRG*.csv to SRM*.csv.
etc, etc.
Has anyone had any success in processing files in parallel in this manner and, if so, can someone please point me in the right direction?
I could, as a workaround, make 27 For Each loops to handle each file individually and run those in parallel, but that seems like overkill.
I'd appreciate any help anyone could give.
April 13, 2016 at 8:10 am
richardmgreen1 (4/13/2016)
Hi all. I've got an SSIS package that loads 140+ files sequentially.
I want to make it process the files in parallel (for example, 4 at a time) by checking the file-name for certain characters.
All the files start SR and then the next letter is scattered throughout the alphabet with one file starting SR18.
I've currently got a For Each loop processing the files and, on the Collection section, it's set to pick up any CSV file.
To give an example, I'd want one For Each loop to process any file starting SRA*.csv to SRF*.csv (I'd probably include the file starting SR18*.csv in this list).
I'd then want a second For Each loop processing any files starting SRG*.csv to SRM*.csv.
etc, etc.
Has anyone had any success in processing files in parallel in this manner and, if so, can someone please point me in the right direction?
I could, as a workaround, make 27 For Each loops to handle each file individually and run those in parallel but that seems overkill.
I'd appreciate any help anyone could give.
Does each file have its own target table? Trying to load more than one file into the same table at the same time will cause you locking issues.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 13, 2016 at 8:15 am
Hi Phil
Yes, each file goes into its own table and is further processed by SQL separately.
April 13, 2016 at 8:34 am
richardmgreen1 (4/13/2016)
Hi Phil. Yes, each file goes into its own table and is further processed by SQL separately.
That's a good start.
Do the source files and target tables have the same formats? I mean column names and data types ...
What I'm wondering here is: can you (potentially) parameterise the source and target details for all files, using multiple identical (parallel) data flow tasks?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 13, 2016 at 8:40 am
The files contain different data (and headers) and are loaded via a BCP routine.
The targets are also dealt with in a similar fashion (the target SQL table name starts with tbl and is completed by the filename without the extension).
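As a small illustration only (the actual naming code isn't shown in the thread, and the file name below is made up), deriving that target table name from a file name could look something like this:

-- Illustration only: build the target table name as 'tbl' + filename without its extension.
DECLARE @FileName    sysname = 'SRA_Extract.csv';   -- hypothetical example file name
DECLARE @TargetTable sysname =
    'tbl' + LEFT(@FileName, LEN(@FileName) - CHARINDEX('.', REVERSE(@FileName)));

SELECT @TargetTable AS TargetTable;   -- returns tblSRA_Extract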
April 13, 2016 at 9:14 am
richardmgreen1 (4/13/2016)
The files contain different data (and headers) and are loaded via a BCP routine. The targets are also dealt with in a similar fashion (the target SQL table name starts with tbl and is completed by the filename without the extension).
As you did not mention BCP earlier, I assumed that the file imports were via SSIS data flows. Or perhaps that is what you are considering doing.
As the file formats are different, you will need separate data flows (at a minimum, one per file format).
These data flows can all run in parallel, once you have gone through the tedium of setting them up.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 13, 2016 at 6:08 pm
140+ files... each file goes to a separate table... etc...
It sounds more like a version of "poor man's replication" of a whole or nearly whole database. I'd recommend talking with the folks at the source of the data to see if you could get a backup file to restore from, instead of the pain you're currently going through.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 14, 2016 at 3:32 am
Hi Jeff
If only......
We get the CSV files as a data extract (I haven't got a clue what format the back end of the application runs on).
This is also a national system with information from various sources that are collated by the vendors. We are only allowed to see our data and that's what they send us.
April 15, 2016 at 1:54 am
Probably the best solution would be to put all 140+ file names into a queue table and then have 4 separate For Each Loops, each taking one record at a time from the queue (with appropriate controls to ensure each file is processed exactly once, etc.).
A simpler version would be to put all 140+ file names into a table and then have 4 separate For Each Loops which select rows from this list based on different WHERE clauses. The downside to this is that you may end up with unbalanced loops, with one loop processing lots of large files and another processing lots of small files.
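Not from the original post, but as a rough sketch of the "queue table with appropriate controls" idea: each of the 4 loops could claim one file at a time with an UPDATE ... OUTPUT query using READPAST, so two loops never pick up the same file. All table and column names here (dbo.FileQueue, FileName, Status, ClaimedBy) are invented for the example.

-- Hypothetical queue table; names are illustrative only.
CREATE TABLE dbo.FileQueue
(
    FileQueueID int IDENTITY(1,1) PRIMARY KEY,
    FileName    varchar(260) NOT NULL,
    Status      varchar(20)  NOT NULL DEFAULT 'Pending',   -- Pending / Processing / Done
    ClaimedBy   int          NULL                           -- which loop claimed the row
);

-- "Claim the next file" query, run repeatedly by each loop until no row comes back.
-- READPAST skips rows currently locked by the other loops, so a file is only claimed once.
DECLARE @thread int = 1;   -- in the package this would be mapped from a per-loop parameter

UPDATE TOP (1) q
SET    Status    = 'Processing',
       ClaimedBy = @thread
OUTPUT inserted.FileName
FROM   dbo.FileQueue AS q WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE  q.Status = 'Pending';

Each loop would run this in an Execute SQL Task, process the returned file, mark it Done, and go round again until the query returns nothing.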
Jeremy
April 15, 2016 at 3:16 am
Jez-448386 (4/15/2016)
Probably the best solution would be to put all 140+ file names into a queue table and then have 4 separate For Each Loops, each taking one record at a time from the queue (with appropriate controls to ensure each file is processed exactly once, etc.). A simpler version would be to put all 140+ file names into a table and then have 4 separate For Each Loops which select rows from this list based on different WHERE clauses. The downside to this is that you may end up with unbalanced loops, with one loop processing lots of large files and another processing lots of small files.
Jeremy
How would you propose handling the fact that the file formats are different, doing it this way?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 15, 2016 at 7:57 am
The OP already has an SSIS package that loads 140+ files sequentially.
I doubt that there are 140+ separate tasks, so I guess that this has already been solved.
Jez
April 15, 2016 at 9:33 am
Jez-448386 (4/15/2016)
The OP already has an SSIS package that loads 140+ files sequentially. I doubt that there are 140+ separate tasks, so I guess that this has already been solved.
Jez
You mean this?
richardmgreen1 (4/13/2016)
The files contain different data (and headers) and are loaded via a BCP routine. The targets are also dealt with in a similar fashion (the target SQL table name starts with tbl and is completed by the filename without the extension).
OK, how does the BCP routine work with your idea?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 17, 2016 at 6:34 am
You can do it, although I would not use SSIS to control the execution of the loading. For that I would instead consider a C# or PowerShell script that retrieves the list of files to load and executes the loads by spawning processes in parallel; the load process itself could be an SSIS package or a BCP command. I have this working at a client with PowerShell and it is used extensively. The last process we built with this framework had 200+ files, each requiring a different SSIS package to process it, all driven through config files.
To do it in SSIS, I would consider this approach.
Set up a table on your staging DB to hold the filenames and a sort priority.
The sort priority would be used to determine the order the files are processed. If sizes vary a lot, and one day a file takes 5 minutes to load while on another day it takes 30 minutes, I would instead consider the time last loaded as the sort priority, but this is a minor detail.
The idea of the sort order is to spread the files that take the longest to process over all the threads, instead of having them eventually allocated to the same thread. With SSIS this is still going to be a problem, but it can be minimized.
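A possible shape for that control table (all names invented for the sketch, and the file names are just placeholders) would be along these lines:

-- Hypothetical control table for the staging DB; names are illustrative only.
CREATE TABLE dbo.FileLoadList
(
    filename varchar(260) NOT NULL PRIMARY KEY,
    sort     int          NOT NULL   -- sort priority: slowest/largest files get the lowest numbers
);

INSERT INTO dbo.FileLoadList (filename, sort)
VALUES ('SRA_Extract.csv', 1),   -- placeholder file names
       ('SRB_Extract.csv', 2),
       ('SR18_Extract.csv', 3);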
Within the package itself, set up 10 Execute SQL Tasks and a For Each Loop container attached to each of the SQL tasks.
Each SQL task would be set up to retrieve a list of filenames based on a thread ID that is different for each task,
so SQL task 1 would retrieve thread 1, SQL task 2 would retrieve thread 2, and so on.
The package would also have, defined as a parameter, the number of threads to run concurrently, so you can run anything from 1 to 10 threads based on your load pressure.
Sample SQL for each task:
select filename
from (
    select t.filename
         , row_number() over (order by t.sort) % @num_threads + 1 as thread_num
    from (
        values
          ('file1', 3)
        , ('file2', 1)
        , ('file3', 1)
        , ('file4', 1)
    ) t (filename, sort)
) t2
where t2.thread_num = @requested_thread
The SQL task would pass as parameters the desired number of threads to use (@num_threads) and which thread list to process (@requested_thread).
The output goes into an object variable per thread, which is then used by the loop container to process each filename retrieved.
The issue with doing this in pure SSIS is what to do if one file fails. Should you stop the remaining files from being processed? Normally you would still process the remaining files and then try to reprocess the failing one once the error is fixed. But this also means your SSIS package needs to allow for errors and not stop processing on the first error.
The SQL statement could also be used to retrieve more information about how to process the file - e.g. it could supply a package to execute or a bcp command to execute over the file.
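Extending the earlier sample along those lines (the extra load_command column and the bcp commands are only illustrative), the per-thread query could return whatever the loop needs to run each load:

-- Illustrative only: return a load command alongside each filename for the requested thread.
DECLARE @num_threads      int = 4;   -- mapped from a package parameter in practice
DECLARE @requested_thread int = 1;   -- mapped from a package parameter in practice

SELECT t2.filename, t2.load_command
FROM (
    SELECT t.filename
         , t.load_command
         , ROW_NUMBER() OVER (ORDER BY t.sort) % @num_threads + 1 AS thread_num
    FROM (
        VALUES
          ('file1.csv', 3, 'bcp StagingDB.dbo.tblfile1 in file1.csv -c -T -S myserver')
        , ('file2.csv', 1, 'bcp StagingDB.dbo.tblfile2 in file2.csv -c -T -S myserver')
        , ('file3.csv', 1, 'bcp StagingDB.dbo.tblfile3 in file3.csv -c -T -S myserver')
        , ('file4.csv', 2, 'bcp StagingDB.dbo.tblfile4 in file4.csv -c -T -S myserver')
    ) t (filename, sort, load_command)
) t2
WHERE t2.thread_num = @requested_thread;

The For Each Loop could then hand load_command to an Execute Process Task (or pick a child package to run) instead of hard-coding the load logic per file.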
April 17, 2016 at 8:27 am
richardmgreen1 (4/14/2016)
Hi Jeff. If only......
We get the CSV files as a data extract (I haven't got a clue what format the back end of the application runs on).
This is also a national system with information from various sources that are collated by the vendors. We are only allowed to see our data and that's what they send us.
Ah. Got it. Thanks for the feedback.
Something that you should be made aware of... your good efforts in this area may be totally in vain and may actually cause your loads to take longer. How is that possible?
The same thing can happen to you as when people try to improve performance of backups by splitting the backup across multiple files. If you can GUARANTEE that the files you're writing to are on separate spindles and, therefore, on separate read/write heads/arms, only then will you realize a performance gain. If not, you'll actually slow things down because the read/write heads/arms will have to move from one file to another, and such movement takes a relatively HUGE amount of time compared to writing just one file in a sequential manner.
Splitting your 140 files to parallel processes will only bring more CPUs into play and CPU time is NOT the bottleneck when it comes to writing to disk. Before you spend any more time trying for parallel loads, you need to find out if you have separate/parallel disk spindles that will come into play and it's best if you don't leave it up to the luck of the draw as to which spindles will be used on the SAN or whatever storage device you may have.
If you do want to take such a chance, then start out with an experiment. Create 4 large files that take at least a minute each to load. They can even be the same file data with different file names just to make life easier. Try the serial load and measure the duration. Then try the parallel load and see if you get any performance gain. Then, try it 3 or 4 more times and see if the performance gain is 1) large enough to be worth it and 2) is repeatable.
I could be wrong but I'd bet that the performance gain won't happen or will be small enough to not make it worth it unless you can guarantee writes to 4 separate spindles. Might not even be worth it then because there's a real chance that your ETL system isn't the only show in town and other processes will be using the same read/write heads/arms. You can't change physics.
And, yes... I'm absolutely suggesting that this test be executed on the production box using the same drive "letters" as would be used if you modified the current system to run in parallel. It's the only way that you'll know if the other traffic on the disks will interfere or not or... if the parallel system will "freeze" or otherwise interfere with normal traffic on the disks.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 18, 2016 at 4:14 am
@frederico - we did have a think about the load balancing option by using some sort of load table to pick up the next available file but couldn't quite figure it out. I'll see what I can do with your idea.
@jeff - I didn't even think about the disks/spindles issue. I'm not sure where on the SAN our drives sit, so that's another variable to put in the mix. Just a thought - if we could guarantee the use of SSDs on the SAN (I think we've got some), would that help?