Can I parallelise Foreach tasks?

  • Part of our ETL package loads a number of large data files into one table. The files all have the same format, but the file names change each run, and are given in a list loaded in a previous step. So the obvious way to do this is in a data flow in a Foreach file loop, which works fine.

    Except that it's a serial process, and it's taking longer as the size and number of files grow. Our server has fast disks and two 4-core hyperthreaded Xeon processors, so it can run 16 threads at once - but while this load is running, one thread runs at 100% and all the others sit idle. SSIS is great at multi-threading when a data flow has multiple paths, but not within a Foreach loop.

    Is there a practical way to split the data flow within the Foreach into 4 or 8 threads, or to fire off a data flow task asynchronously? I've thought of using ExecProc(CodePath) to fire off multiple completely independent DTExec.exe tasks, but that seems rather top-heavy.

    The other question is whether there is an easy way to re-synchronise after all the files are loaded, before starting a step which uses the loaded data.

  • The answer is yes, though there is a lot to it. One of the bigger issues is breaking the list up into chunks if there are no natural breaks.

    I'll be able to write more later.

    CEWII

  • Thanks. I'm about to read Part 2 (which was highlighted in today's email from SQLServerCentral).

    I solved my problem after thinking about it more, inspired by Part 1 of your article. I built a system that saves the list of files in a table of ([Filename], [InUse]), which is then read by 8 identical copies of my data flow. Each copy starts with a script component that acts as the data source: it gets a file name to read, does some processing on some fields, and sends the result to the output buffer. I already had one copy of that data flow; the new part is that each copy now runs a While loop that gets files to process using EXEC GetFILELIST, which runs this SQL to get the next name that's not already in use:

    WITH Nextfile AS
    (
        SELECT TOP 1 [Filename] AS Next
        FROM [dbo].[FILELIST]
        WHERE [InUse] = 0
    )
    UPDATE [dbo].[FILELIST]
    SET [InUse] = 1
    OUTPUT INSERTED.[Filename]
    FROM [dbo].[FILELIST] fl WITH (READCOMMITTEDLOCK, READPAST)
    INNER JOIN Nextfile ON fl.[Filename] = Next;
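
    For reference, the FILELIST table itself is trivial - roughly along these lines (the exact column types shown here are illustrative rather than my exact DDL):

    -- Rough sketch of the control table; exact types and sizes may differ.
    CREATE TABLE [dbo].[FILELIST]
    (
        [Filename] nvarchar(260) NOT NULL PRIMARY KEY,  -- full path of a file to load
        [InUse]    bit           NOT NULL DEFAULT (0)   -- 0 = unclaimed, 1 = claimed by a thread
    );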

    At first I was getting lots of deadlocks, so I put the whole thing in a TRY/CATCH block to retry on deadlock. I then added the WITH (READCOMMITTEDLOCK, READPAST) hint, which makes the query skip rows that have locks on them, so it reads the next row that is NOT already being read or updated by another thread.

    This works great, as each thread processes as many files as it can until they're all done. Now I'm going to read your article Part 2 and see if there's an even better way to do it!

    My full GetFILELIST procedure, including the retry and logging (via my ToLOG proc), has grown a bit; if anyone's interested, here's the full thing:

    -- Create Procedure GetFILELIST - gets the next file from the list (or nothing), and handles deadlocks
    CREATE PROCEDURE [dbo].[GetFILELIST]
        ( @ClientId nchar(2) )
    AS
    -- Get the next file name from the FILELIST table.
    -- Retry on error 1205, based on http://msdn.microsoft.com/en-us/library/ms179296.aspx
    DECLARE @retry INT = 5;            -- Try up to 5 times (1 + 4 retries)
    DECLARE @msg varchar(100);         -- Message text passed to the ToLOG logging proc

    WHILE (@retry > 0)                 -- Keep trying to update the table if this task is selected as the deadlock victim
    BEGIN
        BEGIN TRY
            BEGIN TRANSACTION;

            WITH Nextfile AS
            (
                SELECT TOP 1 [Filename] AS Next
                FROM [dbo].[FILELIST]
                WHERE [InUse] = 0
            )
            UPDATE [dbo].[FILELIST]
            SET [InUse] = 1
            OUTPUT INSERTED.[Filename]
            FROM [dbo].[FILELIST] fl WITH (READCOMMITTEDLOCK, READPAST)
            INNER JOIN Nextfile ON fl.[Filename] = Next;

            SET @retry = 0;            -- All done
            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            -- Error! If it was a deadlock, try again - otherwise die now
            IF (ERROR_NUMBER() = 1205)
            BEGIN
                SET @retry -= 1;
                IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;

                IF @retry > 0
                BEGIN
                    SET @msg = 'Deadlock! Retries remaining: ' + CONVERT(VARCHAR(6), @retry);
                    EXEC [dbo].[ToLOG] @ClientId, 0, 'GetFILELIST', @msg;
                    WAITFOR DELAY '00:00:01';   -- Wait before trying again
                END
                ELSE
                    EXEC [dbo].[ToLOG] @ClientId, 6, 'GetFILELIST', 'Deadlock retry limit reached';
            END
            ELSE
            BEGIN
                SET @retry = -1;
                IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;

                SET @msg = 'Error ' + CONVERT(VARCHAR(6), ERROR_NUMBER()) + ' ' + ERROR_MESSAGE();
                EXEC [dbo].[ToLOG] @ClientId, 6, 'GetFILELIST', @msg;
            END
        END CATCH;
    END; -- End of WHILE retry loop
    GO

    All that just to get a file name for my script component to use - but it does it reliably, and I learnt a lot doing it!
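
    In case it helps anyone, the calling pattern each run looks roughly like this (the file names and @ClientId value below are made up for the example):

    -- The control-flow script clears and repopulates the list at the start of each run
    -- (file names here are placeholders):
    TRUNCATE TABLE [dbo].[FILELIST];

    INSERT INTO [dbo].[FILELIST] ([Filename], [InUse])
    VALUES (N'\\fileserver\etl\load_001.dat', 0),
           (N'\\fileserver\etl\load_002.dat', 0);

    -- Then each of the 8 script components repeatedly claims a file:
    EXEC [dbo].[GetFILELIST] @ClientId = N'01';
    -- This returns one row containing the claimed [Filename], or an empty result set
    -- once every file has been taken - which is the signal for that thread to stop.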

  • I generally use the control structures to manage concurrency where I can; I only lock them long enough to get the work to do and to mark that I have it. I try to lock the pages in the table for as short a time as possible, for maximum concurrency.

    And I treat the control structures separately from my app structures: I get the work from the control structure, perform it on the app structures, and then go back to the control structure and say that I completed the work.

    CEWII

    If by Control and App you mean control flow and data flow, I more or less do that. The file list is built and loaded into the FILELIST table by a script in the control flow; then the components in the data flow each run the SELECT+UPDATE transaction to get a unique next file name - and the query should indeed be fast.

    I'm now getting 50-60% CPU occupancy on our server, which means all 8 cores are busy, sometimes with more than one thread. Before, I was getting 6-7% (just one core busy). And in practical terms, having each thread load as many files as it can works nicely too. The files are of varying sizes, so one thread might load just one big file and another several small ones, but the total number of rows each thread loads ends up about the same. Total elapsed time is down to about a third of what it was; it isn't better than that only because the other steps in the process are unaffected.

  • I was unclear, then. By app structures I mean whatever you are taking actions on; there can and probably will be locking/blocking on these as each thread runs. You reference the app structures by records in the control structure.

    I use the control structure to manage concurrency. In my example you have a table with an ID in it along with some datetime columns - a really simplified view, but it accurately portrays what I'm talking about. The control table does NOTHING but dispense work to do, keep track of what has completed, and make sure that a failed process gets retried. All of these events are very discrete. You get a unit of work by locking it in the control table, which also provides for restartability of a failed id. The locking of the record would be considered a semaphore lock, meaning the table/record is only locked briefly to record the fact that it is taken, and all processes honor this scheme.

    An example would be that the ids are a list of customer ids to run these processes against - a workload that set-based processing isn't good for. The id value is the link between the control structure and the app structures.
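
    A bare-bones sketch of what I mean (the table and column names here are made up purely for illustration):

    -- Control structure: does nothing but dispense work, track completion, and allow retry of failures.
    CREATE TABLE dbo.WorkControl
    (
        CustomerId  int      NOT NULL PRIMARY KEY,  -- the link to the app structures
        ClaimedAt   datetime NULL,                  -- set when a thread takes this unit of work
        CompletedAt datetime NULL                   -- set when that thread reports success
    );

    -- Semaphore-style claim: the record is locked only long enough to mark it as taken.
    DECLARE @CustomerId int;
    DECLARE @claimed TABLE (CustomerId int);

    UPDATE TOP (1) dbo.WorkControl WITH (READPAST, ROWLOCK)
    SET    ClaimedAt = GETDATE()
    OUTPUT INSERTED.CustomerId INTO @claimed
    WHERE  ClaimedAt IS NULL;

    SELECT @CustomerId = CustomerId FROM @claimed;

    -- ...perform the real work against the app structures for @CustomerId...

    -- Then go back to the control structure and record that the work completed.
    UPDATE dbo.WorkControl
    SET    CompletedAt = GETDATE()
    WHERE  CustomerId = @CustomerId;

    -- A failed process shows up as ClaimedAt set but CompletedAt still NULL,
    -- so it can be reset and retried.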

    Is that any clearer?

    Also, I'm glad to see that your utilization of your H/W has gone up so significantly; how have the run-times been affected?

    CEWII
