Asynchronous processing in SSIS

  • Elliott W (9/22/2009)


    Ok, here is a really quick and dirty POC. How many Dataflows run depends on the variable "ExecutionThreadCount", it can be up to 5. I even added some extra variables so that the start and end ids for each flow could be passed. You could fill those in in the placeholder. Which is really important because it handles the precendence.. Give it a look..

    Take the TXT off the end, DTSX isn't normally an allowed type..

    CEWII

    Thanks! I'll check it out.

    Strick

  • Elliott W (9/22/2009)


    Ok, here is a really quick and dirty POC. How many Dataflows run depends on the variable "ExecutionThreadCount", it can be up to 5. I even added some extra variables so that the start and end ids for each flow could be passed. You could fill those in in the placeholder. Which is really important because it handles the precendence.. Give it a look..

    Take the TXT off the end, DTSX isn't normally an allowed type..

    CEWII

    Right, but how do you plan to divide the working data set for the data flows?

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • John Rowan (9/22/2009)


    Right, but how do you plan to divide the working data set for the data flows?

    The package includes 10 variables, StartId and EndId for each of 5 dataflows. It is crude but it provides the basis to show how it could work. There are other ways. The package doesn't make any decisions, the caller says how many "threads" and how they will be broken up.

    CEWII

  • Ah, I didn't see that. OK, I just wanted to see if that was being accounted for. It seemed like we were on different pages there for a bit :-).

    That certainly would work.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • John Rowan (9/22/2009)


    Elliott W (9/22/2009)


    Ok, here is a really quick and dirty POC. How many Dataflows run depends on the variable "ExecutionThreadCount", it can be up to 5. I even added some extra variables so that the start and end ids for each flow could be passed. You could fill those in in the placeholder. Which is really important because it handles the precendence.. Give it a look..

    Take the TXT off the end, DTSX isn't normally an allowed type..

    CEWII

    Right, but how do you plan to divide the working data set for the data flows?

    Just an idea so bear with me I'm talking it out loud but here's how I think I'm going to do it:

    1)In my stored procedure there will be a parameter. If its 1 it will just return @rowcount, if its 2 it will actually return the data. This is so I don't have to query all that data.

    2)In a execute SQL task I'd retrieve the @rowcount from the sp.

    3) In a script task (still before the data flows) variable, I'd divide this by the number of "chunks" to get an amount for each chunk.

    4) Haven't thought through the query part, thoroughly but each data flow would use this info in its source to query its portion

    Again just thinking out loud. I haven't thought through the "how" part. I'm more or less been thinking high level on this so far.

    Thanks,

    Strick

  • You'll need to be careful with #1. SSIS is very metadata dependant and a stored procedure that returns a different result set based on parameter values can cause problems. How about passing the thread count into the SP and then have the SP break the master result set into chunks, or groups, and assign the chunk/group number as part of the result set. That group number can then be used to route the data to the proper data flow.

    Alternativly, use Elliot's suggestion and have the SP return the MIN/MAX key values for the rows for each data flow.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • stricknyn (9/22/2009)


    Just an idea so bear with me I'm talking it out loud but here's how I think I'm going to do it:

    1)In my stored procedure there will be a parameter. If its 1 it will just return @rowcount, if its 2 it will actually return the data. This is so I don't have to query all that data.

    2)In a execute SQL task I'd retrieve the @rowcount from the sp.

    3) In a script task (still before the data flows) variable, I'd divide this by the number of "chunks" to get an amount for each chunk.

    4) Haven't thought through the query part, thoroughly but each data flow would use this info in its source to query its portion

    Again just thinking out loud. I haven't thought through the "how" part. I'm more or less been thinking high level on this so far.

    Thanks,

    Strick

    First I'm not a big fan of Modality... as in I pass in this value and get one return and then I pass in this value and get another..

    Ok, so lets look at this a bit further.. I see that you are proposing an auto method to do this.

    I see a stored procedure that gives you all of this information:

    1. Number of threads up to 10.

    2. Start and end id for each thread.

    We then assign these values. Let me see what I can do. I'll post again after bit..

    CEWII

  • John Rowan (9/22/2009)


    You'll need to be careful with #1. SSIS is very metadata dependant and a stored procedure that returns a different result set based on parameter values can cause problems. How about passing the thread count into the SP and then have the SP break the master result set into chunks, or groups, and assign the chunk/group number as part of the result set. That group number can then be used to route the data to the proper data flow.

    Alternativly, use Elliot's suggestion and have the SP return the MIN/MAX key values for the rows for each data flow.

    Hmmm..kinda like this one. Alot less work for SSIS to do. Correct me if I'm wrong, but the only thing with this is that each data flow would query the entire 2 million rows and assign the chunk group instead of just what it needs.

    Thanks

    Strick

  • Ok, this is kind of rough but it generally does what we want.. The big thing is the query near the beginning, this would be where you would put YOUR query to fill in the variables. The rest calculates start/end for up to 5 flows. It assumes 100,000 record flows, but don't worry, if you have too for that, like 2M rows.. It handles that.. You can map the output directly into the theadcount and start/end variables.

    IF EXISTS (SELECT * FROM sysobjects WHERE name = 'GetPackageParameters' AND type = 'P')

    DROP PROCEDURE dbo.GetPackageParameters

    GO

    CREATE PROCEDURE dbo.GetPackageParameters

    AS

    /*

    *

    * GetPackageParameters

    *

    * Author(s): Me

    *

    * Change Log: Initial Release, 09/22/2009, 1.00.0000

    *

    */

    BEGIN

    SET NOCOUNT ON

    SET XACT_ABORT ON

    DECLARE @TTLRowCount int,

    @MinId int,

    @MaxId int,

    @ThreadCount int,

    @Range int

    ---- Some query to generate a count with min/max ids..

    SELECT @TTLRowCount = COUNT(*),

    @MinId = MIN(RecId),

    @MaxId = MAX(RecId)

    FROM dbo.SomeDataTableWithLotsOfRows

    WHERE SomeField = SomeValue

    -- -- Test code

    --SELECT @TTLRowCount = 500000, --499999,

    -- @MinId = 5500,

    -- @MaxId = 550000

    -- Calc threadcount

    SELECT @ThreadCount = @TTLRowCount / 100000

    -- Handle case of small count, but some rows..

    IF ( @ThreadCount 0 ) SET @ThreadCount = 1

    -- If calculated threadcount is high force to max

    IF ( @ThreadCount > 5 ) SET @ThreadCount = 5

    IF ( @ThreadCount = 1 )

    BEGIN -- No additional work..

    SELECT ThreadCount = 1,

    Flow1StartId = @MinId,

    Flow1EndId = @MaxId,

    Flow2StartId = NULL,

    Flow2EndId = NULL,

    Flow3StartId = NULL,

    Flow3EndId = NULL,

    Flow4StartId = NULL,

    Flow4EndId = NULL,

    Flow5StartId = NULL,

    Flow5EndId = NULL

    END

    ELSE

    BEGIN -- Calc start and end for each..

    -- Calc the range

    SELECT @Range = ( @MaxId - @MinId ) /@ThreadCount

    -- Build little work table

    DECLARE @IDS TABLE ( Flow1StartId int,

    Flow1EndId int,

    Flow2StartId int,

    Flow2EndId int,

    Flow3StartId int,

    Flow3EndId int,

    Flow4StartId int,

    Flow4EndId int,

    Flow5StartId int,

    Flow5EndId int )

    -- Build the work data

    INSERT @IDS

    SELECT @MinId,

    @MinId + @Range,

    @MinId + @Range + 1,

    @MinId + ( @Range * 2 ),

    @MinId + ( @Range * 2 ) + 1,

    @MinId + ( @Range * 3 ),

    @MinId + ( @Range * 3 ) + 1,

    @MinId + ( @Range * 4 ),

    @MinId + ( @Range * 4 ) + 1,

    @MinId + ( @Range * 5 )

    -- Make sure we know the end..

    UPDATE @IDS

    SET Flow1EndId = CASE WHEN @ThreadCount = 1 THEN @MaxId ELSE Flow1EndId END,

    Flow2EndId = CASE WHEN @ThreadCount = 2 THEN @MaxId ELSE Flow2EndId END,

    Flow3EndId = CASE WHEN @ThreadCount = 3 THEN @MaxId ELSE Flow3EndId END,

    Flow4EndId = CASE WHEN @ThreadCount = 4 THEN @MaxId ELSE Flow4EndId END,

    Flow5EndId = CASE WHEN @ThreadCount = 5 THEN @MaxId ELSE Flow5EndId END

    FROM @IDS

    -- Clear out end fields

    UPDATE @IDS

    SET Flow2StartId = CASE WHEN @ThreadCount < 2 THEN NULL ELSE Flow2StartId END,

    Flow2EndId = CASE WHEN @ThreadCount < 2 THEN NULL ELSE Flow2EndId END,

    Flow3StartId = CASE WHEN @ThreadCount < 3 THEN NULL ELSE Flow3StartId END,

    Flow3EndId = CASE WHEN @ThreadCount < 3 THEN NULL ELSE Flow3EndId END,

    Flow4StartId = CASE WHEN @ThreadCount < 4 THEN NULL ELSE Flow4StartId END,

    Flow4EndId = CASE WHEN @ThreadCount < 4 THEN NULL ELSE Flow4EndId END,

    Flow5StartId = CASE WHEN @ThreadCount < 5 THEN NULL ELSE Flow5StartId END,

    Flow5EndId = CASE WHEN @ThreadCount < 5 THEN NULL ELSE Flow5EndId END

    FROM @IDS

    -- Return it all

    SELECT ThreadCount = @ThreadCount,

    Flow1StartId,

    Flow1EndId,

    Flow2StartId,

    Flow2EndId,

    Flow3StartId,

    Flow3EndId,

    Flow4StartId,

    Flow4EndId,

    Flow5StartId,

    Flow5EndId

    FROM @IDS

    END

    END

    GO

    CEWII

Viewing 9 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply