Asynchronous processing in SSIS

Question

Asynchronous processing in SSIS

Viewing 9 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply

stricknyn SSCertifiable Points: 6560 More actions · Answer 1

Elliott W (9/22/2009)
Ok, here is a really quick and dirty POC. How many Dataflows run depends on the variable "ExecutionThreadCount", it can be up to 5. I even added some extra variables so that the start and end ids for each flow could be passed. You could fill those in in the placeholder. Which is really important because it handles the precendence.. Give it a look..
Take the TXT off the end, DTSX isn't normally an allowed type..
CEWII

Thanks! I'll check it out.

Strick

John Rowan SSC Guru Points: 56440 More actions · Answer 2

Elliott W (9/22/2009)
Ok, here is a really quick and dirty POC. How many Dataflows run depends on the variable "ExecutionThreadCount", it can be up to 5. I even added some extra variables so that the start and end ids for each flow could be passed. You could fill those in in the placeholder. Which is really important because it handles the precendence.. Give it a look..
Take the TXT off the end, DTSX isn't normally an allowed type..
CEWII

Right, but how do you plan to divide the working data set for the data flows?

John Rowan

======================================================
======================================================
Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

Elliott Whitlow SSC Guru Points: 102402 More actions · Answer 3

John Rowan (9/22/2009)

Right, but how do you plan to divide the working data set for the data flows?

The package includes 10 variables, StartId and EndId for each of 5 dataflows. It is crude but it provides the basis to show how it could work. There are other ways. The package doesn't make any decisions, the caller says how many "threads" and how they will be broken up.

CEWII

John Rowan SSC Guru Points: 56440 More actions · Answer 4

Ah, I didn't see that. OK, I just wanted to see if that was being accounted for. It seemed like we were on different pages there for a bit :-).

That certainly would work.

John Rowan

======================================================
======================================================
Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

stricknyn SSCertifiable Points: 6560 More actions · Answer 5

John Rowan (9/22/2009)
Elliott W (9/22/2009)
Ok, here is a really quick and dirty POC. How many Dataflows run depends on the variable "ExecutionThreadCount", it can be up to 5. I even added some extra variables so that the start and end ids for each flow could be passed. You could fill those in in the placeholder. Which is really important because it handles the precendence.. Give it a look..
Take the TXT off the end, DTSX isn't normally an allowed type..
CEWII
Right, but how do you plan to divide the working data set for the data flows?

Just an idea so bear with me I'm talking it out loud but here's how I think I'm going to do it:

1)In my stored procedure there will be a parameter. If its 1 it will just return @rowcount, if its 2 it will actually return the data. This is so I don't have to query all that data.

2)In a execute SQL task I'd retrieve the @rowcount from the sp.

3) In a script task (still before the data flows) variable, I'd divide this by the number of "chunks" to get an amount for each chunk.

4) Haven't thought through the query part, thoroughly but each data flow would use this info in its source to query its portion

Again just thinking out loud. I haven't thought through the "how" part. I'm more or less been thinking high level on this so far.

Thanks,

Strick

John Rowan SSC Guru Points: 56440 More actions · Answer 6

You'll need to be careful with #1. SSIS is very metadata dependant and a stored procedure that returns a different result set based on parameter values can cause problems. How about passing the thread count into the SP and then have the SP break the master result set into chunks, or groups, and assign the chunk/group number as part of the result set. That group number can then be used to route the data to the proper data flow.

Alternativly, use Elliot's suggestion and have the SP return the MIN/MAX key values for the rows for each data flow.

John Rowan

======================================================
======================================================
Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

Elliott Whitlow SSC Guru Points: 102402 More actions · Answer 7

stricknyn (9/22/2009)
Just an idea so bear with me I'm talking it out loud but here's how I think I'm going to do it:
1)In my stored procedure there will be a parameter. If its 1 it will just return @rowcount, if its 2 it will actually return the data. This is so I don't have to query all that data.
2)In a execute SQL task I'd retrieve the @rowcount from the sp.
3) In a script task (still before the data flows) variable, I'd divide this by the number of "chunks" to get an amount for each chunk.
4) Haven't thought through the query part, thoroughly but each data flow would use this info in its source to query its portion
Again just thinking out loud. I haven't thought through the "how" part. I'm more or less been thinking high level on this so far.
Thanks,
Strick

First I'm not a big fan of Modality... as in I pass in this value and get one return and then I pass in this value and get another..

Ok, so lets look at this a bit further.. I see that you are proposing an auto method to do this.

I see a stored procedure that gives you all of this information:

1. Number of threads up to 10.

2. Start and end id for each thread.

We then assign these values. Let me see what I can do. I'll post again after bit..

CEWII

stricknyn SSCertifiable Points: 6560 More actions · Answer 8

John Rowan (9/22/2009)
You'll need to be careful with #1. SSIS is very metadata dependant and a stored procedure that returns a different result set based on parameter values can cause problems. How about passing the thread count into the SP and then have the SP break the master result set into chunks, or groups, and assign the chunk/group number as part of the result set. That group number can then be used to route the data to the proper data flow.
Alternativly, use Elliot's suggestion and have the SP return the MIN/MAX key values for the rows for each data flow.

Hmmm..kinda like this one. Alot less work for SSIS to do. Correct me if I'm wrong, but the only thing with this is that each data flow would query the entire 2 million rows and assign the chunk group instead of just what it needs.

Thanks

Strick

Elliott Whitlow SSC Guru Points: 102402 More actions · Answer 9

Ok, this is kind of rough but it generally does what we want.. The big thing is the query near the beginning, this would be where you would put YOUR query to fill in the variables. The rest calculates start/end for up to 5 flows. It assumes 100,000 record flows, but don't worry, if you have too for that, like 2M rows.. It handles that.. You can map the output directly into the theadcount and start/end variables.

IF EXISTS (SELECT * FROM sysobjects WHERE name = 'GetPackageParameters' AND type = 'P')

DROP PROCEDURE dbo.GetPackageParameters

GO

CREATE PROCEDURE dbo.GetPackageParameters

AS

/*

*

* GetPackageParameters

*

* Author(s): Me

*

* Change Log: Initial Release, 09/22/2009, 1.00.0000

*

*/

BEGIN

SET NOCOUNT ON

SET XACT_ABORT ON

DECLARE @TTLRowCount int,

@MinId int,

@MaxId int,

@ThreadCount int,

@Range int

---- Some query to generate a count with min/max ids..

SELECT @TTLRowCount = COUNT(*),

@MinId = MIN(RecId),

@MaxId = MAX(RecId)

FROM dbo.SomeDataTableWithLotsOfRows

WHERE SomeField = SomeValue

-- -- Test code

--SELECT @TTLRowCount = 500000, --499999,

-- @MinId = 5500,

-- @MaxId = 550000

-- Calc threadcount

SELECT @ThreadCount = @TTLRowCount / 100000

-- Handle case of small count, but some rows..

IF ( @ThreadCount 0 ) SET @ThreadCount = 1

-- If calculated threadcount is high force to max

IF ( @ThreadCount > 5 ) SET @ThreadCount = 5

IF ( @ThreadCount = 1 )

BEGIN -- No additional work..

SELECT ThreadCount = 1,

Flow1StartId = @MinId,

Flow1EndId = @MaxId,

Flow2StartId = NULL,

Flow2EndId = NULL,

Flow3StartId = NULL,

Flow3EndId = NULL,

Flow4StartId = NULL,

Flow4EndId = NULL,

Flow5StartId = NULL,

Flow5EndId = NULL

END

ELSE

BEGIN -- Calc start and end for each..

-- Calc the range

SELECT @Range = ( @MaxId - @MinId ) /@ThreadCount

-- Build little work table

DECLARE @IDS TABLE ( Flow1StartId int,

Flow1EndId int,

Flow2StartId int,

Flow2EndId int,

Flow3StartId int,

Flow3EndId int,

Flow4StartId int,

Flow4EndId int,

Flow5StartId int,

Flow5EndId int )

-- Build the work data

INSERT @IDS

SELECT @MinId,

@MinId + @Range,

@MinId + @Range + 1,

@MinId + ( @Range * 2 ),

@MinId + ( @Range * 2 ) + 1,

@MinId + ( @Range * 3 ),

@MinId + ( @Range * 3 ) + 1,

@MinId + ( @Range * 4 ),

@MinId + ( @Range * 4 ) + 1,

@MinId + ( @Range * 5 )

-- Make sure we know the end..

UPDATE @IDS

SET Flow1EndId = CASE WHEN @ThreadCount = 1 THEN @MaxId ELSE Flow1EndId END,

Flow2EndId = CASE WHEN @ThreadCount = 2 THEN @MaxId ELSE Flow2EndId END,

Flow3EndId = CASE WHEN @ThreadCount = 3 THEN @MaxId ELSE Flow3EndId END,

Flow4EndId = CASE WHEN @ThreadCount = 4 THEN @MaxId ELSE Flow4EndId END,

Flow5EndId = CASE WHEN @ThreadCount = 5 THEN @MaxId ELSE Flow5EndId END

FROM @IDS

-- Clear out end fields

UPDATE @IDS

SET Flow2StartId = CASE WHEN @ThreadCount < 2 THEN NULL ELSE Flow2StartId END,

Flow2EndId = CASE WHEN @ThreadCount < 2 THEN NULL ELSE Flow2EndId END,

Flow3StartId = CASE WHEN @ThreadCount < 3 THEN NULL ELSE Flow3StartId END,

Flow3EndId = CASE WHEN @ThreadCount < 3 THEN NULL ELSE Flow3EndId END,

Flow4StartId = CASE WHEN @ThreadCount < 4 THEN NULL ELSE Flow4StartId END,

Flow4EndId = CASE WHEN @ThreadCount < 4 THEN NULL ELSE Flow4EndId END,

Flow5StartId = CASE WHEN @ThreadCount < 5 THEN NULL ELSE Flow5StartId END,

Flow5EndId = CASE WHEN @ThreadCount < 5 THEN NULL ELSE Flow5EndId END

FROM @IDS

-- Return it all

SELECT ThreadCount = @ThreadCount,

Flow1StartId,

Flow1EndId,

Flow2StartId,

Flow2EndId,

Flow3StartId,

Flow3EndId,

Flow4StartId,

Flow4EndId,

Flow5StartId,

Flow5EndId

FROM @IDS

END

GO

CEWII