September 22, 2009 at 12:00 pm
Elliott W (9/22/2009)
Ok, here is a really quick and dirty POC. How many Dataflows run depends on the variable "ExecutionThreadCount", it can be up to 5. I even added some extra variables so that the start and end ids for each flow could be passed. You could fill those in in the placeholder. Which is really important because it handles the precendence.. Give it a look..Take the TXT off the end, DTSX isn't normally an allowed type..
CEWII
Thanks! I'll check it out.
Strick
September 22, 2009 at 12:18 pm
Elliott W (9/22/2009)
Ok, here is a really quick and dirty POC. How many Dataflows run depends on the variable "ExecutionThreadCount", it can be up to 5. I even added some extra variables so that the start and end ids for each flow could be passed. You could fill those in in the placeholder. Which is really important because it handles the precendence.. Give it a look..Take the TXT off the end, DTSX isn't normally an allowed type..
CEWII
Right, but how do you plan to divide the working data set for the data flows?
September 22, 2009 at 12:47 pm
John Rowan (9/22/2009)
Right, but how do you plan to divide the working data set for the data flows?
The package includes 10 variables, StartId and EndId for each of 5 dataflows. It is crude but it provides the basis to show how it could work. There are other ways. The package doesn't make any decisions, the caller says how many "threads" and how they will be broken up.
CEWII
September 22, 2009 at 12:54 pm
Ah, I didn't see that. OK, I just wanted to see if that was being accounted for. It seemed like we were on different pages there for a bit :-).
That certainly would work.
September 22, 2009 at 12:57 pm
John Rowan (9/22/2009)
Elliott W (9/22/2009)
Ok, here is a really quick and dirty POC. How many Dataflows run depends on the variable "ExecutionThreadCount", it can be up to 5. I even added some extra variables so that the start and end ids for each flow could be passed. You could fill those in in the placeholder. Which is really important because it handles the precendence.. Give it a look..Take the TXT off the end, DTSX isn't normally an allowed type..
CEWII
Right, but how do you plan to divide the working data set for the data flows?
Just an idea so bear with me I'm talking it out loud but here's how I think I'm going to do it:
1)In my stored procedure there will be a parameter. If its 1 it will just return @rowcount, if its 2 it will actually return the data. This is so I don't have to query all that data.
2)In a execute SQL task I'd retrieve the @rowcount from the sp.
3) In a script task (still before the data flows) variable, I'd divide this by the number of "chunks" to get an amount for each chunk.
4) Haven't thought through the query part, thoroughly but each data flow would use this info in its source to query its portion
Again just thinking out loud. I haven't thought through the "how" part. I'm more or less been thinking high level on this so far.
Thanks,
Strick
September 22, 2009 at 1:44 pm
You'll need to be careful with #1. SSIS is very metadata dependant and a stored procedure that returns a different result set based on parameter values can cause problems. How about passing the thread count into the SP and then have the SP break the master result set into chunks, or groups, and assign the chunk/group number as part of the result set. That group number can then be used to route the data to the proper data flow.
Alternativly, use Elliot's suggestion and have the SP return the MIN/MAX key values for the rows for each data flow.
September 22, 2009 at 1:50 pm
stricknyn (9/22/2009)
Just an idea so bear with me I'm talking it out loud but here's how I think I'm going to do it:1)In my stored procedure there will be a parameter. If its 1 it will just return @rowcount, if its 2 it will actually return the data. This is so I don't have to query all that data.
2)In a execute SQL task I'd retrieve the @rowcount from the sp.
3) In a script task (still before the data flows) variable, I'd divide this by the number of "chunks" to get an amount for each chunk.
4) Haven't thought through the query part, thoroughly but each data flow would use this info in its source to query its portion
Again just thinking out loud. I haven't thought through the "how" part. I'm more or less been thinking high level on this so far.
Thanks,
Strick
First I'm not a big fan of Modality... as in I pass in this value and get one return and then I pass in this value and get another..
Ok, so lets look at this a bit further.. I see that you are proposing an auto method to do this.
I see a stored procedure that gives you all of this information:
1. Number of threads up to 10.
2. Start and end id for each thread.
We then assign these values. Let me see what I can do. I'll post again after bit..
CEWII
September 22, 2009 at 2:01 pm
John Rowan (9/22/2009)
You'll need to be careful with #1. SSIS is very metadata dependant and a stored procedure that returns a different result set based on parameter values can cause problems. How about passing the thread count into the SP and then have the SP break the master result set into chunks, or groups, and assign the chunk/group number as part of the result set. That group number can then be used to route the data to the proper data flow.Alternativly, use Elliot's suggestion and have the SP return the MIN/MAX key values for the rows for each data flow.
Hmmm..kinda like this one. Alot less work for SSIS to do. Correct me if I'm wrong, but the only thing with this is that each data flow would query the entire 2 million rows and assign the chunk group instead of just what it needs.
Thanks
Strick
September 22, 2009 at 3:20 pm
Ok, this is kind of rough but it generally does what we want.. The big thing is the query near the beginning, this would be where you would put YOUR query to fill in the variables. The rest calculates start/end for up to 5 flows. It assumes 100,000 record flows, but don't worry, if you have too for that, like 2M rows.. It handles that.. You can map the output directly into the theadcount and start/end variables.
IF EXISTS (SELECT * FROM sysobjects WHERE name = 'GetPackageParameters' AND type = 'P')
DROP PROCEDURE dbo.GetPackageParameters
GO
CREATE PROCEDURE dbo.GetPackageParameters
AS
/*
*
* GetPackageParameters
*
* Author(s): Me
*
* Change Log: Initial Release, 09/22/2009, 1.00.0000
*
*/
BEGIN
SET NOCOUNT ON
SET XACT_ABORT ON
DECLARE @TTLRowCount int,
@MinId int,
@MaxId int,
@ThreadCount int,
@Range int
---- Some query to generate a count with min/max ids..
SELECT @TTLRowCount = COUNT(*),
@MinId = MIN(RecId),
@MaxId = MAX(RecId)
FROM dbo.SomeDataTableWithLotsOfRows
WHERE SomeField = SomeValue
-- -- Test code
--SELECT @TTLRowCount = 500000, --499999,
-- @MinId = 5500,
-- @MaxId = 550000
-- Calc threadcount
SELECT @ThreadCount = @TTLRowCount / 100000
-- Handle case of small count, but some rows..
IF ( @ThreadCount 0 ) SET @ThreadCount = 1
-- If calculated threadcount is high force to max
IF ( @ThreadCount > 5 ) SET @ThreadCount = 5
IF ( @ThreadCount = 1 )
BEGIN -- No additional work..
SELECT ThreadCount = 1,
Flow1StartId = @MinId,
Flow1EndId = @MaxId,
Flow2StartId = NULL,
Flow2EndId = NULL,
Flow3StartId = NULL,
Flow3EndId = NULL,
Flow4StartId = NULL,
Flow4EndId = NULL,
Flow5StartId = NULL,
Flow5EndId = NULL
END
ELSE
BEGIN -- Calc start and end for each..
-- Calc the range
SELECT @Range = ( @MaxId - @MinId ) /@ThreadCount
-- Build little work table
DECLARE @IDS TABLE ( Flow1StartId int,
Flow1EndId int,
Flow2StartId int,
Flow2EndId int,
Flow3StartId int,
Flow3EndId int,
Flow4StartId int,
Flow4EndId int,
Flow5StartId int,
Flow5EndId int )
-- Build the work data
INSERT @IDS
SELECT @MinId,
@MinId + @Range,
@MinId + @Range + 1,
@MinId + ( @Range * 2 ),
@MinId + ( @Range * 2 ) + 1,
@MinId + ( @Range * 3 ),
@MinId + ( @Range * 3 ) + 1,
@MinId + ( @Range * 4 ),
@MinId + ( @Range * 4 ) + 1,
@MinId + ( @Range * 5 )
-- Make sure we know the end..
UPDATE @IDS
SET Flow1EndId = CASE WHEN @ThreadCount = 1 THEN @MaxId ELSE Flow1EndId END,
Flow2EndId = CASE WHEN @ThreadCount = 2 THEN @MaxId ELSE Flow2EndId END,
Flow3EndId = CASE WHEN @ThreadCount = 3 THEN @MaxId ELSE Flow3EndId END,
Flow4EndId = CASE WHEN @ThreadCount = 4 THEN @MaxId ELSE Flow4EndId END,
Flow5EndId = CASE WHEN @ThreadCount = 5 THEN @MaxId ELSE Flow5EndId END
FROM @IDS
-- Clear out end fields
UPDATE @IDS
SET Flow2StartId = CASE WHEN @ThreadCount < 2 THEN NULL ELSE Flow2StartId END,
Flow2EndId = CASE WHEN @ThreadCount < 2 THEN NULL ELSE Flow2EndId END,
Flow3StartId = CASE WHEN @ThreadCount < 3 THEN NULL ELSE Flow3StartId END,
Flow3EndId = CASE WHEN @ThreadCount < 3 THEN NULL ELSE Flow3EndId END,
Flow4StartId = CASE WHEN @ThreadCount < 4 THEN NULL ELSE Flow4StartId END,
Flow4EndId = CASE WHEN @ThreadCount < 4 THEN NULL ELSE Flow4EndId END,
Flow5StartId = CASE WHEN @ThreadCount < 5 THEN NULL ELSE Flow5StartId END,
Flow5EndId = CASE WHEN @ThreadCount < 5 THEN NULL ELSE Flow5EndId END
FROM @IDS
-- Return it all
SELECT ThreadCount = @ThreadCount,
Flow1StartId,
Flow1EndId,
Flow2StartId,
Flow2EndId,
Flow3StartId,
Flow3EndId,
Flow4StartId,
Flow4EndId,
Flow5StartId,
Flow5EndId
FROM @IDS
END
END
GO
CEWII
Viewing 9 posts - 16 through 23 (of 23 total)
You must be logged in to reply to this topic. Login to reply