January 26, 2006 at 7:37 am
First of all, I'm looking for pointers and am eager to work out the answer to my problem myself if I can just get some help getting started in the right direction. This post is a plea for help, not a request for someone to do it for me. My situation is as follows:
We have 80+ source systems that upload data for our data warehouse once a week. Currently, they are handled as a single group for ETL processing, but due to the way various checks occur it would be a lot more efficient to handle the records in groups by source system, or at least in small batches of source systems. Testing by hard-coding ID sets in a testing environment has shown the ETL server works at its best with 7 parallel threads. It also works much better to pre-segment the records into those threads based on source ID than to round-robin the records without regard to incoming source. So you can see, some amount of testing and investigating has already occurred.
Load jobs kick off at regular intervals on upload day. The load process doesn't care ahead of time how many files from which sources are getting ready to load but just grabs all the files and throws them in a staging table. The trick is, there is no schedule so that source X always gets their file in the noon job, for example. Source X might come in at 2AM one week, and 5PM the next week. To further complicate matters, 1 of the sources may get their file in by 2AM but 30 get their files in for the noon job and vice-versa the next week. Please note that this question is about selecting from the staging table onwards, NOT about how to deal with the raw files.
It is easy to select a distinct list of the current batch's source system ID numbers from the stage table. So how do I select out of the stage table, without knowing ahead of time what ID numbers are in the list, a set of 7 reasonably distributed groups of ID numbers? Assume that the source systems send up reasonably equal amounts of records. The problem is that given the random nature of the uploads, any hard-coded set of ID ranges is going to be wildly unbalanced in practice. One last thing: the source_ID numbers are integers between 1 and 140 with gaps; only about 80 are actually used.
The ideal end result is simply an add-on to the WHERE clause like: "source_id between (calculate range_start) and (calculate range_end)" for ranges 1 to 7.
I've already tried a convoluted set of selects using "top 14 percent" but it turns into a mess after the 5th level.
Any ideas on an algorithm I can use to calculate the range_start and range_end values that can be easily dropped into a select's where clause? I'll have 7 different selects, one for each process headed for the ETL server, and each needs its own range values calculated on the fly.
Thanks for reading all the way to the end and thanks for any pointers.
January 26, 2006 at 7:51 am
How about using the modulo operator?
The first group would be "where id % 7 = 0". The next group "where id % 7 = 1" and so on up to "where id % 7 = 6".
I hope this makes sense...
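As a rough sketch of what the add-on would look like (I'm assuming your stage table and ID column are named something like dbo.Stage_Source and source_id, just to have names to write against):
-- sketch only; table/column names are assumptions
-- thread 1 of 7:
select * from dbo.Stage_Source where source_id % 7 = 0
-- thread 2 of 7:
select * from dbo.Stage_Source where source_id % 7 = 1
-- ... and so on up to % 7 = 6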
January 26, 2006 at 11:43 am
It makes sense, but it also makes very uneven groupings. Trying that on a test case put four sources into the first thread, three into the second, none into the third, and one each into the remaining four. And if the ID numbers fall just right, it could be even more lopsided than that.
January 26, 2006 at 12:07 pm
A variant on this idea does work nicely. However, it requires being bounced through a temp table:
-- the identity column gives each distinct source a dense 1..N position,
-- so taking that position modulo 7 spreads the sources evenly
create table #foo (id_key int identity(1,1), source_id_key int);
insert into #foo select distinct source_id_key from dbo.Stage_Source;

-- one select per ETL thread
select source_id_key from #foo where id_key % 7 = 0
select source_id_key from #foo where id_key % 7 = 1
select source_id_key from #foo where id_key % 7 = 2
select source_id_key from #foo where id_key % 7 = 3
select source_id_key from #foo where id_key % 7 = 4
select source_id_key from #foo where id_key % 7 = 5
select source_id_key from #foo where id_key % 7 = 6
Alas, this assumes I can create a temp table, which I can't, because the ETL program allows additions to the 'where' clause of its select but nothing preemptive. I also really, really, really want to avoid maintaining a permanent table. Any ideas on how to get these results without bouncing through an intermediary table?
January 27, 2006 at 3:08 am
You could try the following:
declare @Foo table (source_id_key int);
insert into @Foo select 1
insert into @Foo select 2
insert into @Foo select 4
insert into @Foo select 6
insert into @Foo select 7
insert into @Foo select 8
insert into @Foo select 10
insert into @Foo select 13
insert into @Foo select 19
-- the correlated subquery is each ID's 0-based rank among the distinct IDs;
-- take that rank modulo 7 to pick the group
select source_id_key from @Foo f where
(select count(*) from @Foo f2 where f2.source_id_key < f.source_id_key) % 7 = 0 -- or 1, 2, 3, 4, 5, 6
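If you truly cannot create anything up front, the same ranking can be folded into the WHERE clause itself. A rough sketch against the stage table you mentioned (everything after WHERE is the add-on; the correlated subquery is evaluated per row, so I have not checked how it performs on a full week's batch):
select *
from dbo.Stage_Source
where (select count(distinct s2.source_id_key)
       from dbo.Stage_Source s2
       where s2.source_id_key < Stage_Source.source_id_key) % 7 = 0 -- or 1, 2, 3, 4, 5, 6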
January 27, 2006 at 8:03 am
Can you call a function? If yes, just put all the logic of finding the from-to limits for each chunk of data into a function and call it in the WHERE clause. I'm not sure whether it will perform better than Jesper's solution; it's just an idea of how you could approach the problem. After all, you were asking more for ideas than solutions, so I guess it should be OK. It could be something like this:
/* environment: build a test table of numbers 1 to 199, then add a few duplicates and delete one value to create a gap */
CREATE TABLE divide (number int)
DECLARE @loop int
SET @loop = 1
WHILE @loop < 200
BEGIN
insert into divide(number) select @loop
SET @loop = @loop+1
END
insert into divide (number) values (1)
insert into divide (number) values (1)
insert into divide (number) values (99)
insert into divide (number) values (50)
insert into divide (number) values (9)
insert into divide (number) values (99)
insert into divide (number) values (99)
delete from divide where number = 24
CREATE INDEX idx_nbr on divide(number)
/*function*/
/*parameters:
@cnt - count of chunks you want to create - i.e. 7 in your example
@chunk - sequence number of data chunk (batch)
@type - 'MIN' or 'MAX'*/
CREATE FUNCTION dbo.divide_parts(@cnt int, @chunk INT, @type char(3))
RETURNS INT
AS
BEGIN
    DECLARE @chunksize INT, @limit INT
    SET @chunksize = (select count(distinct number) from divide)/@cnt

    /* the lower bound of chunk N is really the upper bound of chunk N-1 */
    IF @type = 'MIN' SET @chunk = @chunk-1

    /* the last chunk's upper bound is the overall maximum, which also picks up any remainder */
    IF @chunk = @cnt AND @type = 'MAX' SET @limit = (select MAX(number) from divide)
    ELSE
    BEGIN
        IF @chunk = 0 AND @type = 'MIN' SET @limit = 0
        ELSE
            /* pocet = how many distinct numbers are smaller, i.e. each number's 0-based rank;
               take the largest number whose rank is still below @chunksize * @chunk */
            SELECT @limit = MAX(number)
            FROM
                (SELECT n1.number, count(*) as pocet
                 FROM (select distinct number from divide) n1
                 JOIN (select distinct number from divide) n2 ON n2.number < n1.number
                 GROUP BY n1.number) q
            WHERE pocet < @chunksize * @chunk
    END
    RETURN @limit
END
/*call in WHERE clause : divide into 7 parts, this is part No. 4*/
SELECT *
FROM divide
WHERE number > dbo.divide_parts(7,4,'MIN')
AND number <= dbo.divide_parts(7,4,'MAX')
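If something like this were pointed at the stage table instead, the function body would select distinct source_id_key from dbo.Stage_Source rather than from divide (that rewrite is assumed here, not shown), and the add-on for, say, thread 4 of 7 would look like:
/* sketch only: assumes dbo.divide_parts has been rewritten to read
   distinct source_id_key values from dbo.Stage_Source instead of divide */
SELECT *
FROM dbo.Stage_Source
WHERE source_id_key > dbo.divide_parts(7,4,'MIN')
AND source_id_key <= dbo.divide_parts(7,4,'MAX')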