June 7, 2010 at 6:08 am
Thanks Jeff, I'll fix that in the production code too!
darn flies!
declare @rangeTable table (start int, [end] int)
insert @rangeTable
(Start, [End])
select 1, 2 union all
select 3, 4 union all
select 5, 6 union all
select 8, 8 union all
select 9, 9 union all
select 10, 10 union all
select 20, 22 union all
select 23, 25 union all
select 26, 30 UNION ALL
select 32, 32 union all
select 34, 35
select s.start
    , IsNull((
        select MIN(e.[end])
        from @rangeTable e
        where e.[start] > s.[end]
        and not exists (
            select 1
            from @rangeTable e1
            where e1.[start] = e.[end] + 1
        )
    ), s.[end]) as [end]
from @rangeTable s
where not exists (
    select 1
    from @rangeTable s1
    where s1.[end] = s.start - 1
)
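[Editor's note: the query above can be emulated outside SQL to see exactly where it goes wrong on the extended data (the 32-32 and 34-35 rows). This Python sketch is an illustration added for clarity, not part of the original thread; it reproduces the subquery's logic literally, including the `e.[start] > s.[end]` predicate.]

```python
def buggy_collapse(rows):
    """Emulate the query above: for each island start, take the smallest
    island end among rows whose start is strictly PAST the current row's
    end -- the predicate that misfires on single-row islands."""
    starts = {s for s, e in rows}
    ends = {e for s, e in rows}
    out = []
    for s_start, s_end in sorted(rows):
        if s_start - 1 in ends:          # not an island start; skip
            continue
        candidates = [e_end for e_start, e_end in rows
                      if e_start > s_end            # the suspect '>' predicate
                      and e_end + 1 not in starts]  # e ends an island
        out.append((s_start, min(candidates) if candidates else s_end))
    return out

rows = [(1, 2), (3, 4), (5, 6), (8, 8), (9, 9), (10, 10),
        (20, 22), (23, 25), (26, 30), (32, 32), (34, 35)]
print(buggy_collapse(rows))
```

Running it yields the spurious (32, 35) row alongside (34, 35): the (32, 32) island wrongly borrows the end of the next island because `e.[start] > s.[end]` looks past the island boundary.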
June 7, 2010 at 7:34 am
cgreathouse (5/31/2010)
Hello, I'm trying to find a way to collapse a large table I'm working with. I'm noticing that there are a lot of rows that are very similar; the only things that differ are the columns that specify a range.
So for example (simplified version) I'm trying to go from...
START END
1 2
3 4
5 6
8 8
9 9
10 10
20 22
23 25
26 30
to...
START END
1 6
8 10
20 30
Any suggestions? Here's some code to get it setup...
declare @rangeTable table (start int, [end] int)
insert @rangeTable
select 1, 2 union all
select 3, 4 union all
select 5, 6 union all
select 8, 8 union all
select 9, 9 union all
select 10, 10 union all
select 20, 22 union all
select 23, 25 union all
select 26, 30
Thanks!
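[Editor's note: for readers following along outside SQL Server, the collapse being asked for is a classic "gaps and islands" merge of contiguous ranges. This Python sketch is an added illustration, not part of the original thread; it assumes the input is a list of (start, end) integer pairs.]

```python
def collapse(ranges):
    """Merge ranges that are contiguous (next start == previous end + 1)
    or overlapping, after sorting by start."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Contiguous or overlapping: extend the current island
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

rows = [(1, 2), (3, 4), (5, 6), (8, 8), (9, 9), (10, 10),
        (20, 22), (23, 25), (26, 30)]
print(collapse(rows))  # [(1, 6), (8, 10), (20, 30)]
```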
Can the ranges have multiple overlaps? So could your source data have this
START END
1 2
1 5 --?? this
3 4
5 6
8 8
9 9
10 10
20 22
23 25
26 30
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537
June 7, 2010 at 7:56 am
I have not seen this case happen yet. The data comes from an outside source, so it wouldn't surprise me if this eventually happens, but it isn't currently happening.
If it does, I may be back to consult with you 😀
June 8, 2010 at 7:55 pm
paul_ramster (6/7/2010)
darn flies!
Ah... my apologies for being the messenger. There's more than one fly... here are the results of your latest...
start end
1 6
8 10
20 30
32 35
34 35
--Jeff Moden
Change is inevitable... Change for the better is not.
June 9, 2010 at 2:24 am
Jeff Moden (6/8/2010)
paul_ramster (6/7/2010)
Thanks Jeff, I'll fix that in the production code too!
darn flies!
Ah... my apologies for being the messenger. There's more than one fly... here are the results of your latest...
start end
1 6
8 10
20 30
32 35
34 35
This should do it
WITH Starts(start) AS (
SELECT a.start
FROM @rangeTable a
WHERE NOT EXISTS(SELECT * FROM @rangeTable b
WHERE a.start = b.[end]+1)),
Ends([end]) AS (
SELECT a.[end]
FROM @rangeTable a
WHERE NOT EXISTS(SELECT * FROM @rangeTable b
WHERE a.[end]+1 = b.start))
SELECT start,
MIN([end]) AS [end]
FROM Starts
INNER JOIN Ends ON [end]>=start
GROUP BY start
ORDER BY start;
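[Editor's note: the two-CTE approach above has a direct set-based reading, sketched here in Python as an added illustration (not part of the thread). A "Start" is a row whose start has no row ending just before it; an "End" is a row whose end has no row starting just after it; each start pairs with the smallest end at or past it.]

```python
def collapse_set_based(rows):
    """Mirror the Starts/Ends CTE pairing: island starts, island ends,
    then MIN(end) >= start for each start."""
    starts = {s for s, e in rows}
    ends = {e for s, e in rows}
    island_starts = sorted(s for s in starts if s - 1 not in ends)
    island_ends = sorted(e for e in ends if e + 1 not in starts)
    return [(s, min(e for e in island_ends if e >= s))
            for s in island_starts]

rows = [(1, 2), (3, 4), (5, 6), (8, 8), (9, 9), (10, 10),
        (20, 22), (23, 25), (26, 30), (32, 32), (34, 35)]
print(collapse_set_based(rows))  # [(1, 6), (8, 10), (20, 30), (32, 32), (34, 35)]
```

Note that on the extended data this correctly keeps (32, 32) as its own island, the case the earlier query got wrong.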
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537
June 10, 2010 at 3:04 am
Thanks for the suggestion.
However, while that has one big advantage (in that it works), it has a problem for me: it isn't SQL 2000 compatible. That shouldn't matter on this forum, but it does matter to me, as this code needs to run on SQL 2000.
I think just changing the > to a >= in the triangular join in my last suggestion fixes the problem Jeff noted.
This performs better than the CLR solution when the [start] field is not indexed, and performs the same when it is indexed.
The two versions do give noticeably different results where overlaps occur in the data, but I don't need to allow for that.
select s.start
    , IsNull((
        select MIN(e.[end])
        from @rangeTable e
        where e.[end] >= s.[end]
        and not exists (
            select 1
            from @rangeTable e1
            where e1.[start] = e.[end] + 1
        )
    ), s.[end]) as [end]
from @rangeTable s
where not exists (
    select 1
    from @rangeTable s1
    where s1.[end] = s.start - 1
)
June 10, 2010 at 7:33 am
paul_ramster (6/10/2010)
Thanks for the suggestion. However, while that has one big advantage (in that it works), it has a problem for me: it isn't SQL 2000 compatible. That shouldn't matter on this forum, but it does matter to me, as this code needs to run on SQL 2000.
I think just changing the > to a >= in the triangular join in my last suggestion fixes the problem you noted.
This performs better than the CLR solution when the [start] field is not indexed, and performs the same when it is indexed.
The two versions do give noticeably different results where overlaps occur in the data, but I don't need to allow for that.
Did you retest for performance?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 10, 2010 at 7:50 am
I've done some performance checks.
I expect the performance would be poor where there are many (millions of) consecutive rows to flatten, but the data I need to process is typically not huge (thousands rather than millions of rows) with only a small number of consecutive rows to flatten.
Where a primary key exists on [start], my SQL and Mark's CLR give the same result (basically 4 clustered index seeks).
Where no primary key exists on [start], my SQL gives four table scans, and Mark's CLR gives four table scans plus an expensive sort, which makes it more expensive.
The real data is somewhat more complex with a compound primary key, and I'm looking for consecutive ranges within a part of the existing primary key. On that data, with the data sizes and structure that I'm seeing, the performance is more than good enough (the primary key provides enough selectivity that the triangular join doesn't cause an issue).
June 11, 2010 at 5:55 am
paul_ramster (6/10/2010)
I've done some performance checks. I expect the performance would be poor where there are many (millions of) consecutive rows to flatten, but the data I need to process is typically not huge (thousands rather than millions of rows) with only a small number of consecutive rows to flatten.
Where a primary key exists on [start], my SQL and Mark's CLR give the same result (basically 4 clustered index seeks).
Where no primary key exists on [start], my SQL gives four table scans, and Mark's CLR gives four table scans plus an expensive sort, which makes it more expensive.
The real data is somewhat more complex with a compound primary key, and I'm looking for consecutive ranges within a part of the existing primary key. On that data, with the data sizes and structure that I'm seeing, the performance is more than good enough (the primary key provides enough selectivity that the triangular join doesn't cause an issue).
I just did my own performance testing (see below). My recommendation would be that if you don't want to use the Quirky Update, then use a Cursor or While loop because the performance of the current code is horrible thanks to the triangular join and I strongly recommend that you don't use it. Yeah... I know... you say it's "good enough" because of a supposedly low row count. Famous last words.
Here's a simple test with [font="Arial Black"]only 4500 rows[/font] with range breaks set up so [font="Arial Black"]only 9 rows are in each range[/font]. The code takes almost 6 seconds on my box and produces more than 13,000 reads. Remember... that's for [font="Arial Black"]only[/font] 4,500 rows.
Here's the code to create the test table...
CREATE TABLE #RangeTable (start int, [end] int, MinStart INT);
WITH
cteTally AS
( --=== Build a quick little "Tally" table as a row source
SELECT TOP 10000
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM Master.sys.All_Columns ac1
CROSS JOIN Master.sys.All_Columns ac2
) --=== This will actually only create 4500 rows because of the "ranges"
INSERT INTO #RangeTable
(Start, [End])
SELECT t.N, t.N+1
FROM cteTally t
WHERE t.N%2 = 1
AND (t.n+1)%20 <> 0;
Here's the "current" code you're using... it takes 5.6 seconds (on my box) to process only 4,500 rows...
select s.start
    , IsNull((
        select MIN(e.[end])
        from #RangeTable e
        where e.[end] >= s.[end]
        and not exists (
            select 1
            from #RangeTable e1
            where e1.[start] = e.[end] + 1
        )
    ), s.[end]) as [end]
from #RangeTable s
where not exists (
    select 1
    from #RangeTable s1
    where s1.[end] = s.start - 1
)
The Quirky Update method (below) takes 16 MILLISECONDS and produces 15 reads. Even the worst cursor would only be 8 to 10 times slower, meaning that even a cursor would blow the doors off the current method.
DECLARE @MinStart INT,
@PrevEnd INT
;
UPDATE #RangeTable
SET @MinStart = MinStart = CASE
WHEN @PrevEnd+1 <> Start OR @PrevEnd IS NULL
THEN Start
ELSE ISNULL(@MinStart,Start)
END,
@PrevEnd = [End] --Anchor
OPTION (MAXDOP 1) --Absolutely required
;
SELECT MinStart AS Start, MAX([End])
FROM #RangeTable
GROUP BY MinStart
ORDER BY MinStart
;
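[Editor's note: the reason the Quirky Update is fast is that it is a single ordered pass that carries running state, then a GROUP BY. This Python sketch of the same single-pass logic is an added illustration, not part of the thread; it assumes sorting stands in for the clustered-index order the UPDATE relies on.]

```python
def collapse_serial(rows):
    """Mirror the Quirky Update: one ordered pass carrying the running
    island start (@MinStart) and previous end (@PrevEnd), then group by
    island start and take MAX(end)."""
    min_start = prev_end = None
    groups = {}
    for start, end in sorted(rows):       # stands in for clustered-index order
        if prev_end is None or prev_end + 1 != start:
            min_start = start             # a new island begins here
        groups[min_start] = max(groups.get(min_start, end), end)
        prev_end = end                    # the "anchor"
    return sorted(groups.items())

rows = [(1, 2), (3, 4), (5, 6), (8, 8), (9, 9), (10, 10),
        (20, 22), (23, 25), (26, 30), (32, 32), (34, 35)]
print(collapse_serial(rows))  # [(1, 6), (8, 10), (20, 30), (32, 32), (34, 35)]
```

The pass touches each row exactly once, which is why the cost stays linear where the triangular join's cost grows with the square of the rows per island.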
"Set based" doesn't mean "all in one query". The "current" code you're using has a triangular join in it and the RBAR it produces is thousands of times worse than that of even a Cursor. I, again, strongly urge you to reconsider before you put this computational time bomb into your production code. "Good enough" never is if it's based on low row counts... 😉
--Jeff Moden
Change is inevitable... Change for the better is not.