Collapsing a Table

Question

Post reply

Collapsing a Table

Viewing 9 posts - 31 through 38 (of 38 total)

You must be logged in to reply to this topic. Login to reply

paul_ramster Ten Centuries Points: 1382 More actions · Answer 1

Thanks Jeff, I'll fix that in the production code too!

darn flies!

declare @rangeTable table (start int, [end] int)

insert @rangeTable

(Start, [End])

select 1, 2 union all

select 3, 4 union all

select 5, 6 union all

select 8, 8 union all

select 9, 9 union all

select 10, 10 union all

select 20, 22 union all

select 23, 25 union all

select 26, 30 UNION ALL

select 32, 32 union all

select 34, 35

select s.start

, IsNull((

select MIN(e.[end])

from @rangeTable e

where e.[start] > s.[end]

and not exists (

select 1

from @rangeTable e1

where e1.[start] = e.[end] + 1

)

),s.[end]) as [end]

from @rangeTable s

where not exists (

select 1

from @rangeTable s1

where s1.[end] = s.start - 1

)

Mark Cowne One Orange Chip Points: 26954 More actions · Answer 2

cgreathouse (5/31/2010)
Hello
I'm trying to find a way to collapse a large table I'm working with. I'm noticing that there are a lot of rows that are very similar. The only thing that is different are the columns that specify a range.
So for example (simplified version) I'm trying to go from...
START END
1 2
3 4
5 6
8 8
9 9
10 10
20 22
23 25
26 30
to...
START END
1 6
8 10
20 30
Any suggestions? Here's some code to get it setup...
declare @rangeTable table (start int, [end] int)
insert @rangeTable
select 1, 2 union all
select 3, 4 union all
select 5, 6 union all
select 8, 8 union all
select 9, 9 union all
select 10, 10 union all
select 20, 22 union all
select 23, 25 union all
select 26, 30
Thanks!

Can the ranges have multiple overlaps? So could your source data have this

START END

1 2

1 5 --?? this

3 4

5 6

8 8

9 9

10 10

20 22

23 25

26 30

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

cgreathouse SSC Eights! Points: 855 More actions · Answer 3

I have not seen this case happen yet. The data comes from a an outside source so it wouldn't surprise me if this eventually happens. But it isn't currently happening.

If it does, I my be back to consult with you 😀

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 4

paul_ramster (6/7/2010)

Thanks Jeff, I'll fix that in the production code too!

darn flies!

Ah... my apologies for being the messenger. There's more than one fly... here are the results of your latest...

startend

16

810

2030

3235

3435

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Mark Cowne One Orange Chip Points: 26954 More actions · Answer 5

Jeff Moden (6/8/2010)

paul_ramster (6/7/2010)
Thanks Jeff, I'll fix that in the production code too!
darn flies!
Ah... my apologies for being the messenger. There's more than one fly... here are the results of your latest...
startend
16
810
2030
3235
3435

This should do it

WITH Starts(start) AS (

SELECT a.start

FROM @rangeTable a

WHERE NOT EXISTS(SELECT * FROM @rangeTable b

WHERE a.start = b.[end]+1)),

Ends([end]) AS (

SELECT a.[end]

FROM @rangeTable a

WHERE NOT EXISTS(SELECT * FROM @rangeTable b

WHERE a.[end]+1 = b.start))

SELECT start,

MIN([end]) AS [end]

FROM Starts

INNER JOIN Ends ON [end]>=start

GROUP BY start

ORDER BY start;

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

paul_ramster Ten Centuries Points: 1382 More actions · Answer 6

Thanks for the suggestion.

However, while that has one big advantage (in that it works), it has a problem for me (not SQL2000 compatible). Although that shouldn't matter on this forum, it does matter to me as this code needs to run on SQL2000.

I think just changing the > to a >= in the triangular join in my last suggestion fixes the problem Jeff noted.

This performs better than the CLR solution where the [start] field is not indexed, and the same where it is indexed.

The two versions do give noticeably different results where overlaps occur in the data, but I don't need to allow for that.

select s.start

, IsNull((

select MIN(e.[end])

from @rangeTable e

where e.[end] >= s.[end]

and not exists (

select 1

from @rangeTable e1

where e1.[start] = e.[end] + 1

)

),s.[end]) as [end]

from @rangeTable s

where not exists (

select 1

from @rangeTable s1

where s1.[end] = s.start - 1

)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 7

paul_ramster (6/10/2010)
Thanks for the suggestion.
However, while that has one big advantage (in that it works), it has a problem for me (not SQL2000 compatible). Although that shouldn't matter on this forum, it does matter to me as this code needs to run on SQL2000.
I think just changing the > to a >= in the triangular join in my last suggestion fixes the problem you noted.
This performs better than the CLR solution where the [start] field is not indexed, and the same where it is indexed.
The two versions do give noticeably different results where overlaps occur in the data, but I don't need to allow for that.

Did you retest for performance?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

paul_ramster Ten Centuries Points: 1382 More actions · Answer 8

I've done some performance checks.

I expect the performance would be poor where there are many (millions of) consecutive rows to flatten, but the data I need to process is typically not huge (thousands rather than millions of rows) with only a small number of consecutive rows to flatten.

Where a primary key exists on [start], my SQL and Mark's CLR give the same result (basically 4 clustered index seeks).

Where no primary key exists on [start], my SQL gives four table scans, and Mark's CLR gives four table scans plus an expensive sort, which makes it more expensive.

The real data is somewhat more complex with a compound primary key, and I'm looking for consecutive ranges within a part of the existing primary key. On that data, with the data sizes and strucure that I'm seeing, the performance is more than good enough (the primary key provide enough selectivity that the triangular join deoesn't cause an issue).

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 9

paul_ramster (6/10/2010)
I've done some performance checks.
I expect the performance would be poor where there are many (millions of) consecutive rows to flatten, but the data I need to process is typically not huge (thousands rather than millions of rows) with only a small number of consecutive rows to flatten.
Where a primary key exists on [start], my SQL and Mark's CLR give the same result (basically 4 clustered index seeks).
Where no primary key exists on [start], my SQL gives four table scans, and Mark's CLR gives four table scans plus an expensive sort, which makes it more expensive.
The real data is somewhat more complex with a compound primary key, and I'm looking for consecutive ranges within a part of the existing primary key. On that data, with the data sizes and strucure that I'm seeing, the performance is more than good enough (the primary key provide enough selectivity that the triangular join deoesn't cause an issue).

I just did my own performance testing (see below). My recommendation would be that if you don't want to use the Quirky Update, then use a Cursor or While loop because the performance of the current code is horrible thanks to the triangular join and I strongly recommend that you don't use it. Yeah... I know... you say it's "good enough" because of a supposedly low row count. Famous last words.

Here's a simple test with [font="Arial Black"]only 4500 rows [/font]with a range breaks setup so [font="Arial Black"]only 9 rows are in each range[/font]. The code takes almost 6 seconds on my box and produces more than 13,000 reads. Remember... that for [font="Arial Black"]only [/font]4,500 rows.

Here's the code to create the test table...

CREATE TABLE #RangeTable (start int, [end] int, MinStart INT);

WITH

cteTally AS

( --=== Build a quick little "Tally" table as a row source

SELECT TOP 10000

ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

FROM Master.sys.All_Columns ac1

CROSS JOIN Master.sys.All_Columns ac2

) --=== This will actually only create 4500 rows because of the "ranges"

INSERT INTO #RangeTable

(Start, [End])

SELECT t.N, t.N+1

FROM cteTally t

WHERE t.N%2 = 1

AND (t.n+1)%20 <> 0;

Here's the "current" code you're using... it takes 5.6 seconds (on my box) to process only 4,500 rows...

select s.start

, IsNull((

select MIN(e.[end])

from #RangeTable e

where e.[end] >= s.[end]

and not exists (

select 1

from #RangeTable e1

where e1.[start] = e.[end] + 1

)

),s.[end]) as [end]

from #RangeTable s

where not exists (

select 1

from #RangeTable s1

where s1.[end] = s.start - 1

)

The Quirky Update method (below) takes 16 MILLI-SECONDS and produces 15 reads. Even the worse cursor would only be 8 to 10 times slower meaning that even a cursor would blow the doors off the current method.

DECLARE @MinStart INT,

@PrevEnd INT

;

UPDATE #RangeTable

SET @MinStart = MinStart = CASE

WHEN @PrevEnd+1 <> Start OR @PrevEnd IS NULL

THEN Start

ELSE ISNULL(@MinStart,Start)

END,

@PrevEnd = [End] --Anchor

OPTION (MAXDOP 1) --Absolutely required

;

SELECT MinStart AS Start, MAX([End])

FROM #RangeTable

GROUP BY MinStart

ORDER BY MinStart

;

"Set based" doesn't mean "all in one query". The "current" code you're using has a triangular join in it and the RBAR it produces is thousands of times worse than that of even a Cursor. I, again, strongly urge you to reconsider before you put this computational time bomb into your production code. "Good enough" never is if it's based on low row counts... 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)