Collapsing a Table

Question

Post reply

Collapsing a Table

cgreathouse

SSC Eights!

Points: 855
More actions
May 31, 2010 at 7:45 am

#221686

Hello
I'm trying to find a way to collapse a large table I'm working with. I'm noticing that there are a lot of rows that are very similar. The only thing that is different are the columns that specify a range.
So for example (simplified version) I'm trying to go from...
START END
1 2
3 4
5 6
8 8
9 9
10 10
20 22
23 25
26 30
to...
START END
1 6
8 10
20 30
Any suggestions? Here's some code to get it setup...
declare @rangeTable table (start int, [end] int)
insert @rangeTable
select 1, 2 union all
select 3, 4 union all
select 5, 6 union all
select 8, 8 union all
select 9, 9 union all
select 10, 10 union all
select 20, 22 union all
select 23, 25 union all
select 26, 30
Thanks!

Viewing 15 posts - 1 through 15 (of 38 total)

You must be logged in to reply to this topic. Login to reply

LutzM SSC Guru Points: 107049 More actions · Answer 1

I'm not sure if this is the fastest way, but at least it's an option... 😉

;

-- unpivot and group the values

WITH upvt AS

(

SELECT Val

FROM

(SELECT START,[END]

FROM @rangeTable) p

UNPIVOT

(Val FOR Dat IN

(START,[END])

)AS unpvt

GROUP BY Val

),

-- divide into subgroups holding consecutive numbers

numbered AS

(

SELECT Val - ROW_NUMBER() OVER(ORDER BY Val) AS grp, Val

FROM upvt

)

-- select min and max value per subgroup

SELECT MIN(val) AS START, MAX(val) AS [END]

FROM numbered

GROUP BY grp

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 2

lmu92 (5/31/2010)
I'm not sure if this is the fastest way, but at least it's an option... 😉

That seems to loose the bubble when the ranges on the same row are more than 1. Here's the results of running that code.

STARTEND

16

810

2020

2223

2526

3030

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

LutzM SSC Guru Points: 107049 More actions · Answer 3

Jeff Moden (5/31/2010)
lmu92 (5/31/2010)
I'm not sure if this is the fastest way, but at least it's an option... 😉
That seems to loose the bubble when the ranges on the same row are more than 1. Here's the results of running that code.
STARTEND
16
810
2020
2223
2526
3030

Ok, see your point... Seems like I misunderstood the question. Working on an alternative code.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 4

Here's my take on the problem... sometimes ya just gotta cheat to keep it both simple and fast. 😛

declare @rangeTable table (start int, [end] int PRIMARY KEY CLUSTERED, MinStart INT)

DECLARE @MinStart INT,

@PrevEnd INT

;

insert @rangeTable

(Start, [End])

select 1, 2 union all

select 3, 4 union all

select 5, 6 union all

select 8, 8 union all

select 9, 9 union all

select 10, 10 union all

select 20, 22 union all

select 23, 25 union all

select 26, 30

;

UPDATE @RangeTable

SET @MinStart = MinStart = CASE

WHEN @PrevEnd+1 <> Start OR @PrevEnd IS NULL

THEN Start

ELSE ISNULL(@MinStart,Start)

END,

@PrevEnd = [End] --Anchor

OPTION (MAXDOP 1) --Absolutely required

;

SELECT MinStart AS Start, MAX([End])

FROM @rangetable

GROUP BY MinStart

;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

LutzM SSC Guru Points: 107049 More actions · Answer 5

I've been trying to come up with a self referencing solution, but I'll give up now since I won't be able to come even close to your solution. :crying:

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 6

Heh... dammit. I almost didn't post my solution for fear of just that. I really wanted to see a good self referencing solution.:pinch:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

LutzM SSC Guru Points: 107049 More actions · Answer 7

Ok, here's what I came up with: It's a "double self referencing" table.

The first one (sub) is used to eliminate consecutive rows and the second one (final select) gets the first and last value of the remaining pairs.

Like I said: not even close to what you came up with. Neither in terms of performance nor in terms of readability...

;

WITH cte AS

(

SELECT ROW_NUMBER() OVER(ORDER BY START ) ROW, *

FROM @rangeTable

),

sub AS

(

SELECT

ROW_NUMBER() OVER(ORDER BY cte1.start ) rowx,

cte1.[end],

cte2.start,

ISNULL(cte2.start-cte1.[end],0) AS diff

FROM cte cte1

FULL JOIN cte cte2

ON cte1.row=cte2.row-1

WHERE ISNULL(cte2.start-cte1.[end],0)<>1

)

SELECT sub1.start,sub2.[end]

FROM sub sub1

INNER JOIN sub sub2

ON sub1.rowx=sub2.rowx-1

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 8

Not to worry... I was going down the same path before I gave in to the lure of the Quirky Update.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

WayneS SSC Guru Points: 95460 More actions · Answer 9

This post is just to include an alternate method. I haven't tested the various methods, but I believe that the "Quirky Update" solution that Jeff presented will probably be the most efficient.

This method is based on one of the Gap solutions provided by Itzik Ben-Gan in the excellent book SQL SERVER MVP Deep Dives[/url]. This solution is also based upon the sample code that Jeff provided in his solution. See the relevant comments in the code.

DECLARE @rangeTable TABLE (start INT, [end] INT PRIMARY KEY CLUSTERED, MinStart INT)

DECLARE @MinStart INT,

@PrevEnd INT;

INSERT @rangeTable

(Start, [End])

SELECT 1, 2 UNION ALL

SELECT 3, 4 UNION ALL

SELECT 5, 6 UNION ALL

SELECT 8, 8 UNION ALL

SELECT 9, 9 UNION ALL

SELECT 10, 10 UNION ALL

SELECT 20, 22 UNION ALL

SELECT 23, 25 UNION ALL

SELECT 26, 30 ;

-- get the maximum end value to restrict the work the tally table needs to do.

DECLARE @max-2 INT

SELECT @max-2 = max([end]) FROM @rangeTable

-- Note that I'm using an in-line tally table here. You should use a

-- permanent tally table for better performance!

-- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.

-- at http://www.sqlservercentral.com/articles/T-SQL/62867/ for how a tally table can split strings apart.

;WITH

TENS (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),

MILLIONS (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),

TALLY (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM MILLIONS),

-- the following select statement gets all of the numbers between the start/end columns.

NumList AS (SELECT t.N

FROM TALLY t

JOIN @rangeTable r

ON t.N BETWEEN r.start AND r.[end]

WHERE t.N <= @max-2)

-- get the starting and ending range

-- cloned from Itzek Ben-Gan's gaps solution #3

-- in SQL Server MVP Deep Dives

SELECT [Start] = min(N),

[End] = max(N)

FROM (SELECT A.N, grp = N - ROW_NUMBER() OVER (ORDER BY N)

FROM NumList A) D

GROUP BY grp

ORDER BY [Start]

Edit: removed erronous leading comments

Edit2: corrected name misspelling

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 10

WayneS (5/31/2010)
This post is just to include an alternate method. I haven't tested the various methods, but I believe that the "Quirky Update" solution that Jeff presented will probably be the most efficient.

For the very limited number of rows in this example, the two methods take turns winning. Of course, the "Deep Dive" method you presented isn't controversial at all like the Quiky Update is.

Guess I'll have to build a million row test an find out for sure.

Thanks for the great post, Wayne. Alternatives are always good.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 11

BWAA-HAA!!!! I just realized the "Deep Dive" method is nothing more than the now old "difference between two sets of ROW_NUMBER()'s trick. Stupid me for not thinking of it here. :blush:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 12

Well I'll be dipped. I guess I'm glad I didn't think of the "Difference between two row numbers" trick here. Without changing the number of rows in the problem... just changing the values, the "Deep Dive" method takes a terribly deep dive on performance and resource usage. Even a (gasp!) C.U.R.S.O.R. would be faster... MUCH faster. :w00t:

Here's the test table I used...

DECLARE @rangeTable TABLE (start INT, [end] INT PRIMARY KEY CLUSTERED, MinStart INT)

DECLARE @MinStart INT,

@PrevEnd INT;

INSERT @rangeTable

(Start, [End])

SELECT 100000, 299999 UNION ALL

SELECT 300000, 499999 UNION ALL

SELECT 500000, 699999 UNION ALL

SELECT 800000, 899999 UNION ALL

SELECT 900000, 999999 UNION ALL

SELECT 1000000, 1099999 UNION ALL

SELECT 2000000, 2299999 UNION ALL

SELECT 2300000, 2599999 UNION ALL

SELECT 2600000, 3099999

;

Of course, I had to add another "layer" to the "Deep Dive" code so it would go over a million...

-- Note that I'm using an in-line tally table here. You should use a

-- permanent tally table for better performance!

-- See Jeff Modem's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.

-- at http://www.sqlservercentral.com/articles/T-SQL/62867/ for how a tally table can split strings apart.

;WITH

TENS (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),

MILLIONS (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),

TRILLIONS (N) AS (SELECT 1 FROM MILLIONS t1 CROSS JOIN MILLIONS t2),

TALLY (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM TRILLIONS),

-- the following select statement gets all of the numbers between the start/end columns.

NumList AS (SELECT t.N

FROM TALLY t

JOIN @rangeTable r

ON t.N BETWEEN r.start AND r.[end]

WHERE t.N <= @max-2 --This is essential to limit the max number of rows returned by cteTally

)

-- get the starting and ending range

-- cloned from Itzek Ben-Gan's gaps solution #3

-- in SQL Server MVP Deep Dives

SELECT [Start] = min(N),

[End] = max(N)

FROM (--==== This little gem does the now classic "Difference between two sets of row numbers" to form the groups

SELECT a.N, grp = N - ROW_NUMBER() OVER (ORDER BY N)

FROM NumList a) D

GROUP BY grp

ORDER BY [Start]

I suppose that if you understand how the "Deep Dive" method works (like I do now), the following results for performance and resources used to solve a 9 row problem won't be too shocking :pinch::sick:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

WayneS SSC Guru Points: 95460 More actions · Answer 13

Jeff Moden (5/31/2010)
Well I'll be dipped. I guess I'm glad I didn't think of the "Difference between two row numbers" trick here. Without changing the number of rows in the problem... just changing the values, the "Deep Dive" method takes a terribly deep dive on performance and resource usage. Even a (gasp!) C.U.R.S.O.R. would be faster... MUCH faster. :w00t:

I like your play on words here Jeff! (emphasis mine)

Of course, I had to add another "layer" to the "Deep Dive" code so it would go over a million...

Well, let me take the appropriate blame here.

The "Deep Dive" method starts with the list of numbers to determine the islands of numbers - which is just the last select statement in the method that I submitted.

The in-line TALLY table that I presented is the "standard" in-line tally table that I use.

The NumList CTE I had to create to get the unique numbers into a dataset to run the values against.

It is after this point that the "Deep Dive" method starts.

Since Jeff changed to using not only larger numbers, but also larger ranges, there will be a lot more thrashing on the in-line tally table to create the NumList CTE. As I mentioned in my earlier post, going with a disk-based Tally table in this case will be a LOT better than an in-line, non-indexed tally table generator.

It's good to see my intuition about the "Quirky Update" prevailing is true...

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

WayneS SSC Guru Points: 95460 More actions · Answer 14

Jeff Moden (5/31/2010)
WayneS (5/31/2010)
This post is just to include an alternate method. I haven't tested the various methods, but I believe that the "Quirky Update" solution that Jeff presented will probably be the most efficient.
For the very limited number of rows in this example, the two methods take turns winning. Of course, the "Deep Dive" method you presented isn't controversial at all like the Quiky Update is.
Guess I'll have to build a million row test an find out for sure.
Thanks for the great post, Wayne. Alternatives are always good.

Jeff - your mis-spelling, though accidental, is still entirely accurate! :w00t:

Maybe is should be re-named to the Quiky Quirky Update???

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2