May 31, 2010 at 7:45 am
Hello
I'm trying to find a way to collapse a large table I'm working with. I've noticed that a lot of the rows are very similar; the only things that differ are the columns that specify a range.
So for example (simplified version) I'm trying to go from...
START  END
1      2
3      4
5      6
8      8
9      9
10     10
20     22
23     25
26     30
to...
START  END
1      6
8      10
20     30
Any suggestions? Here's some code to get it set up...
declare @rangeTable table (start int, [end] int)
insert @rangeTable
select 1, 2 union all
select 3, 4 union all
select 5, 6 union all
select 8, 8 union all
select 9, 9 union all
select 10, 10 union all
select 20, 22 union all
select 23, 25 union all
select 26, 30
Thanks!
May 31, 2010 at 12:11 pm
I'm not sure if this is the fastest way, but at least it's an option... 😉
;
-- unpivot and group the endpoint values
WITH upvt AS
(
    SELECT Val
    FROM (SELECT start, [end] FROM @rangeTable) p
    UNPIVOT (Val FOR Dat IN (start, [end])) AS unpvt
    GROUP BY Val
),
-- divide into subgroups holding consecutive numbers
numbered AS
(
    SELECT Val - ROW_NUMBER() OVER(ORDER BY Val) AS grp, Val
    FROM upvt
)
-- select min and max value per subgroup
SELECT MIN(Val) AS [Start], MAX(Val) AS [End]
FROM numbered
GROUP BY grp
May 31, 2010 at 2:03 pm
lmu92 (5/31/2010)
I'm not sure if this is the fastest way, but at least it's an option... 😉
That seems to lose the bubble when the range on a single row spans more than 1. Here's the results of running that code.
START  END
1      6
8      10
20     20
22     23
25     26
30     30
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 3:39 pm
Jeff Moden (5/31/2010)
lmu92 (5/31/2010)
I'm not sure if this is the fastest way, but at least it's an option... 😉
That seems to lose the bubble when the range on a single row spans more than 1. Here's the results of running that code.
START  END
1      6
8      10
20     20
22     23
25     26
30     30
Ok, I see your point... Seems like I misunderstood the question. Working on alternative code.
May 31, 2010 at 4:08 pm
Here's my take on the problem... sometimes ya just gotta cheat to keep it both simple and fast. 😛
declare @rangeTable table (start int, [end] int PRIMARY KEY CLUSTERED, MinStart INT)
DECLARE @MinStart INT,
@PrevEnd INT
;
insert @rangeTable
(Start, [End])
select 1, 2 union all
select 3, 4 union all
select 5, 6 union all
select 8, 8 union all
select 9, 9 union all
select 10, 10 union all
select 20, 22 union all
select 23, 25 union all
select 26, 30
;
UPDATE @RangeTable
SET @MinStart = MinStart = CASE
WHEN @PrevEnd+1 <> Start OR @PrevEnd IS NULL
THEN Start
ELSE ISNULL(@MinStart,Start)
END,
@PrevEnd = [End] --Anchor
OPTION (MAXDOP 1) --Absolutely required
;
SELECT MinStart AS Start, MAX([End]) AS [End]
FROM @rangetable
GROUP BY MinStart
;
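For reference, with the sample data the final SELECT above should return exactly the collapsed ranges asked for in the first post. Here's a quick check (my addition, with ordering added so the comparison is easy):
-- Sanity check: assumes the quirky update above has already run
SELECT MinStart AS [Start], MAX([end]) AS [End]
FROM @rangeTable
GROUP BY MinStart
ORDER BY MinStart
-- Expected: (1, 6), (8, 10), (20, 30)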
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 4:19 pm
I've been trying to come up with a self-referencing solution, but I'll give up now since I won't be able to come even close to your solution. :crying:
May 31, 2010 at 4:22 pm
Heh... dammit. I almost didn't post my solution for fear of just that. I really wanted to see a good self-referencing solution. :pinch:
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 4:45 pm
Ok, here's what I came up with: a "double self-referencing" query.
The first self-join (sub) eliminates consecutive rows; the FULL JOIN also produces two NULL-extended rows that carry the very first start and the very last end. The second self-join (the final select) then pairs each remaining start with the next remaining end to form the collapsed ranges.
Like I said: not even close to what you came up with, neither in terms of performance nor in terms of readability...
;
WITH cte AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY start) AS row, *
    FROM @rangeTable
),
sub AS
(
    SELECT
        ROW_NUMBER() OVER(ORDER BY cte1.start) AS rowx,
        cte1.[end],
        cte2.start,
        ISNULL(cte2.start - cte1.[end], 0) AS diff
    FROM cte cte1
    FULL JOIN cte cte2
      ON cte1.row = cte2.row - 1
    WHERE ISNULL(cte2.start - cte1.[end], 0) <> 1
)
SELECT sub1.start, sub2.[end]
FROM sub sub1
INNER JOIN sub sub2
  ON sub1.rowx = sub2.rowx - 1
May 31, 2010 at 5:45 pm
Not to worry... I was going down the same path before I gave in to the lure of the Quirky Update.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 7:13 pm
This post is just to include an alternate method. I haven't tested the various methods, but I believe that the "Quirky Update" solution that Jeff presented will probably be the most efficient.
This method is based on one of the gaps solutions provided by Itzik Ben-Gan in the excellent book SQL Server MVP Deep Dives. This solution is also based upon the sample code that Jeff provided in his solution. See the relevant comments in the code.
DECLARE @rangeTable TABLE (start INT, [end] INT PRIMARY KEY CLUSTERED, MinStart INT)
DECLARE @MinStart INT,
@PrevEnd INT;
INSERT @rangeTable
(Start, [End])
SELECT 1, 2 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 8, 8 UNION ALL
SELECT 9, 9 UNION ALL
SELECT 10, 10 UNION ALL
SELECT 20, 22 UNION ALL
SELECT 23, 25 UNION ALL
SELECT 26, 30 ;
-- get the maximum end value to restrict the work the tally table needs to do.
DECLARE @MaxEnd INT
SELECT @MaxEnd = MAX([end]) FROM @rangeTable
-- Note that I'm using an in-line tally table here. You should use a
-- permanent tally table for better performance!
-- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.
-- at http://www.sqlservercentral.com/articles/T-SQL/62867/ for how a tally table can split strings apart.
;WITH
TENS (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),
THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),
MILLIONS (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),
TALLY (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM MILLIONS),
-- the following select statement gets all of the numbers between the start/end columns.
NumList AS (SELECT t.N
FROM TALLY t
JOIN @rangeTable r
ON t.N BETWEEN r.start AND r.[end]
WHERE t.N <= @MaxEnd)
-- get the starting and ending range
-- cloned from Itzik Ben-Gan's gaps solution #3
-- in SQL Server MVP Deep Dives
SELECT [Start] = min(N),
[End] = max(N)
FROM (SELECT A.N, grp = N - ROW_NUMBER() OVER (ORDER BY N)
FROM NumList A) D
GROUP BY grp
ORDER BY [Start]
Edit: removed erroneous leading comments
Edit2: corrected name misspelling
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 31, 2010 at 7:39 pm
WayneS (5/31/2010)
This post is just to include an alternate method. I haven't tested the various methods, but I believe that the "Quirky Update" solution that Jeff presented will probably be the most efficient.
For the very limited number of rows in this example, the two methods take turns winning. Of course, the "Deep Dive" method you presented isn't controversial at all like the Quiky Update is.
Guess I'll have to build a million row test and find out for sure.
Thanks for the great post, Wayne. Alternatives are always good.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 7:45 pm
BWAA-HAA!!!! I just realized the "Deep Dive" method is nothing more than the now-old "difference between two sets of ROW_NUMBER()s" trick. Stupid me for not thinking of it here. :blush:
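For anyone who hasn't seen that trick, here's a minimal, self-contained illustration (mine, not code from the thread): with distinct numbers, N - ROW_NUMBER() stays constant within each run of consecutive values, so grouping by that difference isolates the islands.
-- Illustration only: sample numbers chosen for this sketch
DECLARE @nums TABLE (N INT PRIMARY KEY)
INSERT @nums
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 12
SELECT MIN(N) AS [Start], MAX(N) AS [End]
FROM (SELECT N, N - ROW_NUMBER() OVER (ORDER BY N) AS grp FROM @nums) d
GROUP BY grp
ORDER BY [Start]
-- Returns (1,3), (7,8), (12,12)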
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 8:22 pm
Well I'll be dipped. I guess I'm glad I didn't think of the "Difference between two row numbers" trick here. Without changing the number of rows in the problem... just changing the values, the "Deep Dive" method takes a terribly deep dive on performance and resource usage. Even a (gasp!) C.U.R.S.O.R. would be faster... MUCH faster. :w00t:
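For anyone curious, a minimal sketch of the kind of cursor approach being alluded to might look like this (my illustration, not code from the thread; it assumes a @rangeTable populated as in the posts above):
-- Walk the rows in clustered-index order, emitting a range whenever a gap appears
DECLARE @Start INT, @End INT, @CurStart INT, @CurEnd INT
DECLARE @Result TABLE ([Start] INT, [End] INT)
DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT start, [end] FROM @rangeTable ORDER BY start
OPEN c
FETCH NEXT FROM c INTO @Start, @End
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @CurStart IS NULL            -- first row: open the first range
        SELECT @CurStart = @Start, @CurEnd = @End
    ELSE IF @Start > @CurEnd + 1    -- gap found: close out the current range
    BEGIN
        INSERT @Result VALUES (@CurStart, @CurEnd)
        SELECT @CurStart = @Start, @CurEnd = @End
    END
    ELSE IF @End > @CurEnd          -- adjacent/overlapping: extend the current range
        SET @CurEnd = @End
    FETCH NEXT FROM c INTO @Start, @End
END
IF @CurStart IS NOT NULL INSERT @Result VALUES (@CurStart, @CurEnd)
CLOSE c
DEALLOCATE c
SELECT [Start], [End] FROM @Result ORDER BY [Start]
It makes one pass in clustered-index order, so its work depends only on the number of rows, not on how wide the ranges are.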
Here's the test table I used...
DECLARE @rangeTable TABLE (start INT, [end] INT PRIMARY KEY CLUSTERED, MinStart INT)
DECLARE @MinStart INT,
@PrevEnd INT;
INSERT @rangeTable
(Start, [End])
SELECT 100000, 299999 UNION ALL
SELECT 300000, 499999 UNION ALL
SELECT 500000, 699999 UNION ALL
SELECT 800000, 899999 UNION ALL
SELECT 900000, 999999 UNION ALL
SELECT 1000000, 1099999 UNION ALL
SELECT 2000000, 2299999 UNION ALL
SELECT 2300000, 2599999 UNION ALL
SELECT 2600000, 3099999
;
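To see why the dive is so deep, it helps to count the rows the NumList CTE has to materialize for these nine rows (a quick check of my own, not from the thread):
-- NumList generates one row per number covered by the ranges
SELECT SUM([end] - start + 1) AS NumbersGenerated -- 2,000,000 for the table above
FROM @rangeTable
Two million intermediate rows to collapse nine ranges, versus a single pass for the quirky update.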
Of course, I had to add another "layer" to the "Deep Dive" code so it would go over a million...
-- Note that I'm using an in-line tally table here. You should use a
-- permanent tally table for better performance!
-- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.
-- at http://www.sqlservercentral.com/articles/T-SQL/62867/ for how a tally table can split strings apart.
DECLARE @MaxEnd INT
SELECT @MaxEnd = MAX([end]) FROM @rangeTable
;WITH
TENS (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),
THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),
MILLIONS (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),
TRILLIONS (N) AS (SELECT 1 FROM MILLIONS t1 CROSS JOIN MILLIONS t2),
TALLY (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM TRILLIONS),
-- the following select statement gets all of the numbers between the start/end columns.
NumList AS (SELECT t.N
FROM TALLY t
JOIN @rangeTable r
ON t.N BETWEEN r.start AND r.[end]
WHERE t.N <= @MaxEnd --This is essential to limit the max number of rows returned by cteTally
)
-- get the starting and ending range
-- cloned from Itzik Ben-Gan's gaps solution #3
-- in SQL Server MVP Deep Dives
SELECT [Start] = min(N),
[End] = max(N)
FROM (--==== This little gem does the now classic "Difference between two sets of row numbers" to form the groups
SELECT a.N, grp = N - ROW_NUMBER() OVER (ORDER BY N)
FROM NumList a) D
GROUP BY grp
ORDER BY [Start]
I suppose that if you understand how the "Deep Dive" method works (like I do now), the performance and resource results for solving this 9-row problem won't be too shocking. :pinch: :sick:
--Jeff Moden
Change is inevitable... Change for the better is not.
June 1, 2010 at 11:06 am
Jeff Moden (5/31/2010)
Well I'll be dipped. I guess I'm glad I didn't think of the "Difference between two row numbers" trick here. Without changing the number of rows in the problem... just changing the values, the "Deep Dive" method takes a terribly deep dive on performance and resource usage. Even a (gasp!) C.U.R.S.O.R. would be faster... MUCH faster. :w00t:
I like your play on words here Jeff! (emphasis mine)
Of course, I had to add another "layer" to the "Deep Dive" code so it would go over a million...
Well, let me take the appropriate blame here.
The "Deep Dive" method starts with the list of numbers to determine the islands of numbers - which is just the last select statement in the method that I submitted.
The in-line TALLY table that I presented is the "standard" in-line tally table that I use.
I had to create the NumList CTE to get the unique numbers into a data set for the islands logic to run against.
It is after this point that the "Deep Dive" method starts.
Since Jeff changed to using not only larger numbers, but also larger ranges, there will be a lot more thrashing on the in-line tally table to create the NumList CTE. As I mentioned in my earlier post, going with a disk-based Tally table in this case will be a LOT better than an in-line, non-indexed tally table generator.
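In case it helps, here's a minimal sketch of building such a permanent, indexed Tally table (along the lines of the article linked above; the dbo.Tally name and the one-million-row size are my assumptions):
-- Build a one-time, disk-based Tally table of 1..1,000,000
SELECT TOP 1000000 IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N)
The clustered index is what lets the BETWEEN join in the NumList CTE do range seeks instead of scanning a freshly generated row set every time.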
It's good to see that my intuition about the "Quirky Update" prevailing turned out to be true...
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
June 1, 2010 at 11:10 am
Jeff Moden (5/31/2010)
WayneS (5/31/2010)
This post is just to include an alternate method. I haven't tested the various methods, but I believe that the "Quirky Update" solution that Jeff presented will probably be the most efficient.
For the very limited number of rows in this example, the two methods take turns winning. Of course, the "Deep Dive" method you presented isn't controversial at all like the Quiky Update is.
Guess I'll have to build a million row test and find out for sure.
Thanks for the great post, Wayne. Alternatives are always good.
Jeff - your mis-spelling, though accidental, is still entirely accurate! :w00t:
Maybe it should be re-named to the Quiky Quirky Update???
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes