Collapsing a Table

  • Hello

    I'm trying to find a way to collapse a large table I'm working with. I'm noticing that there are a lot of rows that are very similar. The only thing that is different are the columns that specify a range.

    So for example (simplified version) I'm trying to go from...

    START END

    1 2

    3 4

    5 6

    8 8

    9 9

    10 10

    20 22

    23 25

    26 30

    to...

    START END

    1 6

    8 10

    20 30

    Any suggestions? Here's some code to get it setup...

    declare @rangeTable table (start int, [end] int)

    insert @rangeTable

    select 1, 2 union all

    select 3, 4 union all

    select 5, 6 union all

    select 8, 8 union all

    select 9, 9 union all

    select 10, 10 union all

    select 20, 22 union all

    select 23, 25 union all

    select 26, 30

    Thanks!

  • I'm not sure if this is the fastest way, but at least it's an option... 😉

    ;

    -- unpivot and group the values

    WITH upvt AS

    (

    SELECT Val

    FROM

    (SELECT START,[END]

    FROM @rangeTable) p

    UNPIVOT

    (Val FOR Dat IN

    (START,[END])

    )AS unpvt

    GROUP BY Val

    ),

    -- divide into subgroups holding consecutive numbers

    numbered AS

    (

    SELECT Val - ROW_NUMBER() OVER(ORDER BY Val) AS grp, Val

    FROM upvt

    )

    -- select min and max value per subgroup

    SELECT MIN(val) AS START, MAX(val) AS [END]

    FROM numbered

    GROUP BY grp



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • lmu92 (5/31/2010)


    I'm not sure if this is the fastest way, but at least it's an option... 😉

    That seems to loose the bubble when the ranges on the same row are more than 1. Here's the results of running that code.

    STARTEND

    16

    810

    2020

    2223

    2526

    3030

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/31/2010)


    lmu92 (5/31/2010)


    I'm not sure if this is the fastest way, but at least it's an option... 😉

    That seems to loose the bubble when the ranges on the same row are more than 1. Here's the results of running that code.

    STARTEND

    16

    810

    2020

    2223

    2526

    3030

    Ok, see your point... Seems like I misunderstood the question. Working on an alternative code.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Here's my take on the problem... sometimes ya just gotta cheat to keep it both simple and fast. 😛

    declare @rangeTable table (start int, [end] int PRIMARY KEY CLUSTERED, MinStart INT)

    DECLARE @MinStart INT,

    @PrevEnd INT

    ;

    insert @rangeTable

    (Start, [End])

    select 1, 2 union all

    select 3, 4 union all

    select 5, 6 union all

    select 8, 8 union all

    select 9, 9 union all

    select 10, 10 union all

    select 20, 22 union all

    select 23, 25 union all

    select 26, 30

    ;

    UPDATE @RangeTable

    SET @MinStart = MinStart = CASE

    WHEN @PrevEnd+1 <> Start OR @PrevEnd IS NULL

    THEN Start

    ELSE ISNULL(@MinStart,Start)

    END,

    @PrevEnd = [End] --Anchor

    OPTION (MAXDOP 1) --Absolutely required

    ;

    SELECT MinStart AS Start, MAX([End])

    FROM @rangetable

    GROUP BY MinStart

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I've been trying to come up with a self referencing solution, but I'll give up now since I won't be able to come even close to your solution. :crying:



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Heh... dammit. I almost didn't post my solution for fear of just that. I really wanted to see a good self referencing solution.:pinch:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ok, here's what I came up with: It's a "double self referencing" table.

    The first one (sub) is used to eliminate consecutive rows and the second one (final select) gets the first and last value of the remaining pairs.

    Like I said: not even close to what you came up with. Neither in terms of performance nor in terms of readability...

    ;

    WITH cte AS

    (

    SELECT ROW_NUMBER() OVER(ORDER BY START ) ROW, *

    FROM @rangeTable

    ),

    sub AS

    (

    SELECT

    ROW_NUMBER() OVER(ORDER BY cte1.start ) rowx,

    cte1.[end],

    cte2.start,

    ISNULL(cte2.start-cte1.[end],0) AS diff

    FROM cte cte1

    FULL JOIN cte cte2

    ON cte1.row=cte2.row-1

    WHERE ISNULL(cte2.start-cte1.[end],0)<>1

    )

    SELECT sub1.start,sub2.[end]

    FROM sub sub1

    INNER JOIN sub sub2

    ON sub1.rowx=sub2.rowx-1



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Not to worry... I was going down the same path before I gave in to the lure of the Quirky Update.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This post is just to include an alternate method. I haven't tested the various methods, but I believe that the "Quirky Update" solution that Jeff presented will probably be the most efficient.

    This method is based on one of the Gap solutions provided by Itzik Ben-Gan in the excellent book SQL SERVER MVP Deep Dives[/url]. This solution is also based upon the sample code that Jeff provided in his solution. See the relevant comments in the code.

    DECLARE @rangeTable TABLE (start INT, [end] INT PRIMARY KEY CLUSTERED, MinStart INT)

    DECLARE @MinStart INT,

    @PrevEnd INT;

    INSERT @rangeTable

    (Start, [End])

    SELECT 1, 2 UNION ALL

    SELECT 3, 4 UNION ALL

    SELECT 5, 6 UNION ALL

    SELECT 8, 8 UNION ALL

    SELECT 9, 9 UNION ALL

    SELECT 10, 10 UNION ALL

    SELECT 20, 22 UNION ALL

    SELECT 23, 25 UNION ALL

    SELECT 26, 30 ;

    -- get the maximum end value to restrict the work the tally table needs to do.

    DECLARE @max-2 INT

    SELECT @max-2 = max([end]) FROM @rangeTable

    -- Note that I'm using an in-line tally table here. You should use a

    -- permanent tally table for better performance!

    -- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.

    -- at http://www.sqlservercentral.com/articles/T-SQL/62867/ for how a tally table can split strings apart.

    ;WITH

    TENS (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

    THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),

    MILLIONS (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),

    TALLY (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM MILLIONS),

    -- the following select statement gets all of the numbers between the start/end columns.

    NumList AS (SELECT t.N

    FROM TALLY t

    JOIN @rangeTable r

    ON t.N BETWEEN r.start AND r.[end]

    WHERE t.N <= @max-2)

    -- get the starting and ending range

    -- cloned from Itzek Ben-Gan's gaps solution #3

    -- in SQL Server MVP Deep Dives

    SELECT [Start] = min(N),

    [End] = max(N)

    FROM (SELECT A.N, grp = N - ROW_NUMBER() OVER (ORDER BY N)

    FROM NumList A) D

    GROUP BY grp

    ORDER BY [Start]

    Edit: removed erronous leading comments

    Edit2: corrected name misspelling

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (5/31/2010)


    This post is just to include an alternate method. I haven't tested the various methods, but I believe that the "Quirky Update" solution that Jeff presented will probably be the most efficient.

    For the very limited number of rows in this example, the two methods take turns winning. Of course, the "Deep Dive" method you presented isn't controversial at all like the Quiky Update is.

    Guess I'll have to build a million row test an find out for sure.

    Thanks for the great post, Wayne. Alternatives are always good.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • BWAA-HAA!!!! I just realized the "Deep Dive" method is nothing more than the now old "difference between two sets of ROW_NUMBER()'s trick. Stupid me for not thinking of it here. :blush:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Well I'll be dipped. I guess I'm glad I didn't think of the "Difference between two row numbers" trick here. Without changing the number of rows in the problem... just changing the values, the "Deep Dive" method takes a terribly deep dive on performance and resource usage. Even a (gasp!) C.U.R.S.O.R. would be faster... MUCH faster. :w00t:

    Here's the test table I used...

    DECLARE @rangeTable TABLE (start INT, [end] INT PRIMARY KEY CLUSTERED, MinStart INT)

    DECLARE @MinStart INT,

    @PrevEnd INT;

    INSERT @rangeTable

    (Start, [End])

    SELECT 100000, 299999 UNION ALL

    SELECT 300000, 499999 UNION ALL

    SELECT 500000, 699999 UNION ALL

    SELECT 800000, 899999 UNION ALL

    SELECT 900000, 999999 UNION ALL

    SELECT 1000000, 1099999 UNION ALL

    SELECT 2000000, 2299999 UNION ALL

    SELECT 2300000, 2599999 UNION ALL

    SELECT 2600000, 3099999

    ;

    Of course, I had to add another "layer" to the "Deep Dive" code so it would go over a million...

    -- Note that I'm using an in-line tally table here. You should use a

    -- permanent tally table for better performance!

    -- See Jeff Modem's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.

    -- at http://www.sqlservercentral.com/articles/T-SQL/62867/ for how a tally table can split strings apart.

    ;WITH

    TENS (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

    THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),

    MILLIONS (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),

    TRILLIONS (N) AS (SELECT 1 FROM MILLIONS t1 CROSS JOIN MILLIONS t2),

    TALLY (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM TRILLIONS),

    -- the following select statement gets all of the numbers between the start/end columns.

    NumList AS (SELECT t.N

    FROM TALLY t

    JOIN @rangeTable r

    ON t.N BETWEEN r.start AND r.[end]

    WHERE t.N <= @max-2 --This is essential to limit the max number of rows returned by cteTally

    )

    -- get the starting and ending range

    -- cloned from Itzek Ben-Gan's gaps solution #3

    -- in SQL Server MVP Deep Dives

    SELECT [Start] = min(N),

    [End] = max(N)

    FROM (--==== This little gem does the now classic "Difference between two sets of row numbers" to form the groups

    SELECT a.N, grp = N - ROW_NUMBER() OVER (ORDER BY N)

    FROM NumList a) D

    GROUP BY grp

    ORDER BY [Start]

    I suppose that if you understand how the "Deep Dive" method works (like I do now), the following results for performance and resources used to solve a 9 row problem won't be too shocking :pinch::sick:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/31/2010)


    Well I'll be dipped. I guess I'm glad I didn't think of the "Difference between two row numbers" trick here. Without changing the number of rows in the problem... just changing the values, the "Deep Dive" method takes a terribly deep dive on performance and resource usage. Even a (gasp!) C.U.R.S.O.R. would be faster... MUCH faster. :w00t:

    I like your play on words here Jeff! (emphasis mine)

    Of course, I had to add another "layer" to the "Deep Dive" code so it would go over a million...

    Well, let me take the appropriate blame here.

    The "Deep Dive" method starts with the list of numbers to determine the islands of numbers - which is just the last select statement in the method that I submitted.

    The in-line TALLY table that I presented is the "standard" in-line tally table that I use.

    The NumList CTE I had to create to get the unique numbers into a dataset to run the values against.

    It is after this point that the "Deep Dive" method starts.

    Since Jeff changed to using not only larger numbers, but also larger ranges, there will be a lot more thrashing on the in-line tally table to create the NumList CTE. As I mentioned in my earlier post, going with a disk-based Tally table in this case will be a LOT better than an in-line, non-indexed tally table generator.

    It's good to see my intuition about the "Quirky Update" prevailing is true...

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Jeff Moden (5/31/2010)


    WayneS (5/31/2010)


    This post is just to include an alternate method. I haven't tested the various methods, but I believe that the "Quirky Update" solution that Jeff presented will probably be the most efficient.

    For the very limited number of rows in this example, the two methods take turns winning. Of course, the "Deep Dive" method you presented isn't controversial at all like the Quiky Update is.

    Guess I'll have to build a million row test an find out for sure.

    Thanks for the great post, Wayne. Alternatives are always good.

    Jeff - your mis-spelling, though accidental, is still entirely accurate! :w00t:

    Maybe is should be re-named to the Quiky Quirky Update???

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

Viewing 15 posts - 1 through 15 (of 38 total)

You must be logged in to reply to this topic. Login to reply