Collapsing a Table

  • Thanks Jeff, I'll fix that in the production code too!

    darn flies!

    declare @rangeTable table (start int, [end] int)

    insert @rangeTable

    (Start, [End])

    select 1, 2 union all

    select 3, 4 union all

    select 5, 6 union all

    select 8, 8 union all

    select 9, 9 union all

    select 10, 10 union all

    select 20, 22 union all

    select 23, 25 union all

    select 26, 30 UNION ALL

    select 32, 32 union all

    select 34, 35

    select s.start

    , IsNull((

    select MIN(e.[end])

    from @rangeTable e

    where e.[start] > s.[end]

    and not exists (

    select 1

    from @rangeTable e1

    where e1.[start] = e.[end] + 1

    )

    ),s.[end]) as [end]

    from @rangeTable s

    where not exists (

    select 1

    from @rangeTable s1

    where s1.[end] = s.start - 1

    )

  • cgreathouse (5/31/2010)


    Hello

    I'm trying to find a way to collapse a large table I'm working with. I'm noticing that there are a lot of rows that are very similar. The only thing that is different are the columns that specify a range.

    So for example (simplified version) I'm trying to go from...

    START END

    1 2

    3 4

    5 6

    8 8

    9 9

    10 10

    20 22

    23 25

    26 30

    to...

    START END

    1 6

    8 10

    20 30

    Any suggestions? Here's some code to get it setup...

    declare @rangeTable table (start int, [end] int)

    insert @rangeTable

    select 1, 2 union all

    select 3, 4 union all

    select 5, 6 union all

    select 8, 8 union all

    select 9, 9 union all

    select 10, 10 union all

    select 20, 22 union all

    select 23, 25 union all

    select 26, 30

    Thanks!

    Can the ranges have multiple overlaps? So could your source data have this

    START END

    1 2

    1 5 --?? this

    3 4

    5 6

    8 8

    9 9

    10 10

    20 22

    23 25

    26 30

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • I have not seen this case happen yet. The data comes from a an outside source so it wouldn't surprise me if this eventually happens. But it isn't currently happening.

    If it does, I my be back to consult with you 😀

  • paul_ramster (6/7/2010)


    Thanks Jeff, I'll fix that in the production code too!

    darn flies!

    Ah... my apologies for being the messenger. There's more than one fly... here are the results of your latest...

    startend

    16

    810

    2030

    3235

    3435

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (6/8/2010)


    paul_ramster (6/7/2010)


    Thanks Jeff, I'll fix that in the production code too!

    darn flies!

    Ah... my apologies for being the messenger. There's more than one fly... here are the results of your latest...

    startend

    16

    810

    2030

    3235

    3435

    This should do it

    WITH Starts(start) AS (

    SELECT a.start

    FROM @rangeTable a

    WHERE NOT EXISTS(SELECT * FROM @rangeTable b

    WHERE a.start = b.[end]+1)),

    Ends([end]) AS (

    SELECT a.[end]

    FROM @rangeTable a

    WHERE NOT EXISTS(SELECT * FROM @rangeTable b

    WHERE a.[end]+1 = b.start))

    SELECT start,

    MIN([end]) AS [end]

    FROM Starts

    INNER JOIN Ends ON [end]>=start

    GROUP BY start

    ORDER BY start;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Thanks for the suggestion.

    However, while that has one big advantage (in that it works), it has a problem for me (not SQL2000 compatible). Although that shouldn't matter on this forum, it does matter to me as this code needs to run on SQL2000.

    I think just changing the > to a >= in the triangular join in my last suggestion fixes the problem Jeff noted.

    This performs better than the CLR solution where the [start] field is not indexed, and the same where it is indexed.

    The two versions do give noticeably different results where overlaps occur in the data, but I don't need to allow for that.

    select s.start

    , IsNull((

    select MIN(e.[end])

    from @rangeTable e

    where e.[end] >= s.[end]

    and not exists (

    select 1

    from @rangeTable e1

    where e1.[start] = e.[end] + 1

    )

    ),s.[end]) as [end]

    from @rangeTable s

    where not exists (

    select 1

    from @rangeTable s1

    where s1.[end] = s.start - 1

    )

  • paul_ramster (6/10/2010)


    Thanks for the suggestion.

    However, while that has one big advantage (in that it works), it has a problem for me (not SQL2000 compatible). Although that shouldn't matter on this forum, it does matter to me as this code needs to run on SQL2000.

    I think just changing the > to a >= in the triangular join in my last suggestion fixes the problem you noted.

    This performs better than the CLR solution where the [start] field is not indexed, and the same where it is indexed.

    The two versions do give noticeably different results where overlaps occur in the data, but I don't need to allow for that.

    Did you retest for performance?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I've done some performance checks.

    I expect the performance would be poor where there are many (millions of) consecutive rows to flatten, but the data I need to process is typically not huge (thousands rather than millions of rows) with only a small number of consecutive rows to flatten.

    Where a primary key exists on [start], my SQL and Mark's CLR give the same result (basically 4 clustered index seeks).

    Where no primary key exists on [start], my SQL gives four table scans, and Mark's CLR gives four table scans plus an expensive sort, which makes it more expensive.

    The real data is somewhat more complex with a compound primary key, and I'm looking for consecutive ranges within a part of the existing primary key. On that data, with the data sizes and strucure that I'm seeing, the performance is more than good enough (the primary key provide enough selectivity that the triangular join deoesn't cause an issue).

  • paul_ramster (6/10/2010)


    I've done some performance checks.

    I expect the performance would be poor where there are many (millions of) consecutive rows to flatten, but the data I need to process is typically not huge (thousands rather than millions of rows) with only a small number of consecutive rows to flatten.

    Where a primary key exists on [start], my SQL and Mark's CLR give the same result (basically 4 clustered index seeks).

    Where no primary key exists on [start], my SQL gives four table scans, and Mark's CLR gives four table scans plus an expensive sort, which makes it more expensive.

    The real data is somewhat more complex with a compound primary key, and I'm looking for consecutive ranges within a part of the existing primary key. On that data, with the data sizes and strucure that I'm seeing, the performance is more than good enough (the primary key provide enough selectivity that the triangular join deoesn't cause an issue).

    I just did my own performance testing (see below). My recommendation would be that if you don't want to use the Quirky Update, then use a Cursor or While loop because the performance of the current code is horrible thanks to the triangular join and I strongly recommend that you don't use it. Yeah... I know... you say it's "good enough" because of a supposedly low row count. Famous last words.

    Here's a simple test with [font="Arial Black"]only 4500 rows [/font]with a range breaks setup so [font="Arial Black"]only 9 rows are in each range[/font]. The code takes almost 6 seconds on my box and produces more than 13,000 reads. Remember... that for [font="Arial Black"]only [/font]4,500 rows.

    Here's the code to create the test table...

    CREATE TABLE #RangeTable (start int, [end] int, MinStart INT);

    WITH

    cteTally AS

    ( --=== Build a quick little "Tally" table as a row source

    SELECT TOP 10000

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM Master.sys.All_Columns ac1

    CROSS JOIN Master.sys.All_Columns ac2

    ) --=== This will actually only create 4500 rows because of the "ranges"

    INSERT INTO #RangeTable

    (Start, [End])

    SELECT t.N, t.N+1

    FROM cteTally t

    WHERE t.N%2 = 1

    AND (t.n+1)%20 <> 0;

    Here's the "current" code you're using... it takes 5.6 seconds (on my box) to process only 4,500 rows...

    select s.start

    , IsNull((

    select MIN(e.[end])

    from #RangeTable e

    where e.[end] >= s.[end]

    and not exists (

    select 1

    from #RangeTable e1

    where e1.[start] = e.[end] + 1

    )

    ),s.[end]) as [end]

    from #RangeTable s

    where not exists (

    select 1

    from #RangeTable s1

    where s1.[end] = s.start - 1

    )

    The Quirky Update method (below) takes 16 MILLI-SECONDS and produces 15 reads. Even the worse cursor would only be 8 to 10 times slower meaning that even a cursor would blow the doors off the current method.

    DECLARE @MinStart INT,

    @PrevEnd INT

    ;

    UPDATE #RangeTable

    SET @MinStart = MinStart = CASE

    WHEN @PrevEnd+1 <> Start OR @PrevEnd IS NULL

    THEN Start

    ELSE ISNULL(@MinStart,Start)

    END,

    @PrevEnd = [End] --Anchor

    OPTION (MAXDOP 1) --Absolutely required

    ;

    SELECT MinStart AS Start, MAX([End])

    FROM #RangeTable

    GROUP BY MinStart

    ORDER BY MinStart

    ;

    "Set based" doesn't mean "all in one query". The "current" code you're using has a triangular join in it and the RBAR it produces is thousands of times worse than that of even a Cursor. I, again, strongly urge you to reconsider before you put this computational time bomb into your production code. "Good enough" never is if it's based on low row counts... 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 9 posts - 31 through 38 (of 38 total)

You must be logged in to reply to this topic. Login to reply