Return a set of rows between 2 values (multiple times)

  • Hello

     

    I have a table, a (sample only shown)

    This has a set of ids in it

     

    CREATE TABLE a (id int)

    INSERT INTO a (id) VALUES (1, 4, 8, 15)

     

    I'm trying to create a set (table) that holds the id in column 1 (call it pid) and links all the integers for this value to the next in column 2 (call it cid)

    This is so I can join back to a parent id

    NOTE: this is a table I have inherited that cannot be changed so it doesn't hold a parent id

     

    So this would result in:

    pid   cid

    1   1

    1   2

    1   3

    4  4

    4  5

    4  6

    4  7

    8  8

    9  8

    10  8

    11  8

    12  8

    13  8

    14  8

     

     

    Any thoughts

     

    Thanks

    - Damian

  • So where do these linked integers come from? Why, for example, is 1 linked to 1, 2, and 3, but the numbers 8 to 14 only linked to 8?

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A wrote:

    So where do these linked integers come from? Why, for example, is 1 linked to 1, 2, and 3, but the numbers 8 to 14 only linked to 8?

    Apparently, the PID is the highest CID that's <= to generated CID.  Think "last value not greater than the current value).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    Thom A wrote:

    So where do these linked integers come from? Why, for example, is 1 linked to 1, 2, and 3, but the numbers 8 to 14 only linked to 8?

    Apparently, the PID is the highest CID that's <= to generated CID.  Think "last value not greater than the current value).

    That makes more sense; I was reading it the other way around.

    If that's the case, the OP could make use of GENERATE_SERIES as (based on your recent findings, Jeff), it's performant for a single series, which the OP seems to be using here:

    CREATE TABLE a (id int)
    GO
    INSERT INTO a (id) VALUES (1, 4, 8, 15)--This doesn't work, you've supplied 4 columns, not 4 rows
    GO
    INSERT INTO a (id)
    VALUES (1),(4),(8),(15); --This is 4 rows
    GO

    SELECT a.id,
    GS.value
    FROM GENERATE_SERIES(1,14) GS
    CROSS APPLY (SELECT TOP(1) a.id
    FROM dbo.a
    WHERE a.id <= GS.value
    ORDER BY id DESC) a;
    GO
    DROP TABLE dbo.a;

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A wrote:

    If that's the case, the OP could make use of GENERATE_SERIES as (based on your recent findings, Jeff), it's performant for a single series, which the OP seems to be using here:

    Just to be clear on that, I said that GENERATE_SERIES() is fine for LARGE single series.  😀  It's probably also fine for this smaller series because this looks like a "low hit rate" solution.  If it turns out to have a high hit rate, I'd personally use something like the fnTally function or Itzik Ben-Gan's GetNums function.  That's just me, though.  And thanks for the reminder that I should probably publish the testing I did for that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    And thanks for the reminder that I should probably publish the testing I did for that.

    In truth, I need to do my own testing on it as well. I did do some during one of the early CTP releases, and it (GENERATE_SERIES) was awful. Some of that was fixed later, when they removed the sort operation and changed it to a table value function, but I haven't done any testing since the release. I've been waiting for CU1 instead, as the early CUs tend to have more than a few teething issues addressed and I wanted to give it as fair a test as I could after my initial results.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • perfect, thanks

    Yes, I'd defined VALUES incorrectly - typing rather than cut and paste

    - Damian

  • The " ORDER BY id DESC " portion on your solution is unnecessary. I don't think it help in any way, not even performance. Am I wrong?

    If you want the output in a certain order I would put an "order by" clause on the outer select

    CREATE TABLE a (id int)
    GO

    INSERT INTO a (id)
    VALUES (1),(4),(8),(15); --This is 4 rows
    GO

    SELECT a.id,
    GS.value
    FROM GENERATE_SERIES(1,14) GS
    CROSS APPLY (SELECT TOP(1) id
    FROM dbo.a
    WHERE a.id <= GS.value
    ) a

    Order by 1, 2


    DROP TABLE dbo.a;

    on the outer select statement.

  • Thom A wrote:

    Jeff Moden wrote:

    And thanks for the reminder that I should probably publish the testing I did for that.

    In truth, I need to do my own testing on it as well. I did do some during one of the early CTP releases, and it (GENERATE_SERIES) was awful. Some of that was fixed later, when they removed the sort operation and changed it to a table value function, but I haven't done any testing since the release. I've been waiting for CU1 instead, as the early CUs tend to have more than a few teething issues addressed and I wanted to give it as fair a test as I could after my initial results.

    You're in good company... Erik Darling reported the same.  I also don't load any pre-RTM releases because of similar issues and I normally don't load RTMs.  In fact, I normally wait until about (in the old days) SP-1 or (now adays) CU-3.  I learned my lessons all to well with 2005 and also witnessed the regression travesties of 2012 until they finally fixed all that in SP-3.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • aitorzur@hotmail.com wrote:

    The " ORDER BY id DESC " portion on your solution is unnecessary. I don't think it help in any way, not even performance. Am I wrong?

    No, it's certainly nessecary, otherwise you get an arbitrary TOP (1) row, which could be any value of id with a value lower than or equal to the value returned from GENERATE_SERIES; it's not just necessary, it's mandatory for consistent and reliable results.

    Also, as a side note, I (and many others) are not a fan of the ORDER BY <ordinal position> syntax. Aaron Bertrand wrote a good article many years ago on the subject: Bad Habits to Kick : ORDER BY ordinal

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • DamianC wrote:

    perfect, thanks

    Yes, I'd defined VALUES incorrectly - typing rather than cut and paste

    Just to be sure, are you all set now?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thom A wrote:

    Jeff Moden wrote:

    Thom A wrote:

    So where do these linked integers come from? Why, for example, is 1 linked to 1, 2, and 3, but the numbers 8 to 14 only linked to 8?

    Apparently, the PID is the highest CID that's <= to generated CID.  Think "last value not greater than the current value).

    That makes more sense; I was reading it the other way around.

    If that's the case, the OP could make use of GENERATE_SERIES as (based on your recent findings, Jeff), it's performant for a single series, which the OP seems to be using here:

    CREATE TABLE a (id int)
    GO
    INSERT INTO a (id) VALUES (1, 4, 8, 15)--This doesn't work, you've supplied 4 columns, not 4 rows
    GO
    INSERT INTO a (id)
    VALUES (1),(4),(8),(15); --This is 4 rows
    GO

    SELECT a.id,
    GS.value
    FROM GENERATE_SERIES(1,14) GS
    CROSS APPLY (SELECT TOP(1) a.id
    FROM dbo.a
    WHERE a.id <= GS.value
    ORDER BY id DESC) a;
    GO
    DROP TABLE dbo.a;

    I don't have access to SQL 2022, but I would think that a windowed function would perform much better than the CROSS APPLY.

    SELECT MAX(a.id) OVER(ORDER BY GS.value ROWS UNBOUNDED PRECEDING) AS id,
    GS.value
    FROM GENERATE_SERIES(1,14) GS
    LEFT OUTER JOIN #a AS a
    ON gS.value = a.id

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen wrote:

    I don't have access to SQL 2022, but I would think that a windowed function would perform much better than the CROSS APPLY.

    Really good point on using a Windowing MAX function to do this "Data Smear" and I love it a lot because of its simplicity and elegant nature (and I've seen Drew work a miracle with similar code in a much longer thread).  You know me though, if someone makes an insitu guess at performance, the Dust Bunnies want me to prove things one way or the other. 😀

    If you want to follow along but don't have 2022 to test with, something like fnTally will do the trick here and, so, for others that may also not have 2022, you can get a copy of the fnTally function from the following URL.  It is NOT a direct replacement for GENERATE_SERIES() because it (fnTally) can only start an 0 or 1 and has no "Increment" parameter.  For me, neither is needed about 99.9999% of the time.

    https://www.sqlservercentral.com/scripts/create-a-tally-function-fntally

    Once you have fnTally installed in your test database, just replace "GENERATE_SERIES" with "fnTally" and "t.value" with "t.N" everywhere in the following code.  Since this IS a 2022 forum, we'll continue to use "GENERATE_SERIES". 😉

    The following code generates a much more substantial test of 1000 random INTs (except for "1", which is always present) whose values vary from 1 to 1 MILLION.  The code also guarantees no dupes so you might get slightly fewer than a total of 1000 unique  random values.  It also uses a TEMP Table in TempDB instead of a "real" table.

    --=============================================================================
    -- Create and populate the test table with ~1000 unique random values
    -- ranging in value from "1" to "1000000".
    --=============================================================================
    --===== Conditionally drop and create the test table
    DROP TABLE IF EXISTS #a;
    GO
    CREATE TABLE #a
    (id INT PRIMARY KEY CLUSTERED) --Notice the index here!!!
    ;
    GO
    --===== Populate the table with the test values.
    INSERT INTO #a WITH (TABLOCK) --Required for "Minimal Logging" when CI present.
    (id)
    SELECT id = 1
    UNION
    SELECT id = ABS(CHECKSUM(NEWID())%1000000)+1
    FROM GENERATE_SERIES(1,999) --One has already been assigned
    ORDER BY id --Also sometimes needed for "Minimal Logging".
    ;
    GO

    The first contender is the CROSS APPLY "Data Smear".  Here's the code...

    --=============================================================================
    -- CROSS APPLY "Data Smear"
    --=============================================================================
    --===== "Clear the guns" for the upcoming test.
    -- The code output will be directed to a TEMP Table.
    DROP TABLE IF EXISTS #Result2;
    CHECKPOINT;
    DBCC FREEPROCCACHE;
    DBCC DROPCLEANBUFFERS;
    GO
    --===== CROSS APPLY "Data Smear"
    SET STATISTICS TIME,IO ON
    ;
    SELECT
    pid = GS.value
    ,cid = src.id
    INTO #Result2
    FROM GENERATE_SERIES(1,1000000) GS
    CROSS APPLY (SELECT TOP(1) a.id
    FROM #a a
    WHERE a.id <= GS.value
    ORDER BY id DESC) src
    ;
    SET STATISTICS TIME,IO OFF
    ;
    GO
    -- SELECT * FROM #Result2 ORDER BY pid
    GO

    The run stats on that are pretty impressive for duration but, man... the number of logical reads add up to 15,625 MB (15.6 GB) of memory I/O...

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    Table '#a__________________________________________________________________________________________________________________000000000029'.
    Scan count 1000000, logical reads 2000000, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 2860 ms, elapsed time = 359 ms.

    (1000000 rows affected)

    And, here's the test code for the Windows Function "Data Smear"...

    --=============================================================================
    -- Window Function "Data Smear"
    --=============================================================================
    --===== "Clear the guns" for the upcoming test.
    -- The code output will be directed to a TEMP Table.
    DROP TABLE IF EXISTS #Result1;
    CHECKPOINT;
    DBCC FREEPROCCACHE;
    DBCC DROPCLEANBUFFERS;
    GO
    --===== Window Function "Data Smear"
    SET STATISTICS TIME,IO ON
    ;
    SELECT pid = t.value
    ,cid = MAX(src.id) OVER (ORDER BY t.value ROWS UNBOUNDED PRECEDING)
    INTO #Result1
    FROM GENERATE_SERIES(1,1000000) t
    LEFT JOIN #a src
    ON src.id = t.value
    ORDER BY pid
    ;
    SET STATISTICS TIME,IO OFF
    ;
    -- SELECT * FROM #Result1 ORDER BY pid
    GO

    ... and here are the run stats for that... It takes ~3 times longer but ya just gotta love the reads!

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    Table '#a__________________________________________________________________________________________________________________000000000029'.
    Scan count 5, logical reads 7, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 2425 ms, elapsed time = 938 ms.
    Warning: Null value is eliminated by an aggregate or other SET operation.

    (1000000 rows affected)

    Now... here's the BIG difference.  If you remove the Clustered Index from the #a test table, the Windows Function "Data Smear" still takes the same amount of duration and does better on CPU and better on reads, which were tiny to begin with!

    Under those same conditions (no Clustered Index), the CROSS APPLY is on the miserable side for duration, most likely because the reads increased to a whopping 18,636 MB (16.6 GB)...

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    Table 'Worktable'. Scan count 1000000, logical reads 2385355, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table '#a__________________________________________________________________________________________________________________000000000031'.
    Scan count 1, logical reads 2, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 129704 ms, elapsed time = 126875 ms.

    (1000000 rows affected)

    If you need to shave off more than a half second of duration then, at the expense of needing an index and suffering a comparatively HUGE number of reads, the CROSS APPLY method will do.

    If you don't mind it taking nearly a second and love the idea of almost non-existent reads, then the Windows Function is the clear winner with our without an index.

    And, yes, both went massively parallel and used all 12 logical processors on my laptop.  And, just to say it out loud, they both returned the correct output.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I did, thanks Jeff

    Worked it both ways (with and without GENERATE_SERIES) as I have a 2022 and a 2016 instance

    The suggestion by Thom works, the window amendment by drew also works

    Your tally function suggestion also works - I tweaked a little bit to use variables as my actual set works using a moving start and end id

    Runs in about 3 seconds which is more than good enough for my purpose

     

    Thanks

    - Damian

  • I did, thanks Jeff

    Worked it both ways (with and without GENERATE_SERIES) as I have a 2022 and a 2016 instance

    The suggestion by Thom works, the window amendment by drew also works

    Your tally function suggestion also works - I tweaked a little bit to use variables as my actual set works using a moving start and end id

    Runs in about 3 seconds which is more than good enough for my purpose

    - Damian

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply