Return a set of rows between 2 values (multiple times)

Question

Return a set of rows between 2 values (multiple times)

DamianC

SSCrazy Eights

Points: 8023
More actions
February 9, 2023 at 2:18 pm

#4147552
Hello

I have a table, a (sample only shown)
This has a set of ids in it
```
CREATE TABLE a (id int)

INSERT INTO a (id) VALUES (1, 4, 8, 15)
```
I'm trying to create a set (table) that holds the id in column 1 (call it pid) and links all the integers for this value to the next in column 2 (call it cid)
This is so I can join back to a parent id
NOTE: this is a table I have inherited that cannot be changed so it doesn't hold a parent id

So this would result in:
pid cid
1 1
1 2
1 3
4 4
4 5
4 6
4 7
8 8
9 8
10 8
11 8
12 8
13 8
14 8

Any thoughts

Thanks
- Damian

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

Thom A SSC Guru Points: 99128 More actions · Answer 1

So where do these linked integers come from? Why, for example, is 1 linked to 1, 2, and 3, but the numbers 8 to 14 only linked to 8?

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Jeff Moden SSC Guru Points: 1004304 More actions · Answer 2

Thom A wrote:

So where do these linked integers come from? Why, for example, is 1 linked to 1, 2, and 3, but the numbers 8 to 14 only linked to 8?

Apparently, the PID is the highest CID that's <= to generated CID. Think "last value not greater than the current value).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Thom A SSC Guru Points: 99128 More actions · Answer 3

Jeff Moden wrote:

Thom A wrote:
So where do these linked integers come from? Why, for example, is 1 linked to 1, 2, and 3, but the numbers 8 to 14 only linked to 8?
Apparently, the PID is the highest CID that's <= to generated CID. Think "last value not greater than the current value).

That makes more sense; I was reading it the other way around.

If that's the case, the OP could make use of GENERATE_SERIES as (based on your recent findings, Jeff), it's performant for a single series, which the OP seems to be using here:

CREATE TABLE a (id int)
GO
INSERT INTO a (id) VALUES (1, 4, 8, 15)--This doesn't work, you've supplied 4 columns, not 4 rows
GO
INSERT INTO a (id)
VALUES (1),(4),(8),(15); --This is 4 rows
GO

SELECT a.id,
       GS.value
FROM GENERATE_SERIES(1,14) GS
     CROSS APPLY (SELECT TOP(1) a.id
                  FROM dbo.a
                  WHERE a.id <= GS.value
                  ORDER BY id DESC) a;
GO
DROP TABLE dbo.a;

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Jeff Moden SSC Guru Points: 1004304 More actions · Answer 4

Thom A wrote:

If that's the case, the OP could make use of GENERATE_SERIES as (based on your recent findings, Jeff), it's performant for a single series, which the OP seems to be using here:

Just to be clear on that, I said that GENERATE_SERIES() is fine for LARGE single series. 😀 It's probably also fine for this smaller series because this looks like a "low hit rate" solution. If it turns out to have a high hit rate, I'd personally use something like the fnTally function or Itzik Ben-Gan's GetNums function. That's just me, though. And thanks for the reminder that I should probably publish the testing I did for that.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Thom A SSC Guru Points: 99128 More actions · Answer 5

Jeff Moden wrote:

And thanks for the reminder that I should probably publish the testing I did for that.

In truth, I need to do my own testing on it as well. I did do some during one of the early CTP releases, and it (GENERATE_SERIES) was awful. Some of that was fixed later, when they removed the sort operation and changed it to a table value function, but I haven't done any testing since the release. I've been waiting for CU1 instead, as the early CUs tend to have more than a few teething issues addressed and I wanted to give it as fair a test as I could after my initial results.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

DamianC SSCrazy Eights Points: 8023 More actions · Answer 6

perfect, thanks

Yes, I'd defined VALUES incorrectly - typing rather than cut and paste

- Damian

aitorzur@hotmail.com SSC Rookie Points: 32 More actions · Answer 7

The " ORDER BY id DESC " portion on your solution is unnecessary. I don't think it help in any way, not even performance. Am I wrong?

If you want the output in a certain order I would put an "order by" clause on the outer select

CREATE TABLE a (id int)
GO

INSERT INTO a (id)
VALUES (1),(4),(8),(15); --This is 4 rows
GO

SELECT a.id,
       GS.value
FROM GENERATE_SERIES(1,14) GS
     CROSS APPLY (SELECT TOP(1) id
                  FROM dbo.a
                  WHERE a.id <= GS.value
                  ) a

Order by 1, 2


DROP TABLE dbo.a;

on the outer select statement.

This reply was modified 1 year, 9 months ago by aitorzur@hotmail.com.

Jeff Moden SSC Guru Points: 1004304 More actions · Answer 8

Thom A wrote:

Jeff Moden wrote:
And thanks for the reminder that I should probably publish the testing I did for that.
In truth, I need to do my own testing on it as well. I did do some during one of the early CTP releases, and it (GENERATE_SERIES) was awful. Some of that was fixed later, when they removed the sort operation and changed it to a table value function, but I haven't done any testing since the release. I've been waiting for CU1 instead, as the early CUs tend to have more than a few teething issues addressed and I wanted to give it as fair a test as I could after my initial results.

You're in good company... Erik Darling reported the same. I also don't load any pre-RTM releases because of similar issues and I normally don't load RTMs. In fact, I normally wait until about (in the old days) SP-1 or (now adays) CU-3. I learned my lessons all to well with 2005 and also witnessed the regression travesties of 2012 until they finally fixed all that in SP-3.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Thom A SSC Guru Points: 99128 More actions · Answer 9

aitorzur@hotmail.com wrote:

The " ORDER BY id DESC " portion on your solution is unnecessary. I don't think it help in any way, not even performance. Am I wrong?

No, it's certainly nessecary, otherwise you get an arbitrary TOP (1) row, which could be any value of id with a value lower than or equal to the value returned from GENERATE_SERIES; it's not just necessary, it's mandatory for consistent and reliable results.

Also, as a side note, I (and many others) are not a fan of the ORDER BY <ordinal position> syntax. Aaron Bertrand wrote a good article many years ago on the subject: Bad Habits to Kick : ORDER BY ordinal

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Jeff Moden SSC Guru Points: 1004304 More actions · Answer 10

DamianC wrote:

perfect, thanks
Yes, I'd defined VALUES incorrectly - typing rather than cut and paste

Just to be sure, are you all set now?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

drew.allen SSC Guru Points: 76985 More actions · Answer 11

Thom A wrote:

Jeff Moden wrote:
Thom A wrote:
So where do these linked integers come from? Why, for example, is 1 linked to 1, 2, and 3, but the numbers 8 to 14 only linked to 8?
Apparently, the PID is the highest CID that's <= to generated CID. Think "last value not greater than the current value).
That makes more sense; I was reading it the other way around.
If that's the case, the OP could make use of GENERATE_SERIES as (based on your recent findings, Jeff), it's performant for a single series, which the OP seems to be using here:
CREATE TABLE a (id int)
GO
INSERT INTO a (id) VALUES (1, 4, 8, 15)--This doesn't work, you've supplied 4 columns, not 4 rows
GO
INSERT INTO a (id)
VALUES (1),(4),(8),(15); --This is 4 rows
GO

SELECT a.id,
       GS.value
FROM GENERATE_SERIES(1,14) GS
     CROSS APPLY (SELECT TOP(1) a.id
                  FROM dbo.a
                  WHERE a.id <= GS.value
                  ORDER BY id DESC) a;
GO
DROP TABLE dbo.a;

I don't have access to SQL 2022, but I would think that a windowed function would perform much better than the CROSS APPLY.

SELECT MAX(a.id) OVER(ORDER BY GS.value ROWS UNBOUNDED PRECEDING) AS id,
       GS.value
FROM GENERATE_SERIES(1,14) GS
LEFT OUTER JOIN #a AS a
    ON gS.value = a.id

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Jeff Moden SSC Guru Points: 1004304 More actions · Answer 12

drew.allen wrote:

I don't have access to SQL 2022, but I would think that a windowed function would perform much better than the CROSS APPLY.

Really good point on using a Windowing MAX function to do this "Data Smear" and I love it a lot because of its simplicity and elegant nature (and I've seen Drew work a miracle with similar code in a much longer thread). You know me though, if someone makes an insitu guess at performance, the Dust Bunnies want me to prove things one way or the other. 😀

If you want to follow along but don't have 2022 to test with, something like fnTally will do the trick here and, so, for others that may also not have 2022, you can get a copy of the fnTally function from the following URL. It is NOT a direct replacement for GENERATE_SERIES() because it (fnTally) can only start an 0 or 1 and has no "Increment" parameter. For me, neither is needed about 99.9999% of the time.

https://www.sqlservercentral.com/scripts/create-a-tally-function-fntally

Once you have fnTally installed in your test database, just replace "GENERATE_SERIES" with "fnTally" and "t.value" with "t.N" everywhere in the following code. Since this IS a 2022 forum, we'll continue to use "GENERATE_SERIES". 😉

The following code generates a much more substantial test of 1000 random INTs (except for "1", which is always present) whose values vary from 1 to 1 MILLION. The code also guarantees no dupes so you might get slightly fewer than a total of 1000 unique random values. It also uses a TEMP Table in TempDB instead of a "real" table.

--=============================================================================
--      Create and populate the test table with ~1000 unique random values
--      ranging in value from "1" to "1000000".
--=============================================================================
--===== Conditionally drop and create the test table
   DROP TABLE IF EXISTS #a;
GO
 CREATE TABLE #a
        (id INT PRIMARY KEY CLUSTERED) --Notice the index here!!!
;
GO
--===== Populate the table with the test values.
 INSERT INTO #a WITH (TABLOCK) --Required for "Minimal Logging" when CI present.
        (id)
 SELECT id = 1
  UNION 
 SELECT id = ABS(CHECKSUM(NEWID())%1000000)+1
   FROM GENERATE_SERIES(1,999) --One has already been assigned
  ORDER BY id --Also sometimes needed for "Minimal Logging".
;
GO

The first contender is the CROSS APPLY "Data Smear". Here's the code...

--=============================================================================
--      CROSS APPLY "Data Smear"
--=============================================================================
--===== "Clear the guns" for the upcoming test.
     -- The code output will be directed to a TEMP Table.
   DROP TABLE IF EXISTS #Result2;
        CHECKPOINT;
   DBCC FREEPROCCACHE;
   DBCC DROPCLEANBUFFERS;
GO
--===== CROSS APPLY "Data Smear"
    SET STATISTICS TIME,IO ON
;
 SELECT 
         pid = GS.value
        ,cid = src.id
   INTO #Result2
   FROM GENERATE_SERIES(1,1000000) GS
  CROSS APPLY (SELECT TOP(1) a.id
                FROM #a a
               WHERE a.id <= GS.value
               ORDER BY id DESC) src
;
    SET STATISTICS TIME,IO OFF
;
GO
-- SELECT * FROM #Result2 ORDER BY pid
GO

The run stats on that are pretty impressive for duration but, man... the number of logical reads add up to 15,625 MB (15.6 GB) of memory I/O...

DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Table '#a__________________________________________________________________________________________________________________000000000029'. 
Scan count 1000000, logical reads 2000000, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 2860 ms,  elapsed time = 359 ms.

(1000000 rows affected)

And, here's the test code for the Windows Function "Data Smear"...

--=============================================================================
--      Window Function "Data Smear"
--=============================================================================
--===== "Clear the guns" for the upcoming test.
     -- The code output will be directed to a TEMP Table.
   DROP TABLE IF EXISTS #Result1;
        CHECKPOINT;
   DBCC FREEPROCCACHE;
   DBCC DROPCLEANBUFFERS;
GO
--===== Window Function "Data Smear"
    SET STATISTICS TIME,IO ON
;
 SELECT  pid = t.value
        ,cid = MAX(src.id) OVER (ORDER BY t.value ROWS UNBOUNDED PRECEDING)
   INTO #Result1
   FROM GENERATE_SERIES(1,1000000) t
   LEFT JOIN #a src
     ON src.id = t.value
  ORDER BY pid
;
    SET STATISTICS TIME,IO OFF
;
-- SELECT * FROM #Result1 ORDER BY pid
GO

... and here are the run stats for that... It takes ~3 times longer but ya just gotta love the reads!

DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Table '#a__________________________________________________________________________________________________________________000000000029'. 
Scan count 5, logical reads 7, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 2425 ms,  elapsed time = 938 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

(1000000 rows affected)

Now... here's the BIG difference. If you remove the Clustered Index from the #a test table, the Windows Function "Data Smear" still takes the same amount of duration and does better on CPU and better on reads, which were tiny to begin with!

Under those same conditions (no Clustered Index), the CROSS APPLY is on the miserable side for duration, most likely because the reads increased to a whopping 18,636 MB (16.6 GB)...

DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Table 'Worktable'. Scan count 1000000, logical reads 2385355, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#a__________________________________________________________________________________________________________________000000000031'. 
Scan count 1, logical reads 2, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 129704 ms,  elapsed time = 126875 ms.

(1000000 rows affected)

If you need to shave off more than a half second of duration then, at the expense of needing an index and suffering a comparatively HUGE number of reads, the CROSS APPLY method will do.

If you don't mind it taking nearly a second and love the idea of almost non-existent reads, then the Windows Function is the clear winner with our without an index.

And, yes, both went massively parallel and used all 12 logical processors on my laptop. And, just to say it out loud, they both returned the correct output.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

DamianC SSCrazy Eights Points: 8023 More actions · Answer 13

I did, thanks Jeff

Worked it both ways (with and without GENERATE_SERIES) as I have a 2022 and a 2016 instance

The suggestion by Thom works, the window amendment by drew also works

Your tally function suggestion also works - I tweaked a little bit to use variables as my actual set works using a moving start and end id

Runs in about 3 seconds which is more than good enough for my purpose

Thanks

- Damian

DamianC SSCrazy Eights Points: 8023 More actions · Answer 14

I did, thanks Jeff

Worked it both ways (with and without GENERATE_SERIES) as I have a 2022 and a 2016 instance

The suggestion by Thom works, the window amendment by drew also works

Your tally function suggestion also works - I tweaked a little bit to use variables as my actual set works using a moving start and end id

Runs in about 3 seconds which is more than good enough for my purpose

- Damian