February 9, 2023 at 2:18 pm
Hello
I have a table, a (sample only shown)
This has a set of ids in it
CREATE TABLE a (id int)
INSERT INTO a (id) VALUES (1, 4, 8, 15)
I'm trying to create a set (table) that holds the id in column 1 (call it pid) and links all the integers for this value to the next in column 2 (call it cid)
This is so I can join back to a parent id
NOTE: this is a table I have inherited that cannot be changed so it doesn't hold a parent id
So this would result in:
pid cid
1 1
1 2
1 3
4 4
4 5
4 6
4 7
8 8
9 8
10 8
11 8
12 8
13 8
14 8
Any thoughts
Thanks
- Damian
February 9, 2023 at 2:54 pm
So where do these linked integers come from? Why, for example, is
1
linked to1
,2
, and3
, but the numbers8
to14
only linked to8
?
Apparently, the PID is the highest CID that's <= to generated CID. Think "last value not greater than the current value).
--Jeff Moden
Change is inevitable... Change for the better is not.
February 9, 2023 at 3:05 pm
Thom A wrote:So where do these linked integers come from? Why, for example, is
1
linked to1
,2
, and3
, but the numbers8
to14
only linked to8
?Apparently, the PID is the highest CID that's <= to generated CID. Think "last value not greater than the current value).
That makes more sense; I was reading it the other way around.
If that's the case, the OP could make use of GENERATE_SERIES
as (based on your recent findings, Jeff), it's performant for a single series, which the OP seems to be using here:
CREATE TABLE a (id int)
GO
INSERT INTO a (id) VALUES (1, 4, 8, 15)--This doesn't work, you've supplied 4 columns, not 4 rows
GO
INSERT INTO a (id)
VALUES (1),(4),(8),(15); --This is 4 rows
GO
SELECT a.id,
GS.value
FROM GENERATE_SERIES(1,14) GS
CROSS APPLY (SELECT TOP(1) a.id
FROM dbo.a
WHERE a.id <= GS.value
ORDER BY id DESC) a;
GO
DROP TABLE dbo.a;
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
February 9, 2023 at 3:14 pm
If that's the case, the OP could make use of
GENERATE_SERIES
as (based on your recent findings, Jeff), it's performant for a single series, which the OP seems to be using here:
Just to be clear on that, I said that GENERATE_SERIES() is fine for LARGE single series. 😀 It's probably also fine for this smaller series because this looks like a "low hit rate" solution. If it turns out to have a high hit rate, I'd personally use something like the fnTally function or Itzik Ben-Gan's GetNums function. That's just me, though. And thanks for the reminder that I should probably publish the testing I did for that.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 9, 2023 at 3:20 pm
And thanks for the reminder that I should probably publish the testing I did for that.
In truth, I need to do my own testing on it as well. I did do some during one of the early CTP releases, and it (GENERATE_SERIES
) was awful. Some of that was fixed later, when they removed the sort operation and changed it to a table value function, but I haven't done any testing since the release. I've been waiting for CU1 instead, as the early CUs tend to have more than a few teething issues addressed and I wanted to give it as fair a test as I could after my initial results.
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
February 9, 2023 at 3:34 pm
perfect, thanks
Yes, I'd defined VALUES incorrectly - typing rather than cut and paste
- Damian
February 9, 2023 at 3:37 pm
The " ORDER BY id DESC " portion on your solution is unnecessary. I don't think it help in any way, not even performance. Am I wrong?
If you want the output in a certain order I would put an "order by" clause on the outer select
CREATE TABLE a (id int)
GO
INSERT INTO a (id)
VALUES (1),(4),(8),(15); --This is 4 rows
GO
SELECT a.id,
GS.value
FROM GENERATE_SERIES(1,14) GS
CROSS APPLY (SELECT TOP(1) id
FROM dbo.a
WHERE a.id <= GS.value
) a
Order by 1, 2
DROP TABLE dbo.a;
on the outer select statement.
February 9, 2023 at 3:37 pm
Jeff Moden wrote:And thanks for the reminder that I should probably publish the testing I did for that.
In truth, I need to do my own testing on it as well. I did do some during one of the early CTP releases, and it (
GENERATE_SERIES
) was awful. Some of that was fixed later, when they removed the sort operation and changed it to a table value function, but I haven't done any testing since the release. I've been waiting for CU1 instead, as the early CUs tend to have more than a few teething issues addressed and I wanted to give it as fair a test as I could after my initial results.
You're in good company... Erik Darling reported the same. I also don't load any pre-RTM releases because of similar issues and I normally don't load RTMs. In fact, I normally wait until about (in the old days) SP-1 or (now adays) CU-3. I learned my lessons all to well with 2005 and also witnessed the regression travesties of 2012 until they finally fixed all that in SP-3.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 9, 2023 at 3:42 pm
The " ORDER BY id DESC " portion on your solution is unnecessary. I don't think it help in any way, not even performance. Am I wrong?
No, it's certainly nessecary, otherwise you get an arbitrary TOP (1)
row, which could be any value of id
with a value lower than or equal to the value returned from GENERATE_SERIES
; it's not just necessary, it's mandatory for consistent and reliable results.
Also, as a side note, I (and many others) are not a fan of the ORDER BY <ordinal position>
syntax. Aaron Bertrand wrote a good article many years ago on the subject: Bad Habits to Kick : ORDER BY ordinal
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
February 9, 2023 at 8:08 pm
perfect, thanks
Yes, I'd defined VALUES incorrectly - typing rather than cut and paste
Just to be sure, are you all set now?
--Jeff Moden
Change is inevitable... Change for the better is not.
February 9, 2023 at 8:36 pm
Jeff Moden wrote:Thom A wrote:So where do these linked integers come from? Why, for example, is
1
linked to1
,2
, and3
, but the numbers8
to14
only linked to8
?Apparently, the PID is the highest CID that's <= to generated CID. Think "last value not greater than the current value).
That makes more sense; I was reading it the other way around.
If that's the case, the OP could make use of
GENERATE_SERIES
as (based on your recent findings, Jeff), it's performant for a single series, which the OP seems to be using here:CREATE TABLE a (id int)
GO
INSERT INTO a (id) VALUES (1, 4, 8, 15)--This doesn't work, you've supplied 4 columns, not 4 rows
GO
INSERT INTO a (id)
VALUES (1),(4),(8),(15); --This is 4 rows
GO
SELECT a.id,
GS.value
FROM GENERATE_SERIES(1,14) GS
CROSS APPLY (SELECT TOP(1) a.id
FROM dbo.a
WHERE a.id <= GS.value
ORDER BY id DESC) a;
GO
DROP TABLE dbo.a;
I don't have access to SQL 2022, but I would think that a windowed function would perform much better than the CROSS APPLY
.
SELECT MAX(a.id) OVER(ORDER BY GS.value ROWS UNBOUNDED PRECEDING) AS id,
GS.value
FROM GENERATE_SERIES(1,14) GS
LEFT OUTER JOIN #a AS a
ON gS.value = a.id
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 10, 2023 at 12:15 am
I don't have access to SQL 2022, but I would think that a windowed function would perform much better than the CROSS APPLY.
Really good point on using a Windowing MAX function to do this "Data Smear" and I love it a lot because of its simplicity and elegant nature (and I've seen Drew work a miracle with similar code in a much longer thread). You know me though, if someone makes an insitu guess at performance, the Dust Bunnies want me to prove things one way or the other. 😀
If you want to follow along but don't have 2022 to test with, something like fnTally will do the trick here and, so, for others that may also not have 2022, you can get a copy of the fnTally function from the following URL. It is NOT a direct replacement for GENERATE_SERIES() because it (fnTally) can only start an 0 or 1 and has no "Increment" parameter. For me, neither is needed about 99.9999% of the time.
https://www.sqlservercentral.com/scripts/create-a-tally-function-fntally
Once you have fnTally installed in your test database, just replace "GENERATE_SERIES" with "fnTally" and "t.value" with "t.N" everywhere in the following code. Since this IS a 2022 forum, we'll continue to use "GENERATE_SERIES". 😉
The following code generates a much more substantial test of 1000 random INTs (except for "1", which is always present) whose values vary from 1 to 1 MILLION. The code also guarantees no dupes so you might get slightly fewer than a total of 1000 unique random values. It also uses a TEMP Table in TempDB instead of a "real" table.
--=============================================================================
-- Create and populate the test table with ~1000 unique random values
-- ranging in value from "1" to "1000000".
--=============================================================================
--===== Conditionally drop and create the test table
DROP TABLE IF EXISTS #a;
GO
CREATE TABLE #a
(id INT PRIMARY KEY CLUSTERED) --Notice the index here!!!
;
GO
--===== Populate the table with the test values.
INSERT INTO #a WITH (TABLOCK) --Required for "Minimal Logging" when CI present.
(id)
SELECT id = 1
UNION
SELECT id = ABS(CHECKSUM(NEWID())%1000000)+1
FROM GENERATE_SERIES(1,999) --One has already been assigned
ORDER BY id --Also sometimes needed for "Minimal Logging".
;
GO
The first contender is the CROSS APPLY "Data Smear". Here's the code...
--=============================================================================
-- CROSS APPLY "Data Smear"
--=============================================================================
--===== "Clear the guns" for the upcoming test.
-- The code output will be directed to a TEMP Table.
DROP TABLE IF EXISTS #Result2;
CHECKPOINT;
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
GO
--===== CROSS APPLY "Data Smear"
SET STATISTICS TIME,IO ON
;
SELECT
pid = GS.value
,cid = src.id
INTO #Result2
FROM GENERATE_SERIES(1,1000000) GS
CROSS APPLY (SELECT TOP(1) a.id
FROM #a a
WHERE a.id <= GS.value
ORDER BY id DESC) src
;
SET STATISTICS TIME,IO OFF
;
GO
-- SELECT * FROM #Result2 ORDER BY pid
GO
The run stats on that are pretty impressive for duration but, man... the number of logical reads add up to 15,625 MB (15.6 GB) of memory I/O...
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Table '#a__________________________________________________________________________________________________________________000000000029'.
Scan count 1000000, logical reads 2000000, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times:
CPU time = 2860 ms, elapsed time = 359 ms.
(1000000 rows affected)
And, here's the test code for the Windows Function "Data Smear"...
--=============================================================================
-- Window Function "Data Smear"
--=============================================================================
--===== "Clear the guns" for the upcoming test.
-- The code output will be directed to a TEMP Table.
DROP TABLE IF EXISTS #Result1;
CHECKPOINT;
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
GO
--===== Window Function "Data Smear"
SET STATISTICS TIME,IO ON
;
SELECT pid = t.value
,cid = MAX(src.id) OVER (ORDER BY t.value ROWS UNBOUNDED PRECEDING)
INTO #Result1
FROM GENERATE_SERIES(1,1000000) t
LEFT JOIN #a src
ON src.id = t.value
ORDER BY pid
;
SET STATISTICS TIME,IO OFF
;
-- SELECT * FROM #Result1 ORDER BY pid
GO
... and here are the run stats for that... It takes ~3 times longer but ya just gotta love the reads!
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Table '#a__________________________________________________________________________________________________________________000000000029'.
Scan count 5, logical reads 7, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times:
CPU time = 2425 ms, elapsed time = 938 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
(1000000 rows affected)
Now... here's the BIG difference. If you remove the Clustered Index from the #a test table, the Windows Function "Data Smear" still takes the same amount of duration and does better on CPU and better on reads, which were tiny to begin with!
Under those same conditions (no Clustered Index), the CROSS APPLY is on the miserable side for duration, most likely because the reads increased to a whopping 18,636 MB (16.6 GB)...
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Table 'Worktable'. Scan count 1000000, logical reads 2385355, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#a__________________________________________________________________________________________________________________000000000031'.
Scan count 1, logical reads 2, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times:
CPU time = 129704 ms, elapsed time = 126875 ms.
(1000000 rows affected)
If you need to shave off more than a half second of duration then, at the expense of needing an index and suffering a comparatively HUGE number of reads, the CROSS APPLY method will do.
If you don't mind it taking nearly a second and love the idea of almost non-existent reads, then the Windows Function is the clear winner with our without an index.
And, yes, both went massively parallel and used all 12 logical processors on my laptop. And, just to say it out loud, they both returned the correct output.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 10, 2023 at 9:28 am
I did, thanks Jeff
Worked it both ways (with and without GENERATE_SERIES) as I have a 2022 and a 2016 instance
The suggestion by Thom works, the window amendment by drew also works
Your tally function suggestion also works - I tweaked a little bit to use variables as my actual set works using a moving start and end id
Runs in about 3 seconds which is more than good enough for my purpose
Thanks
- Damian
February 10, 2023 at 9:30 am
I did, thanks Jeff
Worked it both ways (with and without GENERATE_SERIES) as I have a 2022 and a 2016 instance
The suggestion by Thom works, the window amendment by drew also works
Your tally function suggestion also works - I tweaked a little bit to use variables as my actual set works using a moving start and end id
Runs in about 3 seconds which is more than good enough for my purpose
- Damian
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply