December 23, 2015 at 7:50 am
I read an article saying that LEAD and LAG reduce the effort of writing a self join to get data from previous or subsequent records. However, behind the scenes, are they doing a self join?
Thanks!
December 23, 2015 at 7:53 am
rs80 (12/23/2015)
I read an article saying that LEAD and LAG reduce the effort of writing a self join to get data from previous or subsequent records. However, behind the scenes, are they doing a self join?
Thanks!
Not knowing what the source code looks like, it's hard to say, but it certainly isn't a self join. There are performance improvements using LEAD/LAG over the more traditional self join, so it's obviously some other mechanism.
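To make the contrast concrete, here's a minimal sketch (the #orders temp table and its columns are made up purely for illustration). The first query uses LAG; the second gets the same previous-row value the pre-2012 way, by numbering the rows and self joining row n to row n-1:

-- LAG version (SQL Server 2012+)
SELECT order_id,
       amount,
       LAG(amount) OVER (ORDER BY order_id) AS prev_amount
FROM #orders;

-- Older self-join style: number the rows, then join row n to row n-1
WITH numbered AS
(
    SELECT order_id,
           amount,
           ROW_NUMBER() OVER (ORDER BY order_id) AS rn
    FROM #orders
)
SELECT cur.order_id,
       cur.amount,
       prv.amount AS prev_amount
FROM numbered cur
LEFT JOIN numbered prv
       ON prv.rn = cur.rn - 1;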
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
December 23, 2015 at 8:07 am
Sorry, I don't have any source code. It's more of a theoretical question. What's the best way to learn how these functions work behind the scenes? I would think a SQL Server book would have more information.
December 23, 2015 at 8:19 am
rs80 (12/23/2015)
Sorry, I don't have any source code. It's more of a theoretical question. What's the best way to learn how these functions work behind the scenes? I would think a SQL Server book would have more information.
Worktables (window spools).
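If you want to see the mechanism for yourself, here's a quick sketch (the #demo table is made up): run a LEAD/LAG query with the actual execution plan turned on and you'll find a Window Spool operator rather than any join, and SET STATISTICS IO lists a "Worktable" entry for that spool (often with zero reads when the spool stays in memory).

IF OBJECT_ID('tempdb..#demo', 'U') IS NOT NULL DROP TABLE #demo;
CREATE TABLE #demo (id INT PRIMARY KEY, val INT);
INSERT #demo (id, val) VALUES (1, 10), (2, 20), (3, 30);

SET STATISTICS IO ON;   -- the window spool shows up as "Worktable" in the IO output
SELECT id,
       val,
       LAG(val)  OVER (ORDER BY id) AS PrevVal,
       LEAD(val) OVER (ORDER BY id) AS NextVal
FROM #demo;
SET STATISTICS IO OFF;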
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 23, 2015 at 8:23 am
rs80 (12/23/2015)
Sorry, I don't have any source code. It's more of a theoretical question. What's the best way to learn how these functions work behind the scenes? I would think a SQL Server book would have more information.
Of course not; the source code is part of the engine. That was my point, really. There are hundreds and hundreds of articles about LEAD/LAG. It seems like it gets the result set first and then applies the LEAD/LAG logic. That would make sense from a performance perspective and offers a benefit a self join doesn't have, because a self join has to look at the entire table instead of just the result set of the query. There are plenty of people who know more than I do about this topic, but I don't remember ever seeing anybody state "behind the scenes it is doing xxx".
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
December 23, 2015 at 8:26 am
I would read about spools, but also perhaps pose the question to Itzik Ben-Gan. He may know.
For most of us, I'm not sure it matters how this is implemented. I'd guess some sort of hash join myself, but who knows what's really going on. You might be able to read how PostgreSQL implements windowing; I thought they had this in there. My guess is there aren't a lot of algorithmic ways to do this efficiently.
December 23, 2015 at 8:27 am
I don't know what goes on behind the scenes but I do know that, whatever they did, it makes for relatively slow running totals. :crazy:
--Jeff Moden
Change is inevitable... Change for the better is not.
December 23, 2015 at 8:41 am
Jeff Moden (12/23/2015)
I don't know what goes on behind the scenes but I do know that, whatever they did, it makes for relatively slow running totals. :crazy:
Jeff...if I recall, someone had a series of scripts that compared LAG to the QU (quirky update)?
Was it you, or do you have a bookmark?
I'd like to read it again.
Thanks
edit...this is the one I was referring to:
http://blog.waynesheffield.com/wayne/archive/2011/08/running-totals-in-denali-ctp3/
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
December 23, 2015 at 10:41 am
J Livingston SQL (12/23/2015)
Jeff Moden (12/23/2015)
I don't know what goes on behind the scenes but I do know that, whatever they did, it makes for relatively slow running totals. :crazy:
Jeff...if I recall, someone had a series of scripts that compared LAG to the QU (quirky update)?
Was it you, or do you have a bookmark?
I'd like to read it again.
Thanks
edit...this is the one I was referring to:
http://blog.waynesheffield.com/wayne/archive/2011/08/running-totals-in-denali-ctp3/
That's the one I refer to, as well. I've recently executed it in SQL Server 2012 with nearly identical results.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 23, 2015 at 1:29 pm
This is just a quick test... But according to this, the Windowed SUM is going faster... see what you think...
IF OBJECT_ID('tempdb..#temp', 'U') IS NOT NULL
DROP TABLE #temp;
CREATE TABLE #temp (
Amount DECIMAL(19,0) NOT NULL PRIMARY KEY
);
;WITH
n (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)
),
Tally (n) AS (
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM n n1, n n2, n n3, n n4, n n5, n n6
)
INSERT #temp (Amount)
SELECT t.n FROM Tally t ORDER BY t.n;
--=====================================================================
DECLARE
@b1 DATETIME2(7),
@b2 DATETIME2(7),
@b3 DATETIME2(7),
@e3 DATETIME2(7),
@V1 DECIMAL(19,0),
@V2 DECIMAL(19,0)
;
SET @b1 = SYSDATETIME(); --=============================================
SELECT
@V1 = t.Amount,
@V2 = SUM(t.Amount) OVER (ORDER BY t.Amount ROWS UNBOUNDED PRECEDING)
FROM
#temp t;
SET @b2 = SYSDATETIME(); --=============================================
IF OBJECT_ID('tempdb..#RunningTotal', 'U') IS NOT NULL
DROP TABLE #RunningTotal;
SELECT
t.Amount,
RT = CAST(NULL AS DECIMAL(19,0))
INTO
#RunningTotal
FROM
#temp t;
CREATE UNIQUE NONCLUSTERED INDEX cix_RunningTotal_Amount ON #RunningTotal (Amount);
------------
SET @b3 = SYSDATETIME();
DECLARE
@Amt DECIMAL(19,0) = 0,
@rt DECIMAL(19,0) = 0;
UPDATE rt SET
@rt = rt.RT = @rt + rt.Amount -- running total assignment (note: the anchor column is missing here; see the replies below)
FROM
#RunningTotal rt WITH (TABLOCKX)
OPTION (MAXDOP 1);
SET @e3 = SYSDATETIME();
---------
SELECT
DATEDIFF(ms, @b1, @b2) AS WindowedFunction,
DATEDIFF(ms, @b2, SYSDATETIME()) AS QuirkyWithTableCreation,
DATEDIFF(ms, @b3, @e3) AS QuirkyAlone
December 23, 2015 at 5:45 pm
Jason A. Long (12/23/2015)
This is just a quick test... But according to this, the Windowed SUM is going faster... see what you think...
You've fallen into the same trap that half the world has fallen into when they test CSV splitters.
Look at your data. If you're testing to see which is faster for running totals, ask yourself if your test data actually meets the temporal requirements of what a running total actually is. Then ask yourself why the data you have would favor the Windowed SUM and how it could lead you into making the mistake of selecting the wrong method if performance were uber important. "The Devil's in the Data".
For a hint, look at the test data in Wayne's article and compare it against yours.
Also, you've not followed the rules for the Quirky Update. Do you know which big one you violated?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 23, 2015 at 8:06 pm
Jeff Moden (12/23/2015)
Jason A. Long (12/23/2015)
This is just a quick test... But according to this, the Windowed SUM is going faster... see what you think...
You've fallen into the same trap that half the world has fallen into when they test CSV splitters.
Look at your data. If you're testing to see which is faster for running totals, ask yourself if your test data actually meets the temporal requirements of what a running total actually is. Then ask yourself why the data you have would favor the Windowed SUM and how it could lead you into making the mistake of selecting the wrong method if performance were uber important. "The Devil's in the Data".
For a hint, look at the test data in Wayne's article and compare it against yours.
Also, you've not followed the rules for the Quirky Update. Do you know which big one you violated?
The ANCHOR!!! Damn it... I actually had it in there but was getting weird results (forgot the clustered index for about 5 mins). I must have pulled it out when I was troubleshooting and forgot to put it back...
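For anyone following along who hasn't worked with the quirky update before: as I understand the rules, the "anchor" is just an extra variable assignment on the clustered index key, sitting alongside the running total assignment. A generic sketch (all names below are made up; the table and columns are only illustrative):

DECLARE @RunningTotal DECIMAL(19,0) = 0,
        @Anchor       INT;
UPDATE rt SET
    @RunningTotal = rt.RunningTotal = @RunningTotal + rt.Amount,
    @Anchor       = rt.ClusteredKeyColumn   -- the "anchor": assign the clustered key to a variable
FROM dbo.SomeTable rt WITH (TABLOCKX)
OPTION (MAXDOP 1);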
Giving the referenced article a 'quick-ish' look...I do see that all of the tests involve performing an actual update to the dbo.TransactionDetail table... Providing (IMO) an advantage to the quirky update which doesn't work without updating a table...
So... I suppose that brings us into "proposed use" territory... Are we wanting to calculate the running total on the fly with a SELECT query or persist the running totals on the table itself?
If we want to persist the data, the quirky method has the advantage (assuming that the table already has the correct clustered index), but if we simply want to execute a select query, we have to add the cost of selecting the base data into a temp table and creating the clustered index...
With the Windowed SUM, the opposite is true... Persisting the data requires at least 1 index update that it doesn't need to do if it's only doing a select... In either case, one needs to be slowed down to achieve the results of the other.
If I have time over the weekend, I'll do some more testing using your harness... Both "calculated on the fly" and "persisted".
That said, I still have a tough time shrugging off Itzik's warning... Ordered UPDATE and Set-Based Solutions to Running Aggregates
I know some put more stock in the "documented vs undocumented behavior" argument, than others... So I'll just call it a tie breaker... (maybe even a thumb on the scale if it's close)
December 24, 2015 at 2:13 pm
It's not just the anchor. You can't use a Non-Clustered Index for the QU because of the possibility of a "Merry-go-Round" index.
Also, your data is sorted on the Amt column... for running totals, there must be a temporal column such as a transaction date/time. The data is further flawed because it contains no duplicate amounts such as real life would have it.
The key to any testing is to make the test data as lifelike as possible. Anyone can come up with a special case where one thing beats another but those special cases aren't what's likely to happen in real life and it's real life that the code will need to handle.
The data you created isn't real life. The data that Wayne tested in his article is, and it shows the SUM() Windowing function getting its head handed back to it by the QU.
Yes, there are applications for SUM() OVER that are real life and will beat the QU, but not for real life running totals.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 24, 2015 at 2:59 pm
Jeff Moden (12/24/2015)
It's not just the anchor. You can't use a Non-Clustered Index for the QU because of the possibility of a "Merry-go-Round" index.
Yikes... That's what I get for eating paint chips in the morning... :Whistling:
Jeff Moden (12/24/2015)
Also, your data is sorted on the Amt column... for running totals, there must be a temporal column such as a transaction date/time. The data is further flawed because it contains no duplicate amounts such as real life would have it.
The key to any testing is to make the test data as lifelike as possible. Anyone can come up with a special case where one thing beats another but those special cases aren't what's likely to happen in real life and it's real life that the code will need to handle.
The data you created isn't real life. The data that Wayne tested in his article is, and it shows the SUM() Windowing function getting its head handed back to it by the QU.
Yes, there are applications for SUM() OVER that are real life and will beat the QU, but not for real life running totals.
How about this one?
USE TempDB
GO
IF OBJECT_ID('TempDB.dbo.TransactionDetail') IS NOT NULL
DROP TABLE TempDB.dbo.TransactionDetail
GO
CREATE TABLE dbo.TransactionDetail
(
TransactionDetailID INT IDENTITY(1,1), --Temporal "tie-breaker"
Date DATETIME,
AccountID INT,
Amount MONEY,
-- Removed for the "SELECT ONLY" test
-- AccountRunningTotal MONEY, --Running total across each account
-- AccountRunningCount INT, --Like "Rank" across each account
NCID INT, --For "proof" later in the article
CONSTRAINT PK_TransactionDetail_TransactionDetailID
PRIMARY KEY NONCLUSTERED (TransactionDetailID)
WITH FILLFACTOR = 100
)
GO
CREATE CLUSTERED INDEX IXC_Transaction_AccountID_Date_TransactionDetailID
ON dbo.TransactionDetail (AccountID, Date, TransactionDetailID)
CREATE NONCLUSTERED INDEX IX_Transaction_NCID
ON dbo.TransactionDetail (NCID DESC)
GO
SET NOCOUNT ON
WHILE (ISNULL(IDENT_CURRENT('TransactionDetail'),0)) < 1000000
BEGIN
INSERT INTO dbo.TransactionDetail
(Date, AccountID, Amount)
SELECT TOP 10000
--10 years worth of dates with times from 1/1/2000 to 12/31/2009
CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME) AS Date,
--100 different account numbers
ABS(CHECKSUM(NEWID()))%100+1,
--Dollar amounts from -99.99 to + 99.99
CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY)
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2
END
--===== Update the NCID column to be the reverse of the TransactionDetailID column
UPDATE dbo.TransactionDetail
SET NCID = 1000000 - TransactionDetailID + 1
GO
--SELECT * FROM dbo.TransactionDetail td
--==================================================================================
--==================================================================================
DECLARE
@b1 DATETIME2(7),
@b2 DATETIME2(7),
@b3 DATETIME2(7)
;
SET @b1 = SYSDATETIME();
SELECT
td.TransactionDetailID,
td.Date,
td.AccountID,
td.Amount,
AccountRunningTotal = SUM(td.Amount) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING),
AccountRunningCount = COUNT(1) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING),
td.NCID
FROM
dbo.TransactionDetail td
SET @b2 = SYSDATETIME();
IF OBJECT_ID('tempdb..#RunningTotals', 'U') IS NOT NULL
DROP TABLE #RunningTotals;
SELECT
td.TransactionDetailID,
td.[Date],
td.AccountID,
td.Amount,
AccountRunningTotal = CAST(NULL AS MONEY),
AccountRunningCount = CAST(NULL AS INT),
td.NCID
INTO #RunningTotals
FROM
dbo.TransactionDetail td;
CREATE UNIQUE CLUSTERED INDEX cix_RunningTotals_AccountID_Date_TransactionDetailID
ON #RunningTotals (AccountID, [Date], TransactionDetailID);
SET @b3 = SYSDATETIME();
DECLARE @PrevAccountID INT
DECLARE @AccountRunningTotal MONEY
DECLARE @AccountRunningCount INT;
UPDATE rt
SET @AccountRunningTotal = rt.AccountRunningTotal = CASE
WHEN rt.AccountID = @PrevAccountID
THEN @AccountRunningTotal + rt.Amount
ELSE rt.Amount
END,
@AccountRunningCount = rt.AccountRunningCount = CASE
WHEN rt.AccountID = @PrevAccountID
THEN @AccountRunningCount + 1
ELSE 1
END,
@PrevAccountID = rt.AccountID
FROM #RunningTotals rt WITH (TABLOCKX)
OPTION (MAXDOP 1);
SELECT
rt.TransactionDetailID,
rt.Date,
rt.AccountID,
rt.Amount,
rt.AccountRunningTotal,
rt.AccountRunningCount,
rt.NCID
FROM
#RunningTotals rt
ORDER BY
rt.AccountID,
rt.[Date],
rt.TransactionDetailID;
SELECT
DATEDIFF(ms, @b1, @b2) AS WindowedSUM,
DATEDIFF(ms, @b2, SYSDATETIME()) AS QuirkyWTemp,
DATEDIFF(ms, @b3, SYSDATETIME()) AS QuirkyAlone;
The above allows all rows to render in the display... I tested a version that dumps the values into variables
DECLARE
@b1 DATETIME2(7),
@b2 DATETIME2(7),
@b3 DATETIME2(7),
@rt MONEY,
@rc int
;
SET @b1 = SYSDATETIME();
SELECT
@rt = SUM(td.Amount) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING),
@rc = COUNT(1) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING)
FROM
dbo.TransactionDetail td
SET @b2 = SYSDATETIME();
IF OBJECT_ID('tempdb..#RunningTotals', 'U') IS NOT NULL
DROP TABLE #RunningTotals;
SELECT
td.TransactionDetailID,
td.[Date],
td.AccountID,
td.Amount,
AccountRunningTotal = CAST(NULL AS MONEY),
AccountRunningCount = CAST(NULL AS INT),
td.NCID
INTO #RunningTotals
FROM
dbo.TransactionDetail td;
CREATE UNIQUE CLUSTERED INDEX cix_RunningTotals_AccountID_Date_TransactionDetailID
ON #RunningTotals (AccountID, [Date], TransactionDetailID);
SET @b3 = SYSDATETIME();
DECLARE @PrevAccountID INT
DECLARE @AccountRunningTotal MONEY
DECLARE @AccountRunningCount INT;
UPDATE rt
SET @AccountRunningTotal = rt.AccountRunningTotal = CASE
WHEN rt.AccountID = @PrevAccountID
THEN @AccountRunningTotal + rt.Amount
ELSE rt.Amount
END,
@AccountRunningCount = rt.AccountRunningCount = CASE
WHEN rt.AccountID = @PrevAccountID
THEN @AccountRunningCount + 1
ELSE 1
END,
@PrevAccountID = rt.AccountID
FROM #RunningTotals rt WITH (TABLOCKX)
OPTION (MAXDOP 1);
SELECT
DATEDIFF(ms, @b1, @b2) AS WindowedSUM,
DATEDIFF(ms, @b2, SYSDATETIME()) AS QuirkyWTemp,
DATEDIFF(ms, @b3, SYSDATETIME()) AS QuirkyAlone;
Here are my results... [@@Version = Microsoft SQL Server 2014 - 12.0.4100.1 (X64) Apr 20 2015 17:29:27 Copyright (c) Microsoft Corporation Developer Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)]
-- with display rendering --
WindowedSUM   QuirkyWTemp   QuirkyAlone
4408          5662          5192
4868          5553          5132
4322          5631          5195
4321          5710          5257
4212          5678          5241
-- without display rendering --
WindowedSUM   QuirkyWTemp   QuirkyAlone
842           1358          905
843           1326          889
843           1357          905
843           1294          873
858           1342          889
December 24, 2015 at 4:24 pm
Jason A. Long (12/24/2015)
How about this one?
Now you're cooking with gas. MUCH better data and test!... and, I stand corrected.
1. If all you want to do is display the data, SUM() OVER wins.
2. If you want to create a separate table with the running totals, SUM() OVER wins.
3. If you want to do an in-place update on the original table, Quirky Update wins. This one is what Wayne's test did.
I used your code for the 3 tests above.
For #1, I used your test as is.
For #2, I added an "INTO" to the SUM() OVER code and commented out the final SELECT for the QU (roughly as sketched at the end of this post).
For #3, I used Wayne's code.
Both Wayne and I skipped testing for #1 and #2... a mistake for both of us. :blush:
Well done, Jason! I learned something new today! :w00t: You should write an article on this.
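For reference, here's roughly what that "#2" change looks like (the temp table name below is made up; the window definitions are lifted from Jason's script):

SELECT
    td.TransactionDetailID,
    td.[Date],
    td.AccountID,
    td.Amount,
    AccountRunningTotal = SUM(td.Amount) OVER (PARTITION BY td.AccountID ORDER BY td.[Date], td.TransactionDetailID ROWS UNBOUNDED PRECEDING),
    AccountRunningCount = COUNT(1) OVER (PARTITION BY td.AccountID ORDER BY td.[Date], td.TransactionDetailID ROWS UNBOUNDED PRECEDING),
    td.NCID
INTO #RunningTotals_Windowed    -- the SELECT itself creates and populates the table
FROM dbo.TransactionDetail td;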
--Jeff Moden
Change is inevitable... Change for the better is not.