LEAD and LAG behind the scenes

  • I read an article saying that LEAD and LAG reduce the effort of writing a self join to get data from previous or subsequent records. However, behind the scenes, is it doing a self join?

    Thanks!

  • rs80 (12/23/2015)


    I read an article saying that LEAD and LAG reduce the effort of writing a self join to get data from previous or subsequent records. However, behind the scenes, is it doing a self join?

    Thanks!

    Not knowing what the source code looks like, it's hard to say, but it certainly isn't a self join. There are performance improvements using LEAD/LAG over the more traditional self join, so it is obviously some other mechanism.
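    For what it's worth, the equivalence of the two formulations (and why LAG is the easier one to write) is visible in any engine with window functions. Here's a minimal sketch using SQLite through Python rather than SQL Server, with a made-up table t; it says nothing about SQL Server's internal mechanism, only that the results match:

```python
# Sketch (SQLite via Python, not SQL Server): LAG() and a hand-written
# self join return the same rows, even though the engine is free to
# implement the window function without any join at all.
# Requires SQLite 3.25+ for window functions.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, amount INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 25), (3, 40)])

# Window-function form: one pass over the ordered rows.
lag_rows = con.execute(
    "SELECT id, amount, LAG(amount) OVER (ORDER BY id) FROM t ORDER BY id"
).fetchall()

# Self-join form: relies on the ids being gapless, which is exactly the
# fragility that LAG() avoids.
join_rows = con.execute(
    """SELECT cur.id, cur.amount, prev.amount
         FROM t cur
         LEFT JOIN t prev ON prev.id = cur.id - 1
        ORDER BY cur.id"""
).fetchall()

print(lag_rows)   # [(1, 10, None), (2, 25, 10), (3, 40, 25)]
```

    Note that the join version breaks the moment ids have gaps (a WHERE clause, a deleted row), while the window version keeps working; that alone is a good reason to prefer it regardless of what the engine does internally.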

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sorry, I don't have any source code; it's more of a theoretical question. What's the best way to learn how these functions work behind the scenes? I would think a SQL Server book would have more information.

  • rs80 (12/23/2015)


    Sorry, I don't have any source code; it's more of a theoretical question. What's the best way to learn how these functions work behind the scenes? I would think a SQL Server book would have more information.

    They're implemented with worktables (window spools).

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • rs80 (12/23/2015)


    Sorry, I don't have any source code; it's more of a theoretical question. What's the best way to learn how these functions work behind the scenes? I would think a SQL Server book would have more information.

    Of course not; the source code is part of the engine. That was my point, really. There are hundreds and hundreds of articles about LEAD/LAG. It seems like it gets the result set first and then applies the LEAD/LAG logic. That would make sense from a performance perspective, and it offers benefits that a self join does not have, because a self join has to look at the entire table instead of the result set of the query. There are plenty of people who know more than I do about this topic, but I don't remember seeing anybody anywhere state that "behind the scenes it is doing xxx".
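    That intuition (the window function is applied to the query's result set, not the base table) is easy to check directly. A sketch using SQLite through Python rather than SQL Server; the table and the filter are made up for illustration:

```python
# Sketch (SQLite via Python, not SQL Server): LAG() sees the rows that
# survive the WHERE clause, not the rows in the base table.
# Requires SQLite 3.25+ for window functions.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (4,)])

rows = con.execute(
    "SELECT id, LAG(id) OVER (ORDER BY id) FROM t WHERE id <> 2 ORDER BY id"
).fetchall()

# id 3's "previous" row is 1, not 2: the WHERE clause ran before the
# window function was evaluated.
print(rows)   # [(1, None), (3, 1), (4, 3)]
```

    A self join against the base table would have paired 3 with 2, so the two approaches are not even logically interchangeable once a filter is involved.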


  • I would read about spools, but also perhaps pose the question to Itzik Ben-Gan. He may know.

    For most of us, I'm not sure it matters how this is implemented. I'd guess some sort of hash join myself, but who knows what's really going on. You might be able to read how PostgreSQL does their implementation of windowing. I thought they had this in there. My guess is there aren't a lot of algorithmic ways to do this efficiently.
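    On the algorithmic point: a running aggregate computed as a streaming scan is O(n), while the triangular self-join formulation recomputes every prefix and is O(n^2). A toy sketch in plain Python of the two shapes (illustrative only, not how any engine is implemented):

```python
# Streaming form: carry the total forward, one pass over the data, O(n).
def running_total_stream(xs):
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

# Triangular (self-join-like) form: re-sum every prefix from scratch,
# O(n^2) work for the same answer.
def running_total_triangular(xs):
    return [sum(xs[:i + 1]) for i in range(len(xs))]

print(running_total_stream([10, 25, 40]))  # [10, 35, 75]
```

    That gap is why there really aren't many efficient algorithmic choices here: any fast implementation has to look like the streaming form.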

  • I don't know what goes on behind the scenes but I do know that, whatever they did, it makes for relatively slow running totals. :crazy:
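    For reference, the running-total form under discussion is SUM() OVER an ordered ROWS frame. A small cross-check of what that frame computes, using SQLite through Python rather than SQL Server, against itertools.accumulate:

```python
# Sketch (SQLite via Python, not SQL Server): SUM() OVER with
# ROWS UNBOUNDED PRECEDING is a running total, i.e. a prefix sum.
# Requires SQLite 3.25+ for window functions.
import sqlite3
from itertools import accumulate

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, amount INTEGER)")
amounts = [10, 25, 40, 5]
con.executemany("INSERT INTO t (amount) VALUES (?)", [(a,) for a in amounts])

# Each row sums everything from the start of the frame through itself.
rt = [row[0] for row in con.execute(
    "SELECT SUM(amount) OVER (ORDER BY id ROWS UNBOUNDED PRECEDING)"
    "  FROM t ORDER BY id"
)]
print(rt)   # [10, 35, 75, 80]
```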

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/23/2015)


    I don't know what goes on behind the scenes but I do know that, whatever they did, it makes for relatively slow running totals. :crazy:

    Jeff...if I recall, someone had a series of scripts that compared LAG to the QU (Quirky Update)?

    was it you or do you have a bookmark?

    would like to read it again.

    thanks

    edit...this is the one I was referring to:

    http://blog.waynesheffield.com/wayne/archive/2011/08/running-totals-in-denali-ctp3/

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • J Livingston SQL (12/23/2015)


    Jeff Moden (12/23/2015)


    I don't know what goes on behind the scenes but I do know that, whatever they did, it makes for relatively slow running totals. :crazy:

    Jeff...if I recall, someone had a series of scripts that compared LAG to the QU (Quirky Update)?

    was it you or do you have a bookmark?

    would like to read it again.

    thanks

    edit...this is the one I was referring to:

    http://blog.waynesheffield.com/wayne/archive/2011/08/running-totals-in-denali-ctp3/

    That's the one I refer to, as well. I've recently executed it in SQL Server 2012 with nearly identical results.

    --Jeff Moden



  • This is just a quick test... But according to this, the Windowed SUM is going faster... see what you think...

    IF OBJECT_ID('tempdb..#temp', 'U') IS NOT NULL
        DROP TABLE #temp;

    CREATE TABLE #temp (
        Amount DECIMAL(19,0) NOT NULL PRIMARY KEY
    );

    WITH
    n (n) AS (
        SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)
    ),
    Tally (n) AS (
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM n n1, n n2, n n3, n n4, n n5, n n6
    )
    INSERT #temp (Amount)
    SELECT t.n FROM Tally t ORDER BY t.n;

    --=====================================================================
    DECLARE
        @b1 DATETIME2(7),
        @b2 DATETIME2(7),
        @b3 DATETIME2(7),
        @e3 DATETIME2(7),
        @V1 DECIMAL(19,0),
        @V2 DECIMAL(19,0);

    SET @b1 = SYSDATETIME(); --=============================================
    SELECT
        @V1 = t.Amount,
        @V2 = SUM(t.Amount) OVER (ORDER BY t.Amount ROWS UNBOUNDED PRECEDING)
    FROM #temp t;

    SET @b2 = SYSDATETIME(); --=============================================
    IF OBJECT_ID('tempdb..#RunningTotal', 'U') IS NOT NULL
        DROP TABLE #RunningTotal;

    SELECT
        t.Amount,
        RT = CAST(NULL AS DECIMAL(19,0))
    INTO #RunningTotal
    FROM #temp t;

    CREATE UNIQUE NONCLUSTERED INDEX cix_RunningTotal_Amount ON #RunningTotal (Amount);
    ------------
    SET @b3 = SYSDATETIME();
    DECLARE
        @Amt DECIMAL(19,0) = 0,
        @rt DECIMAL(19,0) = 0;

    UPDATE rt SET
        @rt = rt.RT = @rt + rt.Amount
    FROM #RunningTotal rt WITH (TABLOCKX)
    OPTION (MAXDOP 1);

    SET @e3 = SYSDATETIME();
    ---------
    SELECT
        DATEDIFF(ms, @b1, @b2) AS WindowedFunction,
        DATEDIFF(ms, @b2, SYSDATETIME()) AS QuirkyWithTableCreation,
        DATEDIFF(ms, @b3, @e3) AS QuirkyAlone;

  • Jason A. Long (12/23/2015)


    This is just a quick test... But according to this, the Windowed SUM is going faster... see what you think...

    You've fallen into the same trap that half the world has fallen into when they test CSV splitters.

    Look at your data. If you're testing to see which is faster for running totals, ask yourself if your test data actually meets the temporal requirements of what a running total actually is. Then ask yourself why the data you have would favor the Windowed SUM and how it could lead you into making the mistake of selecting the wrong method if performance were uber important. "The Devil's in the Data".

    For a hint, look at the test data in Wayne's article and compare it against yours.

    Also, you've not followed the rules for the Quirky Update. Do you know which big one you violated?

    --Jeff Moden



  • Jeff Moden (12/23/2015)


    Jason A. Long (12/23/2015)


    This is just a quick test... But according to this, the Windowed SUM is going faster... see what you think...

    You've fallen into the same trap that half the world has fallen into when they test CSV splitters.

    Look at your data. If you're testing to see which is faster for running totals, ask yourself if your test data actually meets the temporal requirements of what a running total actually is. Then ask yourself why the data you have would favor the Windowed SUM and how it could lead you into making the mistake of selecting the wrong method if performance were uber important. "The Devil's in the Data".

    For a hint, look at the test data in Wayne's article and compare it against yours.

    Also, you've not followed the rules for the Quirky Update. Do you know which big one you violated?

    The ANCHOR!!! Damn it... I actually had it in there but was getting weird results (I forgot the clustered index for about 5 minutes). I must have pulled it out while troubleshooting and forgot to put it back...

    Giving the referenced article a quick-ish look, I do see that all of the tests involve performing an actual update to the dbo.TransactionDetail table, which (IMO) gives an advantage to the quirky update, since it doesn't work without updating a table...

    So... I suppose that brings us into "proposed use" territory... Are we wanting to calculate the running total on the fly with a SELECT query or persist the running totals on the table itself?

    If we want to persist the data, the quirky method has the advantage (assuming that the table already has the correct clustered index), but if we simply want to execute a select query, we have to add the cost of selecting the base data into a temp table and creating the clustered index...

    With the Windowed SUM, the opposite is true... Persisting the data requires at least 1 index update that it doesn't need to do if it's only doing a select... In either case, one needs to be slowed down to achieve the results of the other.

    If I have time over the weekend, I'll do some more testing using your harness... Both "calculated on the fly" and "persisted".

    That said, I still have a tough time shrugging off Itzik's warning... Ordered UPDATE and Set-Based Solutions to Running Aggregates

    I know some put more stock in the "documented vs undocumented behavior" argument, than others... So I'll just call it a tie breaker... (maybe even a thumb on the scale if it's close)

  • It's not just the anchor. You can't use a Non-Clustered Index for the QU because of the possibility of a "Merry-go-Round" index.

    Also, your data is sorted on the Amount column. For running totals, there must be a temporal column such as a transaction date/time. The data is further flawed because it contains no duplicate amounts, as real-life data would have.

    The key to any testing is to make the test data as lifelike as possible. Anyone can come up with a special case where one thing beats another but those special cases aren't what's likely to happen in real life and it's real life that the code will need to handle.

    The data you created isn't real life. The data that Wayne tested in his article is, and it shows the SUM() windowing function getting its head handed back to it by the QU.

    Yes, there are applications for SUM() OVER that are real life and will beat the QU, but not for real life running totals.

    --Jeff Moden



  • Jeff Moden (12/24/2015)


    It's not just the anchor. You can't use a Non-Clustered Index for the QU because of the possibility of a "Merry-go-Round" index.

    Yikes... That's what I get for eating paint chips in the morning... :Whistling:

    Jeff Moden (12/24/2015)


    Also, your data is sorted on the Amt column... for running totals, there must be a temporal column such as a transaction date/time. The data is further flawed because it contains no duplicate amounts such as real life would have it.

    The key to any testing is to make the test data as lifelike as possible. Anyone can come up with a special case where one thing beats another but those special cases aren't what's likely to happen in real life and it's real life that the code will need to handle.

    The data you created isn't real life. The data that Wayne tested in his article is and it shows the SUM() Windowing function getting it's head handed back to it by the QU.

    Yes, there are applications for SUM() OVER that are real life and will beat the QU, but not for real life running totals.

    How about this one?

    USE TempDB
    GO
    IF OBJECT_ID('TempDB.dbo.TransactionDetail') IS NOT NULL
        DROP TABLE TempDB.dbo.TransactionDetail
    GO
    CREATE TABLE dbo.TransactionDetail
    (
        TransactionDetailID INT IDENTITY(1,1), --Temporal "tie-breaker"
        Date DATETIME,
        AccountID INT,
        Amount MONEY,
        -- Removed for the "SELECT ONLY" test
        -- AccountRunningTotal MONEY, --Running total across each account
        -- AccountRunningCount INT, --Like "Rank" across each account
        NCID INT, --For "proof" later in the article
        CONSTRAINT PK_TransactionDetail_TransactionDetailID
            PRIMARY KEY NONCLUSTERED (TransactionDetailID)
            WITH FILLFACTOR = 100
    )
    GO
    CREATE CLUSTERED INDEX IXC_Transaction_AccountID_Date_TransactionDetailID
        ON dbo.TransactionDetail (AccountID, Date, TransactionDetailID)
    CREATE NONCLUSTERED INDEX IX_Transaction_NCID
        ON dbo.TransactionDetail (NCID DESC)
    GO
    SET NOCOUNT ON
    WHILE (ISNULL(IDENT_CURRENT('TransactionDetail'),0)) < 1000000
    BEGIN
        INSERT INTO dbo.TransactionDetail
            (Date, AccountID, Amount)
        SELECT TOP 10000
            --10 years worth of dates with times from 1/1/2000 to 12/31/2009
            CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME) AS Date,
            --100 different account numbers
            ABS(CHECKSUM(NEWID()))%100+1,
            --Dollar amounts from -99.99 to +99.99
            CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY)
        FROM Master.dbo.SysColumns sc1
        CROSS JOIN Master.dbo.SysColumns sc2
    END
    --===== Update the NCID column to be the reverse of the TransactionDetailID column
    UPDATE dbo.TransactionDetail
        SET NCID = 1000000 - TransactionDetailID + 1
    GO
    --SELECT * FROM dbo.TransactionDetail td
    --==================================================================================
    --==================================================================================
    DECLARE
        @b1 DATETIME2(7),
        @b2 DATETIME2(7),
        @b3 DATETIME2(7);

    SET @b1 = SYSDATETIME();
    SELECT
        td.TransactionDetailID,
        td.Date,
        td.AccountID,
        td.Amount,
        AccountRunningTotal = SUM(td.Amount) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING),
        AccountRunningCount = COUNT(1) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING),
        td.NCID
    FROM dbo.TransactionDetail td;

    SET @b2 = SYSDATETIME();
    IF OBJECT_ID('tempdb..#RunningTotals', 'U') IS NOT NULL
        DROP TABLE #RunningTotals;

    SELECT
        td.TransactionDetailID,
        td.[Date],
        td.AccountID,
        td.Amount,
        AccountRunningTotal = CAST(NULL AS MONEY),
        AccountRunningCount = CAST(NULL AS INT),
        td.NCID
    INTO #RunningTotals
    FROM dbo.TransactionDetail td;

    CREATE UNIQUE CLUSTERED INDEX cix_RunningTotals_AccountID_Date_TransactionDetailID
        ON #RunningTotals (AccountID, [Date], TransactionDetailID);

    SET @b3 = SYSDATETIME();
    DECLARE @PrevAccountID INT
    DECLARE @AccountRunningTotal MONEY
    DECLARE @AccountRunningCount INT;

    UPDATE rt
    SET @AccountRunningTotal = rt.AccountRunningTotal = CASE
            WHEN rt.AccountID = @PrevAccountID
            THEN @AccountRunningTotal + rt.Amount
            ELSE rt.Amount
        END,
        @AccountRunningCount = rt.AccountRunningCount = CASE
            WHEN rt.AccountID = @PrevAccountID
            THEN @AccountRunningCount + 1
            ELSE 1
        END,
        @PrevAccountID = rt.AccountID
    FROM #RunningTotals rt WITH (TABLOCKX)
    OPTION (MAXDOP 1);

    SELECT
        rt.TransactionDetailID,
        rt.Date,
        rt.AccountID,
        rt.Amount,
        rt.AccountRunningTotal,
        rt.AccountRunningCount,
        rt.NCID
    FROM #RunningTotals rt
    ORDER BY rt.AccountID, rt.[Date], rt.TransactionDetailID;

    SELECT
        DATEDIFF(ms, @b1, @b2) AS WindowedSUM,
        DATEDIFF(ms, @b2, SYSDATETIME()) AS QuirkyWTemp,
        DATEDIFF(ms, @b3, SYSDATETIME()) AS QuirkyAlone;

    The above allows all rows to render in the display. I also tested a version that dumps the values into variables:

    DECLARE
        @b1 DATETIME2(7),
        @b2 DATETIME2(7),
        @b3 DATETIME2(7),
        @rt MONEY,
        @rc INT;

    SET @b1 = SYSDATETIME();
    SELECT
        @rt = SUM(td.Amount) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING),
        @rc = COUNT(1) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING)
    FROM dbo.TransactionDetail td;

    SET @b2 = SYSDATETIME();
    IF OBJECT_ID('tempdb..#RunningTotals', 'U') IS NOT NULL
        DROP TABLE #RunningTotals;

    SELECT
        td.TransactionDetailID,
        td.[Date],
        td.AccountID,
        td.Amount,
        AccountRunningTotal = CAST(NULL AS MONEY),
        AccountRunningCount = CAST(NULL AS INT),
        td.NCID
    INTO #RunningTotals
    FROM dbo.TransactionDetail td;

    CREATE UNIQUE CLUSTERED INDEX cix_RunningTotals_AccountID_Date_TransactionDetailID
        ON #RunningTotals (AccountID, [Date], TransactionDetailID);

    SET @b3 = SYSDATETIME();
    DECLARE @PrevAccountID INT
    DECLARE @AccountRunningTotal MONEY
    DECLARE @AccountRunningCount INT;

    UPDATE rt
    SET @AccountRunningTotal = rt.AccountRunningTotal = CASE
            WHEN rt.AccountID = @PrevAccountID
            THEN @AccountRunningTotal + rt.Amount
            ELSE rt.Amount
        END,
        @AccountRunningCount = rt.AccountRunningCount = CASE
            WHEN rt.AccountID = @PrevAccountID
            THEN @AccountRunningCount + 1
            ELSE 1
        END,
        @PrevAccountID = rt.AccountID
    FROM #RunningTotals rt WITH (TABLOCKX)
    OPTION (MAXDOP 1);

    SELECT
        DATEDIFF(ms, @b1, @b2) AS WindowedSUM,
        DATEDIFF(ms, @b2, SYSDATETIME()) AS QuirkyWTemp,
        DATEDIFF(ms, @b3, SYSDATETIME()) AS QuirkyAlone;

    Here are my results... [@@Version = Microsoft SQL Server 2014 - 12.0.4100.1 (X64) Apr 20 2015 17:29:27 Copyright (c) Microsoft Corporation Developer Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)]

    -- with display rendering --
    WindowedSUM   QuirkyWTemp   QuirkyAlone
    4408          5662          5192
    4868          5553          5132
    4322          5631          5195
    4321          5710          5257
    4212          5678          5241

    -- without display rendering --
    WindowedSUM   QuirkyWTemp   QuirkyAlone
    842           1358          905
    843           1326          889
    843           1357          905
    843           1294          873
    858           1342          889

  • Jason A. Long (12/24/2015)


    How about this one?

    Now you're cooking with gas. MUCH better data and test!... and, I stand corrected.

    1. If all you want to do is display the data, SUM() OVER wins.

    2. If you want to create a separate table with the running totals, SUM() OVER wins.

    3. If you want to do an in-place update on the original table, Quirky Update wins. This one is what Wayne's test did.

    I used your code for the 3 various tests above.

    For #1, I used your test as is.

    For #2, I added an "INTO" to the SUM() OVER code and commented out the final SELECT for the QU.

    For #3, I used Wayne's code.

    Both Wayne and I skipped testing for #1 and #2... a mistake for both of us. :blush:

    Well done, Jason! I learned something new today! :w00t: You should write an article on this.

    --Jeff Moden


