Linking to next row using a join on row_number()+1=row_number()

  • Derek Dongray (4/14/2008)


    Including the update, this is 50% faster. Excluding the update, it's 100% faster.

    The only thing I'm dubious about is whether the update will always get the records in the right order?

    If there is a clustered index on the table over columns (id, date) it will.

    CREATE CLUSTERED INDEX IX_t ON #t (id, date)

You can also put the UPDATE part in a trigger and update only the records for the affected id; a sketch of such a trigger is below.

It doesn't matter what the actual seq values are, as long as seq is unique within each id group.
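
A minimal sketch of that trigger idea, assuming the production table is tblWorkFlow with columns id, date and seq (names taken from the tests later in this thread; adjust to the real schema):

CREATE TRIGGER trg_WorkFlow_Seq ON dbo.tblWorkFlow
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Renumber seq only for the ids that just received rows, so seq stays
    -- unique and in date order within each id group.
    UPDATE w
    SET    w.seq = x.rn
    FROM   dbo.tblWorkFlow w
    JOIN  (SELECT id, date,
                  ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
           FROM   dbo.tblWorkFlow) x
          ON x.id = w.id AND x.date = w.date
    WHERE  w.id IN (SELECT id FROM inserted);
END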


    N 56°04'39.16"
    E 12°55'05.25"

  • Anybody leave part of this cow for me?

    This seems to me to be the most obvious approach, though perhaps not the fastest (when I needed this kind of query to be fast, I've always used the SELECT..INTO technique that Jeff showed to add a sequence ID to the data set):

    set nocount off

    set statistics io on

    set statistics time on

    Select a.id, a.st as sta, a.date as da, b.st as stb, b.date as db

    From #t A

    Join #t B ON ( a.id=b.id and a.date b.st

    And Not Exists (Select * from #t b2

    Where a.id=b2.id -- and/or c.id=b.id

    And a.st<>b2.st

    And a.date < b2.date

    and b2.date < b.date

    ) )

    Where Not Exists (Select * from #t a2

    Where a.id=a2.id

    And a.st=a2.st

    And a.date < a2.date

    And a2.date < b.date

    )

    Order By a.id, da

-- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."

  • I created a couple of test queries using the real data based on Peso's method and using row_number().

    After several tests, Peso's method was fractionally faster (41015ms) than the triangular/diagonal join method (45468ms) in producing 36820 paired status records from input of 80351 records. The row_number() method completely bombed with 660685ms for some reason! I checked the implementation several times to see if I'd got something wrong, but eventually gave up.

    For Peso's method, if I add the overhead of setting the Seq field to null (656ms) and updating it (891ms) then the gain is less (42562ms). Adding an index on Seq fractionally improves the query (40547ms) but significantly worsens the clearing (3064ms) and update (3297ms) making it slower overall (46908ms). Since it's basically doing sequential processing, it's not a surprise that an index doesn't help.

    The conclusion is that the triangular/diagonal join isn't the problem it's perceived to be and I might just as well leave the system using the working code.
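
For reference, the general shape of a row_number() pairing like the one compared above is sketched here against the #t test table (the production query also has to collapse runs of the same status before pairing, which this sketch leaves out):

WITH s AS
(
    SELECT id, st, date,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
    FROM   #t
)
SELECT a.id, a.st AS sta, a.date AS da, b.st AS stb, b.date AS db
FROM   s a
JOIN   s b
  ON   b.id = a.id
 AND   b.rn = a.rn + 1
ORDER BY a.id, da;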

    Thanks to all who took a look at it for me.

    Derek

  • Can the data generator code you provided be used to generate that many rows? I'd like to "play" some more because 41 to 45 seconds to produce "36820 paired status records from input of 80351 records" absolutely sucks no matter which way you do it. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/15/2008)


    Can the data generator code you provided be used to generate that many rows? I'd like to "play" some more because 41 to 45 seconds to produce "36820 paired status records from input of 80351 records" absolutely sucks no matter which way you do it. 😉

    I agree, but I don't think the problem with the real selection is in the algorithm.

The test generator can do any number of records. Here are the results from running 200000 records on my laptop:

    200000 records generated

    >>>> original (triangular)

    (806874 row(s) affected)

    SQL Server Execution Times:

    CPU time = 17625 ms, elapsed time = 1773935 ms.

    >>>> using row_number()

    (806874 row(s) affected)

    SQL Server Execution Times:

    CPU time = 13000 ms, elapsed time = 1547006 ms.

    >>>> using dense_rank()

    (806874 row(s) affected)

    SQL Server Execution Times:

    CPU time = 12765 ms, elapsed time = 1380089 ms.

    >>>> peso

    >Update

    SQL Server Execution Times:

    CPU time = 2984 ms, elapsed time = 184522 ms.

    (1045946 row(s) affected)

    >select

    (806874 row(s) affected)

    SQL Server Execution Times:

    CPU time = 7703 ms, elapsed time = 1148723 ms.

    >>>> Jeff (copy) [uncorrected for minor error]

    >copy

    SQL Server Execution Times:

    CPU time = 1391 ms, elapsed time = 98708 ms.

    (1045946 row(s) affected)

    >select [slight error here]

    (845946 row(s) affected)

    SQL Server Execution Times:

    CPU time = 3985 ms, elapsed time = 999281 ms.

The problem with the 'real' data is that I need to work out working-day corrections dependent on country and timezone.

    Here are the results from one of the 'full' test runs (zero values from reads removed).

    Existing method using triangular/diagonal join

    -------- ------ ----- ------------------- ----

    Table 'tblClient'. Logical reads 182.

    Table 'tblCompany'. Scan count 1, logical reads 2.

    Table 'Worktable'. Scan count 42, logical reads 75318.

    Table 'tblWorkFlow'. Scan count 30, logical reads 34608.

    Table 'tblCountry'. Scan count 12, logical reads 24.

    Table 'WorkingDays'. Scan count 36825, logical reads 122608.

    SQL Server Execution Times:

    CPU time = 45468 ms, elapsed time = 12836 ms.

    (36820 row(s) affected)

    -------------------------------------------------------------

    'Peso' method using seq field on tblWorkflow (indexed seq)

    ------ ------ ----- --- ----- -- -----------

    Step 1: set seq = null

    -------

    Table 'Worktable'. Scan count 4, logical reads 163390.

    Table 'tblWorkFlow'. Scan count 5, logical reads 618352.

    SQL Server Execution Times:

    CPU time = 3064 ms, elapsed time = 2424 ms.

    (80351 row(s) affected)

    Step 2: set seq=seq+1

    -------

    Table 'Worktable'. Scan count 4, logical reads 164118.

    Table 'tblWorkFlow'. Scan count 5, logical reads 628685.

    SQL Server Execution Times:

    CPU time = 3297 ms, elapsed time = 3069 ms.

    (80351 row(s) affected)

    Step 3: select

    -------

    Table 'tblClient'. Logical reads 182.

    Table 'tblCompany'. Scan count 1, logical reads 2.

    Table 'Worktable'. Scan count 42, logical reads 75318.

    Table 'tblWorkFlow'. Scan count 20, logical reads 7782.

    Table 'WorkingDays'. Scan count 10, logical reads 1178.

    Table 'tblCountry'. Scan count 12, logical reads 24.

    SQL Server Execution Times:

    CPU time = 40547 ms, elapsed time = 12635 ms.

    -------

    (36820 row(s) affected)

    Totals:

    CPU time = 46908 ms, elapsed time = 18128 ms.

    -------------------------------------------------------------

    'Peso' method using seq field on tblWorkflow (no index on seq)

    ------ ------ ----- --- ----- -- -----------

    Step 1: set seq = null

    -------

    Table 'tblWorkFlow'. Scan count 1, logical reads 246418.

    SQL Server Execution Times:

    CPU time = 656 ms, elapsed time = 654 ms.

    (80351 row(s) affected)

    Step 2: set seq=seq+1

    -------

    Table 'tblWorkFlow'. Scan count 1, logical reads 246847.

    SQL Server Execution Times:

    CPU time = 891 ms, elapsed time = 901 ms.

    (80351 row(s) affected)

    Step 3: select

    -------

    Table 'tblClient'. Logical reads 182.

    Table 'tblCompany'. Scan count 1, logical reads 2.

    Table 'Worktable'. Scan count 42, logical reads 75318.

    Table 'tblWorkFlow'. Scan count 20, logical reads 42654.

    Table 'WorkingDays'. Scan count 10, logical reads 1196.

    Table 'tblCountry'. Scan count 12, logical reads 24.

    SQL Server Execution Times:

    CPU time = 41015 ms, elapsed time = 12322 ms.

    (36820 row(s) affected)

    Totals:

    CPU time = 42562 ms, elapsed time = 13877 ms.

    -------------------------------------------------------------

    Using CTEs and row_number()

    ----- ---- --- ------------

    Table 'WorkingDays'. Scan count 10, logical reads 1352.

    Table 'tblClient'. Logical reads 182.

    Table 'tblCompany'. Scan count 1, logical reads 2.

    Table 'Worktable'. Scan count 12.

    Table 'tblWorkFlow'. Scan count 16, logical reads 77194.

    Table 'tblCountry'. Scan count 12, logical reads 24.

    SQL Server Execution Times:

    CPU time = 660685 ms, elapsed time = 281187 ms.

    (36820 row(s) affected)

    As you can see, the scan count on the table 'WorkingDays' in the first query is ridiculous. This will be the next area of investigation as I believe it should be possible to significantly improve this.
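
One thing that might be worth trying first (a guess only, since the WorkingDays definition isn't shown here): make sure the lookup that is being repeated 36,825 times can be answered by a narrow covering index instead of a scan, something along these lines (column names are hypothetical):

-- Hypothetical column names; adjust to the real WorkingDays table.
CREATE INDEX IX_WorkingDays_Country_Date
    ON dbo.WorkingDays (CountryId, CalendarDate)
    INCLUDE (WorkingDayNumber);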

    Derek

  • Perhaps I'm missing some of the subtleties here, but this processes 350,000 pairs out of 500,000 records in 17 seconds.

--clean up the environment
DROP TABLE #t
DROP TABLE #matt
DROP TABLE #matt2
GO
--create test data
CREATE TABLE #t (id INT, st CHAR(1), date DATETIME)
INSERT #t (st, date, id)
SELECT TOP 250000 CHAR(CAST(RAND(CHECKSUM(NEWID()))*3 AS INT)+65),
   DATEADD(mi, RAND(CHECKSUM(NEWID()))*1440.0*28, CONVERT(DATETIME,'2008/2/1')),
   CAST(RAND(CHECKSUM(NEWID()))*1000+1 AS INT)
FROM sys.all_columns sc1, sys.all_columns sc2
GO
--start the processing
DECLARE @g DATETIME
SET @g=GETDATE()
SELECT DISTINCT *, CAST(0 AS INT) AS grp
INTO #matt
FROM #t
CREATE UNIQUE CLUSTERED INDEX pkM ON #matt(id, date, st)
DECLARE @prevID INT
SET @prevID=0;
DECLARE @prevgrp INT
SET @prevgrp=0;
DECLARE @prevstat CHAR
SET @prevstat='';
-- "quirky update": walk the clustered index in (id, date, st) order and assign a
-- running group number that starts at 1 for each id and bumps whenever st changes
UPDATE #matt
SET @prevgrp=grp=CASE WHEN @prevID<>id THEN 1
                      WHEN @prevstat<>st THEN @prevgrp+1
                      ELSE @prevgrp END,
    @prevID=id,
    @prevstat=st
FROM #matt WITH (TABLOCKX, INDEX(pkM))
OPTION (MAXDOP 1)
-- collapse each run of identical st into one row, keeping the last date in the run
SELECT id, grp, st, MAX(date) lastdte
INTO #matt2
FROM #matt
GROUP BY id, grp, st
-- pair each run with the following run for the same id
SELECT m1.id, m1.st oldst,
       m1.lastdte oldlast,
       m2.st newst,
       m2.lastdte newlast
FROM #matt2 m1
     LEFT OUTER JOIN #matt2 m2 ON m1.id=m2.id AND m1.grp=m2.grp-1
ORDER BY m1.id, m2.grp
SELECT DATEDIFF(ms, @g, GETDATE())
GO
--clean up the environment
DROP TABLE #t
DROP TABLE #matt
DROP TABLE #matt2

    100000 goes in under 5 seconds.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • ...actually - by adding a clustered index right before that last results query - it drops to 14 seconds (14221 ms) for 500,000 rows.

    create unique clustered index #pkm2 on #matt2(id,grp)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • I ran some tests measuring the elapsed (from datediff(ms,,getdate())) and cpu (from @@cpu_busy*@@timeticks) times on both my laptop and the server where this application resides.

I generated 500,000 series, which produced about 830,000 pairs, using 5 methods (Jeff's doesn't ignore unchanged states, so I left it out). Output was selected into temp tables (which improves times).

Laptop:

Test 1:

what                     elapsed     cpu
peso                       24203   14093
derek 2 (row_num)          27716   15750
derek 3 (dense_rank)       23876   16437
derek 1 (tri/diag join)    25546   17218
matt                       26516   19937

Test 2:

what                     elapsed     cpu
peso                       30200   14968
derek 2 (row_num)          21436   15531
derek 3 (dense_rank)       22440   16062
derek 1 (tri/diag join)    36453   19031
matt                       24173   20093

Server:

Test 1:

what                     elapsed     cpu
matt                       23546   28406
peso                       15623   29656
derek 1 (tri/diag join)    13186   35343
derek 2 (row_num)          21250   41968
derek 3 (dense_rank)       20923   42500

Test 2:

what                     elapsed     cpu
peso                       14170   28531
matt                       24806   28718
derek 1 (tri/diag join)    13560   35406
derek 2 (row_num)          19196   41718
derek 3 (dense_rank)       19603   42125

Times in ms.

    On the server, Matt's and Peso's take roughly the same CPU time, but Peso's is faster on elapsed time (presumably because of taking advantage of multiple processors). However, even though the triangular join method takes more CPU time, it obviously makes even better use of the multiple processors and consistently finishes faster.

    I don't understand why the laptop consistently uses less CPU time. Presumably a better execution plan is chosen for a uni-processor.

    Is there some way to force a plan for a given number of processors?

    Hardware:

    Laptop has 2.0Ghz Celeron (M) + 1.5Gb memory

    Server has 4*3.4Ghz Xeon + 4.0Gb memory

    Derek

Are those timings for Peso for both the UPDATE and the SELECT?

What are the times for just the SELECT, if the UPDATEs are already done with a trigger?

    You can hint a query with MAXDOP.
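
For example (illustrative only, using the #t test table), the hint caps a single statement at one CPU without touching the server-wide setting:

SELECT id, st, date
FROM   #t
ORDER BY id, date
OPTION (MAXDOP 1)   -- serial plan for this statement only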


    N 56°04'39.16"
    E 12°55'05.25"

You should try the query that I posted; it is highly parallel as well.

If you need a working version of Jeff's query, here is one that I modified, though it is only about 70% as fast as the original:

set nocount off
set statistics io on
set statistics time on

-- Step 1: copy the data in (ID, Date) order, stamping a sequence number
SELECT IDENTITY(INT,1,1) AS RowNum,
       ID, ST, Date
INTO #MyHead
FROM #t
ORDER BY ID, Date

-- Step 2: keep only the rows where the status changed from the previous row,
--         numbering the changes as we go
Select IDENTITY(INT,1,1) AS ChangeNum,
       Cast(RowNum as int) as RowNum, ID, ST, Date
Into #StChange
From #MyHead H1
Where NOT Exists (Select * From #MyHead H2
                  Where H1.RowNum = H2.RowNum+1
                    And H1.St = H2.St
                    And H1.ID = H2.ID)
Order By RowNum

-- Step 3: pair each status change with the next one for the same ID
SELECT t1.ID,
       t1.ST AS STa,
       t1.Date AS DateA,
       t2.ST AS STb,
       t2.Date AS DateB
FROM #StChange t1
JOIN #StChange t2
  ON t1.ChangeNum+1 = t2.ChangeNum
 And t1.ID = t2.ID
ORDER BY t1.ID

print 'done.'
Drop table #MyHead
Drop table #StChange

-- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."

For what it's worth - the update method for forcing order will fall apart with a parallel plan, so don't change the MAXDOP on the test I gave you; that might explain the difference in time.

A parallel plan will wreck any guarantees as to order during the updates, so you might care to check the accuracy of your results if any of your methods rely on that. You may find that you're getting very fast but, unfortunately, very inaccurate results.
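
As a quick sanity check (a sketch against the #matt table from the earlier post; not part of the original code): renumber the rows the way the clustered index orders them, join each row to its predecessor, and count the rows whose grp breaks the expected pattern.

WITH o AS
(
    SELECT id, st, grp,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date, st) AS rn
    FROM   #matt
)
SELECT COUNT(*) AS bad_rows   -- expect 0; anything else means the update ran out of order
FROM   o cur
JOIN   o prv
  ON   prv.id = cur.id
 AND   prv.rn = cur.rn - 1
WHERE  cur.grp <> CASE WHEN cur.st = prv.st THEN prv.grp ELSE prv.grp + 1 END;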

    For more info on that - take a look at some of the tests Jeff ran here:

http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • There was an error here...

    rbarryyoung (4/15/2008)


    Join #t B ON ( a.id=b.id and a.date b.st

    What should the line have been?

    Derek

Also - one more thought for you. The "triangular join" processes will tend to work well if ALL of the sequences/groups are small. However, even just one or two outlier groups (with a sizeable chunk of rows in the grouping) can give you a really bad day. Just be sure that your production data is as balanced as your test data, and that your test is in fact representative of the "real" data.
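
A quick way to check the shape of the data (a sketch against the #t test table; swap in the real table): list the biggest id groups, since the triangular join's cost grows roughly with the square of the largest group.

SELECT TOP 20 id, COUNT(*) AS rows_in_group
FROM   #t
GROUP BY id
ORDER BY COUNT(*) DESC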

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Derek Dongray (4/16/2008)


    There was an error here...

    rbarryyoung (4/15/2008)


    Join #t B ON ( a.id=b.id and a.date b.st

    What should the line have been?

    Darn angle bracket filters! (blankety-blank!)

    Hope this works:

Join #t B ON ( a.id=b.id and a.date < b.date and a.st <> b.st

-- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."

  • Matt Miller (4/16/2008)


For what it's worth - the update method for forcing order will fall apart with a parallel plan, so don't change the MAXDOP on the test I gave you; that might explain the difference in time.

A parallel plan will wreck any guarantees as to order during the updates, so you might care to check the accuracy of your results if any of your methods rely on that. You may find that you're getting very fast but, unfortunately, very inaccurate results.

    I'm not sure I understand this.

Our server has the 'Max degree of parallelism' parameter set to 0 (the default), which means it will actually use a value of 4 (the number of processors) when looking for a parallel plan. However, from what you say, the update methods that expect a particular order (using a clustered index) may produce inaccurate results if MAXDOP is not 1. So the update methods, while fast and producing the correct number of records, may need careful checking.

    Have I got that right?

    Derek
