Linking to next row using a join on row_number()+1=row_number()

  • Derek Dongray (4/14/2008)


    Including the update, this is 50% faster. Excluding the update, it's 100% faster.

    The only thing I'm dubious about is whether the update will always get the records in the right order?

    If there is a clustered index on the table over columns (id, date) it will.

    CREATE CLUSTERED INDEX IX_t ON #t (id, date)

You can also put the UPDATE part in a trigger and update only the records for the affected id; a sketch of such a trigger is below.

It doesn't matter what the actual seq values are, as long as seq is unique within each id group.
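
A minimal sketch of that trigger idea, assuming the production table is tblWorkFlow with columns id, date and seq (names taken from the tests later in this thread; adjust to the real schema):

CREATE TRIGGER trg_WorkFlow_Seq ON dbo.tblWorkFlow
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Renumber seq only for the ids that just received rows, so seq stays
    -- unique and in date order within each id group.
    UPDATE w
    SET    w.seq = x.rn
    FROM   dbo.tblWorkFlow w
    JOIN  (SELECT id, date,
                  ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
           FROM   dbo.tblWorkFlow) x
          ON x.id = w.id AND x.date = w.date
    WHERE  w.id IN (SELECT id FROM inserted);
END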


    N 56°04'39.16"
    E 12°55'05.25"

  • Anybody leave part of this cow for me?

    This seems to me to be the most obvious approach, though perhaps not the fastest (when I needed this kind of query to be fast, I've always used the SELECT..INTO technique that Jeff showed to add a sequence ID to the data set):

    set nocount off

    set statistics io on

    set statistics time on

    Select a.id, a.st as sta, a.date as da, b.st as stb, b.date as db

    From #t A

    Join #t B ON ( a.id=b.id and a.date b.st

    And Not Exists (Select * from #t b2

    Where a.id=b2.id -- and/or c.id=b.id

    And a.st<>b2.st

    And a.date < b2.date

    and b2.date < b.date

    ) )

    Where Not Exists (Select * from #t a2

    Where a.id=a2.id

    And a.st=a2.st

    And a.date < a2.date

    And a2.date < b.date

    )

    Order By a.id, da

-- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."

  • I created a couple of test queries using the real data based on Peso's method and using row_number().

    After several tests, Peso's method was fractionally faster (41015ms) than the triangular/diagonal join method (45468ms) in producing 36820 paired status records from input of 80351 records. The row_number() method completely bombed with 660685ms for some reason! I checked the implementation several times to see if I'd got something wrong, but eventually gave up.

    For Peso's method, if I add the overhead of setting the Seq field to null (656ms) and updating it (891ms) then the gain is less (42562ms). Adding an index on Seq fractionally improves the query (40547ms) but significantly worsens the clearing (3064ms) and update (3297ms) making it slower overall (46908ms). Since it's basically doing sequential processing, it's not a surprise that an index doesn't help.

    The conclusion is that the triangular/diagonal join isn't the problem it's perceived to be and I might just as well leave the system using the working code.
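
For reference, the general shape of a row_number() pairing like the one compared above is sketched here against the #t test table (the production query also has to collapse runs of the same status before pairing, which this sketch leaves out):

WITH s AS
(
    SELECT id, st, date,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
    FROM   #t
)
SELECT a.id, a.st AS sta, a.date AS da, b.st AS stb, b.date AS db
FROM   s a
JOIN   s b
  ON   b.id = a.id
 AND   b.rn = a.rn + 1
ORDER BY a.id, da;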

    Thanks to all who took a look at it for me.

    Derek

  • Can the data generator code you provided be used to generate that many rows? I'd like to "play" some more because 41 to 45 seconds to produce "36820 paired status records from input of 80351 records" absolutely sucks no matter which way you do it. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/15/2008)


    Can the data generator code you provided be used to generate that many rows? I'd like to "play" some more because 41 to 45 seconds to produce "36820 paired status records from input of 80351 records" absolutely sucks no matter which way you do it. 😉

    I agree, but I don't think the problem with the real selection is in the algorithm.

The test generator can do any number of records. Here are the results from running 200000 records on my laptop:

    200000 records generated

    >>>> original (triangular)

    (806874 row(s) affected)

    SQL Server Execution Times:

    CPU time = 17625 ms, elapsed time = 1773935 ms.

    >>>> using row_number()

    (806874 row(s) affected)

    SQL Server Execution Times:

    CPU time = 13000 ms, elapsed time = 1547006 ms.

    >>>> using dense_rank()

    (806874 row(s) affected)

    SQL Server Execution Times:

    CPU time = 12765 ms, elapsed time = 1380089 ms.

    >>>> peso

    >Update

    SQL Server Execution Times:

    CPU time = 2984 ms, elapsed time = 184522 ms.

    (1045946 row(s) affected)

    >select

    (806874 row(s) affected)

    SQL Server Execution Times:

    CPU time = 7703 ms, elapsed time = 1148723 ms.

    >>>> Jeff (copy) [uncorrected for minor error]

    >copy

    SQL Server Execution Times:

    CPU time = 1391 ms, elapsed time = 98708 ms.

    (1045946 row(s) affected)

    >select [slight error here]

    (845946 row(s) affected)

    SQL Server Execution Times:

    CPU time = 3985 ms, elapsed time = 999281 ms.

The problem with the 'real' data is that I need to work out working-day corrections dependent on country and timezone.

    Here are the results from one of the 'full' test runs (zero values from reads removed).

    Existing method using triangular/diagonal join

    -------- ------ ----- ------------------- ----

    Table 'tblClient'. Logical reads 182.

    Table 'tblCompany'. Scan count 1, logical reads 2.

    Table 'Worktable'. Scan count 42, logical reads 75318.

    Table 'tblWorkFlow'. Scan count 30, logical reads 34608.

    Table 'tblCountry'. Scan count 12, logical reads 24.

    Table 'WorkingDays'. Scan count 36825, logical reads 122608.

    SQL Server Execution Times:

    CPU time = 45468 ms, elapsed time = 12836 ms.

    (36820 row(s) affected)

    -------------------------------------------------------------

    'Peso' method using seq field on tblWorkflow (indexed seq)

    ------ ------ ----- --- ----- -- -----------

    Step 1: set seq = null

    -------

    Table 'Worktable'. Scan count 4, logical reads 163390.

    Table 'tblWorkFlow'. Scan count 5, logical reads 618352.

    SQL Server Execution Times:

    CPU time = 3064 ms, elapsed time = 2424 ms.

    (80351 row(s) affected)

    Step 2: set seq=seq+1

    -------

    Table 'Worktable'. Scan count 4, logical reads 164118.

    Table 'tblWorkFlow'. Scan count 5, logical reads 628685.

    SQL Server Execution Times:

    CPU time = 3297 ms, elapsed time = 3069 ms.

    (80351 row(s) affected)

    Step 3: select

    -------

    Table 'tblClient'. Logical reads 182.

    Table 'tblCompany'. Scan count 1, logical reads 2.

    Table 'Worktable'. Scan count 42, logical reads 75318.

    Table 'tblWorkFlow'. Scan count 20, logical reads 7782.

    Table 'WorkingDays'. Scan count 10, logical reads 1178.

    Table 'tblCountry'. Scan count 12, logical reads 24.

    SQL Server Execution Times:

    CPU time = 40547 ms, elapsed time = 12635 ms.

    -------

    (36820 row(s) affected)

    Totals:

    CPU time = 46908 ms, elapsed time = 18128 ms.

    -------------------------------------------------------------

    'Peso' method using seq field on tblWorkflow (no index on seq)

    ------ ------ ----- --- ----- -- -----------

    Step 1: set seq = null

    -------

    Table 'tblWorkFlow'. Scan count 1, logical reads 246418.

    SQL Server Execution Times:

    CPU time = 656 ms, elapsed time = 654 ms.

    (80351 row(s) affected)

    Step 2: set seq=seq+1

    -------

    Table 'tblWorkFlow'. Scan count 1, logical reads 246847.

    SQL Server Execution Times:

    CPU time = 891 ms, elapsed time = 901 ms.

    (80351 row(s) affected)

    Step 3: select

    -------

    Table 'tblClient'. Logical reads 182.

    Table 'tblCompany'. Scan count 1, logical reads 2.

    Table 'Worktable'. Scan count 42, logical reads 75318.

    Table 'tblWorkFlow'. Scan count 20, logical reads 42654.

    Table 'WorkingDays'. Scan count 10, logical reads 1196.

    Table 'tblCountry'. Scan count 12, logical reads 24.

    SQL Server Execution Times:

    CPU time = 41015 ms, elapsed time = 12322 ms.

    (36820 row(s) affected)

    Totals:

    CPU time = 42562 ms, elapsed time = 13877 ms.

    -------------------------------------------------------------

    Using CTEs and row_number()

    ----- ---- --- ------------

    Table 'WorkingDays'. Scan count 10, logical reads 1352.

    Table 'tblClient'. Logical reads 182.

    Table 'tblCompany'. Scan count 1, logical reads 2.

    Table 'Worktable'. Scan count 12.

    Table 'tblWorkFlow'. Scan count 16, logical reads 77194.

    Table 'tblCountry'. Scan count 12, logical reads 24.

    SQL Server Execution Times:

    CPU time = 660685 ms, elapsed time = 281187 ms.

    (36820 row(s) affected)

    As you can see, the scan count on the table 'WorkingDays' in the first query is ridiculous. This will be the next area of investigation as I believe it should be possible to significantly improve this.
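
One thing that might be worth trying first (a guess only, since the WorkingDays definition isn't shown here): make sure the lookup that is being repeated 36,825 times can be answered by a narrow covering index instead of a scan, something along these lines (column names are hypothetical):

-- Hypothetical column names; adjust to the real WorkingDays table.
CREATE INDEX IX_WorkingDays_Country_Date
    ON dbo.WorkingDays (CountryId, CalendarDate)
    INCLUDE (WorkingDayNumber);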

    Derek

  • Perhaps I'm missing some of the subtleties here, but this processes 350,000 pairs out of 500,000 records in 17 seconds.

--clean up the environment
DROP TABLE #t
DROP TABLE #matt
DROP TABLE #matt2
GO
--create test data
CREATE TABLE #t (id INT, st CHAR(1), date DATETIME)
INSERT #t (st, date, id)
SELECT TOP 250000 CHAR(CAST(RAND(CHECKSUM(NEWID()))*3 AS INT)+65),
   DATEADD(mi, RAND(CHECKSUM(NEWID()))*1440.0*28, CONVERT(DATETIME,'2008/2/1')),
   CAST(RAND(CHECKSUM(NEWID()))*1000+1 AS INT)
FROM sys.all_columns sc1, sys.all_columns sc2
GO
--start the processing
DECLARE @g DATETIME
SET @g=GETDATE()
SELECT DISTINCT *, CAST(0 AS INT) AS grp
INTO #matt
FROM #t
CREATE UNIQUE CLUSTERED INDEX pkM ON #matt(id, date, st)
DECLARE @prevID INT
SET @prevID=0;
DECLARE @prevgrp INT
SET @prevgrp=0;
DECLARE @prevstat CHAR
SET @prevstat='';
-- "quirky update": walk the clustered index in (id, date, st) order and assign a
-- running group number that starts at 1 for each id and bumps whenever st changes
UPDATE #matt
SET @prevgrp=grp=CASE WHEN @prevID<>id THEN 1
                      WHEN @prevstat<>st THEN @prevgrp+1
                      ELSE @prevgrp END,
    @prevID=id,
    @prevstat=st
FROM #matt WITH (TABLOCKX, INDEX(pkM))
OPTION (MAXDOP 1)
-- collapse each run of identical st into one row, keeping the last date in the run
SELECT id, grp, st, MAX(date) lastdte
INTO #matt2
FROM #matt
GROUP BY id, grp, st
-- pair each run with the following run for the same id
SELECT m1.id, m1.st oldst,
       m1.lastdte oldlast,
       m2.st newst,
       m2.lastdte newlast
FROM #matt2 m1
     LEFT OUTER JOIN #matt2 m2 ON m1.id=m2.id AND m1.grp=m2.grp-1
ORDER BY m1.id, m2.grp
SELECT DATEDIFF(ms, @g, GETDATE())
GO
--clean up the environment
DROP TABLE #t
DROP TABLE #matt
DROP TABLE #matt2

    100000 goes in under 5 seconds.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • ...actually - by adding a clustered index right before that last results query - it drops to 14 seconds (14221 ms) for 500,000 rows.

    create unique clustered index #pkm2 on #matt2(id,grp)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • I ran some tests measuring the elapsed (from datediff(ms,,getdate())) and cpu (from @@cpu_busy*@@timeticks) times on both my laptop and the server where this application resides.

I generated 500,000 series, which produced about 830,000 pairs, using 5 methods (Jeff's doesn't ignore unchanged states, so I left it out). Output was selected into temp tables (which improves times).

Laptop:

Test 1:

what                     elapsed     cpu
peso                       24203   14093
derek 2 (row_num)          27716   15750
derek 3 (dense_rank)       23876   16437
derek 1 (tri/diag join)    25546   17218
matt                       26516   19937

Test 2:

what                     elapsed     cpu
peso                       30200   14968
derek 2 (row_num)          21436   15531
derek 3 (dense_rank)       22440   16062
derek 1 (tri/diag join)    36453   19031
matt                       24173   20093

Server:

Test 1:

what                     elapsed     cpu
matt                       23546   28406
peso                       15623   29656
derek 1 (tri/diag join)    13186   35343
derek 2 (row_num)          21250   41968
derek 3 (dense_rank)       20923   42500

Test 2:

what                     elapsed     cpu
peso                       14170   28531
matt                       24806   28718
derek 1 (tri/diag join)    13560   35406
derek 2 (row_num)          19196   41718
derek 3 (dense_rank)       19603   42125

Times in ms.

    On the server, Matt's and Peso's take roughly the same CPU time, but Peso's is faster on elapsed time (presumably because of taking advantage of multiple processors). However, even though the triangular join method takes more CPU time, it obviously makes even better use of the multiple processors and consistently finishes faster.

    I don't understand why the laptop consistently uses less CPU time. Presumably a better execution plan is chosen for a uni-processor.

    Is there some way to force a plan for a given number of processors?

    Hardware:

    Laptop has 2.0Ghz Celeron (M) + 1.5Gb memory

    Server has 4*3.4Ghz Xeon + 4.0Gb memory

    Derek

Are those timings for Peso for both the UPDATE and the SELECT?

What are the times for just the SELECT, if the UPDATEs are already done with a trigger?

    You can hint a query with MAXDOP.
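
For example (illustrative only, using the #t test table), the hint caps a single statement at one CPU without touching the server-wide setting:

SELECT id, st, date
FROM   #t
ORDER BY id, date
OPTION (MAXDOP 1)   -- serial plan for this statement only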


    N 56°04'39.16"
    E 12°55'05.25"

You should try the query that I posted; it is highly parallel as well.

If you need a working version of Jeff's query, here is one that I modified, though it is only about 70% as fast as the original:

set nocount off
set statistics io on
set statistics time on

-- Step 1: copy the data in (ID, Date) order, stamping a sequence number
SELECT IDENTITY(INT,1,1) AS RowNum,
       ID, ST, Date
INTO #MyHead
FROM #t
ORDER BY ID, Date

-- Step 2: keep only the rows where the status changed from the previous row,
--         numbering the changes as we go
Select IDENTITY(INT,1,1) AS ChangeNum,
       Cast(RowNum as int) as RowNum, ID, ST, Date
Into #StChange
From #MyHead H1
Where NOT Exists (Select * From #MyHead H2
                  Where H1.RowNum = H2.RowNum+1
                    And H1.St = H2.St
                    And H1.ID = H2.ID)
Order By RowNum

-- Step 3: pair each status change with the next one for the same ID
SELECT t1.ID,
       t1.ST AS STa,
       t1.Date AS DateA,
       t2.ST AS STb,
       t2.Date AS DateB
FROM #StChange t1
JOIN #StChange t2
  ON t1.ChangeNum+1 = t2.ChangeNum
 And t1.ID = t2.ID
ORDER BY t1.ID

print 'done.'
Drop table #MyHead
Drop table #StChange

-- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."

For what it's worth - the update method for forcing order will fall apart with a parallel plan, so don't change the MAXDOP on the test I gave you; that might explain the difference in time.

A parallel plan will wreck any guarantees as to order during the updates, so you might care to check the accuracy of your results if any of your methods rely on that. You may find that you're getting very fast but, unfortunately, very inaccurate results.
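
As a quick sanity check (a sketch against the #matt table from the earlier post; not part of the original code): renumber the rows the way the clustered index orders them, join each row to its predecessor, and count the rows whose grp breaks the expected pattern.

WITH o AS
(
    SELECT id, st, grp,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date, st) AS rn
    FROM   #matt
)
SELECT COUNT(*) AS bad_rows   -- expect 0; anything else means the update ran out of order
FROM   o cur
JOIN   o prv
  ON   prv.id = cur.id
 AND   prv.rn = cur.rn - 1
WHERE  cur.grp <> CASE WHEN cur.st = prv.st THEN prv.grp ELSE prv.grp + 1 END;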

    For more info on that - take a look at some of the tests Jeff ran here:

http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • There was an error here...

    rbarryyoung (4/15/2008)


    Join #t B ON ( a.id=b.id and a.date b.st

    What should the line have been?

    Derek

Also - one more thought for you. The "triangular join" processes will tend to work well if ALL of the sequences/groups are small. However, even just one or two outlier groups (with a sizeable chunk of rows in the grouping) can give you a really bad day. Just be sure that your production data is as balanced as your test data, and that your test is in fact representative of the "real" data.
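
A quick way to check the shape of the data (a sketch against the #t test table; swap in the real table): list the biggest id groups, since the triangular join's cost grows roughly with the square of the largest group.

SELECT TOP 20 id, COUNT(*) AS rows_in_group
FROM   #t
GROUP BY id
ORDER BY COUNT(*) DESC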

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Derek Dongray (4/16/2008)


    There was an error here...

    rbarryyoung (4/15/2008)


    Join #t B ON ( a.id=b.id and a.date b.st

    What should the line have been?

    Darn angle bracket filters! (blankety-blank!)

    Hope this works:

Join #t B ON ( a.id=b.id and a.date < b.date and a.st <> b.st

-- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."

  • Matt Miller (4/16/2008)


For what it's worth - the update method for forcing order will fall apart with a parallel plan, so don't change the MAXDOP on the test I gave you; that might explain the difference in time.

A parallel plan will wreck any guarantees as to order during the updates, so you might care to check the accuracy of your results if any of your methods rely on that. You may find that you're getting very fast but, unfortunately, very inaccurate results.

    I'm not sure I understand this.

Our server has the 'Max degree of parallelism' parameter set to 0 (the default), which means it will actually use a value of 4 (the number of processors) when looking for a parallel plan. However, from what you say, the update methods that expect a particular order (using a clustered index) may produce inaccurate results if MAXDOP is not 1. So the update methods, while fast and producing the correct number of records, may need careful checking.

    Have I got that right?

    Derek
