April 14, 2008 at 2:53 pm
Derek Dongray (4/14/2008)
Including the update, this is 50% faster. Excluding the update, it's 100% faster. The only thing I'm dubious about is whether the update will always get the records in the right order.
If there is a clustered index on the table over columns (id, date) it will.
CREATE CLUSTERED INDEX IX_t ON #t (id, date)
You can also put the UPDATE part in a trigger and update only the records for the affected id.
It doesn't matter which seq value each row gets, as long as seq is unique within each id group.
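To make that concrete, here is a minimal sketch of the kind of ordered, sequence-numbering UPDATE being discussed (the seq column and variable names are assumed; this is illustrative only, not the exact code from earlier in the thread):

-- Illustrative sketch only: number rows per id by walking the clustered index order.
-- TABLOCKX, the INDEX hint and MAXDOP 1 keep the update serial and in (id, date) order.
CREATE TABLE #t (id INT, st CHAR(1), date DATETIME, seq INT)
CREATE CLUSTERED INDEX IX_t ON #t (id, date)

DECLARE @seq INT, @previd INT
SELECT @seq = 0, @previd = NULL

UPDATE #t
SET    @seq    = seq = CASE WHEN id = @previd THEN @seq + 1 ELSE 1 END,
       @previd = id
FROM   #t WITH (TABLOCKX, INDEX(IX_t))
OPTION (MAXDOP 1)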
N 56°04'39.16"
E 12°55'05.25"
April 15, 2008 at 6:39 am
Anybody leave part of this cow for me?
This seems to me to be the most obvious approach, though perhaps not the fastest (when I've needed this kind of query to be fast, I've always used the SELECT..INTO technique that Jeff showed to add a sequence ID to the data set):
set nocount off
set statistics io on
set statistics time on
Select a.id, a.st as sta, a.date as da, b.st as stb, b.date as db
From #t A
Join #t B ON ( a.id=b.id and a.date b.st
And Not Exists (Select * from #t b2
Where a.id=b2.id -- and/or c.id=b.id
And a.st<>b2.st
And a.date < b2.date
and b2.date < b.date
) )
Where Not Exists (Select * from #t a2
Where a.id=a2.id
And a.st=a2.st
And a.date < a2.date
And a2.date < b.date
)
Order By a.id, da
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 15, 2008 at 9:40 am
I created a couple of test queries using the real data based on Peso's method and using row_number().
After several tests, Peso's method was fractionally faster (41015ms) than the triangular/diagonal join method (45468ms) in producing 36820 paired status records from input of 80351 records. The row_number() method completely bombed with 660685ms for some reason! I checked the implementation several times to see if I'd got something wrong, but eventually gave up.
For Peso's method, if I add the overhead of setting the Seq field to null (656ms) and updating it (891ms) then the gain is less (42562ms). Adding an index on Seq fractionally improves the query (40547ms) but significantly worsens the clearing (3064ms) and update (3297ms) making it slower overall (46908ms). Since it's basically doing sequential processing, it's not a surprise that an index doesn't help.
The conclusion is that the triangular/diagonal join isn't the problem it's perceived to be and I might just as well leave the system using the working code.
Thanks to all who took a look at it for me.
Derek
April 15, 2008 at 9:53 am
Can the data generator code you provided be used to generate that many rows? I'd like to "play" some more because 41 to 45 seconds to produce "36820 paired status records from input of 80351 records" absolutely sucks no matter which way you do it. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
April 15, 2008 at 10:29 am
Jeff Moden (4/15/2008)
Can the data generator code you provided be used to generate that many rows? I'd like to "play" some more because 41 to 45 seconds to produce "36820 paired status records from input of 80351 records" absolutely sucks no matter which way you do it. 😉
I agree, but I don't think the problem with the real selection is in the algorithm.
The test generator can do any number of records. Here are the results from running 200000 records on my laptop:
200000 records generated
>>>> original (triangular)
(806874 row(s) affected)
SQL Server Execution Times:
CPU time = 17625 ms, elapsed time = 1773935 ms.
>>>> using row_number()
(806874 row(s) affected)
SQL Server Execution Times:
CPU time = 13000 ms, elapsed time = 1547006 ms.
>>>> using dense_rank()
(806874 row(s) affected)
SQL Server Execution Times:
CPU time = 12765 ms, elapsed time = 1380089 ms.
>>>> peso
>Update
SQL Server Execution Times:
CPU time = 2984 ms, elapsed time = 184522 ms.
(1045946 row(s) affected)
>select
(806874 row(s) affected)
SQL Server Execution Times:
CPU time = 7703 ms, elapsed time = 1148723 ms.
>>>> Jeff (copy) [uncorrected for minor error]
>copy
SQL Server Execution Times:
CPU time = 1391 ms, elapsed time = 98708 ms.
(1045946 row(s) affected)
>select [slight error here]
(845946 row(s) affected)
SQL Server Execution Times:
CPU time = 3985 ms, elapsed time = 999281 ms.
The problem with the 'real' data is that I need to work out working day corrections dependent on country and timezone.
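(Purely to illustrate the sort of per-pair lookup involved - the table and column names below are assumed, not the actual schema - a working-day count against a calendar table looks roughly like this:)

-- hypothetical sketch, assumed schema: one row per working day per country
SELECT COUNT(*) AS WorkingDaysBetween
FROM   WorkingDays wd
WHERE  wd.CountryCode = 'GB'
  AND  wd.WorkDate >  '20080201'
  AND  wd.WorkDate <= '20080215'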
Here are the results from one of the 'full' test runs (zero values from reads removed).
Existing method using triangular/diagonal join
-------- ------ ----- ------------------- ----
Table 'tblClient'. Logical reads 182.
Table 'tblCompany'. Scan count 1, logical reads 2.
Table 'Worktable'. Scan count 42, logical reads 75318.
Table 'tblWorkFlow'. Scan count 30, logical reads 34608.
Table 'tblCountry'. Scan count 12, logical reads 24.
Table 'WorkingDays'. Scan count 36825, logical reads 122608.
SQL Server Execution Times:
CPU time = 45468 ms, elapsed time = 12836 ms.
(36820 row(s) affected)
-------------------------------------------------------------
'Peso' method using seq field on tblWorkflow (indexed seq)
------ ------ ----- --- ----- -- -----------
Step 1: set seq = null
-------
Table 'Worktable'. Scan count 4, logical reads 163390.
Table 'tblWorkFlow'. Scan count 5, logical reads 618352.
SQL Server Execution Times:
CPU time = 3064 ms, elapsed time = 2424 ms.
(80351 row(s) affected)
Step 2: set seq=seq+1
-------
Table 'Worktable'. Scan count 4, logical reads 164118.
Table 'tblWorkFlow'. Scan count 5, logical reads 628685.
SQL Server Execution Times:
CPU time = 3297 ms, elapsed time = 3069 ms.
(80351 row(s) affected)
Step 3: select
-------
Table 'tblClient'. Logical reads 182.
Table 'tblCompany'. Scan count 1, logical reads 2.
Table 'Worktable'. Scan count 42, logical reads 75318.
Table 'tblWorkFlow'. Scan count 20, logical reads 7782.
Table 'WorkingDays'. Scan count 10, logical reads 1178.
Table 'tblCountry'. Scan count 12, logical reads 24.
SQL Server Execution Times:
CPU time = 40547 ms, elapsed time = 12635 ms.
-------
(36820 row(s) affected)
Totals:
CPU time = 46908 ms, elapsed time = 18128 ms.
-------------------------------------------------------------
'Peso' method using seq field on tblWorkflow (no index on seq)
------ ------ ----- --- ----- -- -----------
Step 1: set seq = null
-------
Table 'tblWorkFlow'. Scan count 1, logical reads 246418.
SQL Server Execution Times:
CPU time = 656 ms, elapsed time = 654 ms.
(80351 row(s) affected)
Step 2: set seq=seq+1
-------
Table 'tblWorkFlow'. Scan count 1, logical reads 246847.
SQL Server Execution Times:
CPU time = 891 ms, elapsed time = 901 ms.
(80351 row(s) affected)
Step 3: select
-------
Table 'tblClient'. Logical reads 182.
Table 'tblCompany'. Scan count 1, logical reads 2.
Table 'Worktable'. Scan count 42, logical reads 75318.
Table 'tblWorkFlow'. Scan count 20, logical reads 42654.
Table 'WorkingDays'. Scan count 10, logical reads 1196.
Table 'tblCountry'. Scan count 12, logical reads 24.
SQL Server Execution Times:
CPU time = 41015 ms, elapsed time = 12322 ms.
(36820 row(s) affected)
Totals:
CPU time = 42562 ms, elapsed time = 13877 ms.
-------------------------------------------------------------
Using CTEs and row_number()
----- ---- --- ------------
Table 'WorkingDays'. Scan count 10, logical reads 1352.
Table 'tblClient'. Logical reads 182.
Table 'tblCompany'. Scan count 1, logical reads 2.
Table 'Worktable'. Scan count 12.
Table 'tblWorkFlow'. Scan count 16, logical reads 77194.
Table 'tblCountry'. Scan count 12, logical reads 24.
SQL Server Execution Times:
CPU time = 660685 ms, elapsed time = 281187 ms.
(36820 row(s) affected)
As you can see, the scan count on the table 'WorkingDays' in the first query is ridiculous. This will be the next area of investigation as I believe it should be possible to significantly improve this.
Derek
April 15, 2008 at 10:38 am
Perhaps I'm missing some of the subtleties here, but this processes 350,000 pairs out of 500,000 records in 17 seconds.
[font="Courier New"]--clean up the environment
DROP TABLE #t
DROP TABLE #matt
DROP TABLE #matt2
GO
--create test data
CREATE TABLE #t (id INT , st CHAR(1), date DATETIME)
INSERT #t (st,date,id)
SELECT TOP 250000 CHAR(CAST(RAND(checksum(NEWID()))*3 AS INT)+65),
DATEADD(mi,RAND(checksum(NEWID()))*1440.0*28,CONVERT(DATETIME,'2008/2/1')),
CAST(RAND(checksum(NEWID()))*1000+1 AS integer)
FROM sys.all_columns sc1, sys.all_columns sc2
GO
--start the processing
DECLARE @g DATETIME
SET @g=GETDATE()
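-- dedupe the raw rows into #matt and add a grp column to be filled in by the update below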
SELECT DISTINCT *, CAST(0 AS INT) AS grp
INTO #matt
FROM #t
CREATE UNIQUE CLUSTERED INDEX pkM ON #matt(id,date,st)
DECLARE @prevID INT
SET @previd=0;
DECLARE @prevgrp INT
SET @prevgrp=0;
DECLARE @prevstat CHAR
SET @prevstat='';
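-- "quirky update": walk #matt in clustered (id, date, st) order carrying running variables;
-- grp restarts at 1 for each new id and increments whenever the status changes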
UPDATE #matt
SET @prevgrp=grp=CASE WHEN @previd<>ID THEN 1
WHEN @prevstat<>st THEN @prevgrp+1
ELSE @prevgrp END,
@previd=id,
@prevstat=st
FROM #matt WITH (tablockX, INDEX(pkm))
OPTION (MAXDOP 1)
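-- collapse each (id, grp, st) run to a single row, keeping its latest date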
SELECT id,grp,st,MAX(date) lastdte
INTO #matt2
FROM #matt
GROUP BY id,grp,st
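-- pair each status group with the next group for the same id (the last group pairs with NULLs)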
SELECT m1.id, m1.st oldst,
m1.lastdte oldlast,
m2.st newst,
m2.lastdte newlast
FROM #matt2 m1
LEFT OUTER JOIN #matt2 m2 ON m1.id=m2.id AND m1.grp=m2.grp-1
ORDER BY m1.id,m2.grp
SELECT DATEDIFF(ms, @g, GETDATE())
GO
--clean up the environment
DROP TABLE #t
DROP TABLE #matt
DROP TABLE #matt2
100000 goes in under 5 seconds.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 15, 2008 at 10:41 am
...actually - by adding a clustered index right before that last results query - it drops to 14 seconds (14221 ms) for 500,000 rows.
create unique clustered index #pkm2 on #matt2(id,grp)
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 16, 2008 at 5:09 am
I ran some tests measuring the elapsed (from datediff(ms,,getdate())) and cpu (from @@cpu_busy*@@timeticks) times on both my laptop and the server where this application resides.
I generated 500,000 series, which produced about 830,000 pairs, using 5 methods (Jeff's doesn't ignore unchanged states, so I left it out). Output was selected into temp tables (which improves times).
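(For reference, a minimal sketch of the timing harness described above; the variable names are mine, and @@CPU_BUSY/@@TIMETICKS are server-wide counters, so the CPU figure is only approximate.)

-- rough timing wrapper, illustrative only
DECLARE @start DATETIME, @cpu0 INT
SELECT  @start = GETDATE(), @cpu0 = @@CPU_BUSY

-- ... run the method under test here ...

SELECT  DATEDIFF(ms, @start, GETDATE())           AS elapsed_ms,
        (@@CPU_BUSY - @cpu0) * @@TIMETICKS / 1000 AS cpu_ms  -- ticks x microseconds per tick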
Laptop:
Test 1:
what elapsed cpu
peso 24203 14093
derek 2 (row_num) 27716 15750
derek 3 (dense_rank) 23876 16437
derek 1 (tri/diag join) 25546 17218
matt 26516 19937
Test 2:
what elapsed cpu
peso 30200 14968
derek 2 (row_num) 21436 15531
derek 3 (dense_rank) 22440 16062
derek 1 (tri/diag join) 36453 19031
matt 24173 20093
Server:
Test 1:
what elapsed cpu
matt 23546 28406
peso 15623 29656
derek 1 (tri/diag join) 13186 35343
derek 2 (row_num) 21250 41968
derek 3 (dense_rank) 20923 42500
Test 2:
what elapsed cpu
peso 14170 28531
matt 24806 28718
derek 1 (tri/diag join) 13560 35406
derek 2 (row_num) 19196 41718
derek 3 (dense_rank) 19603 42125
Times in ms.
On the server, Matt's and Peso's take roughly the same CPU time, but Peso's is faster on elapsed time (presumably because it takes advantage of multiple processors). However, even though the triangular join method takes more CPU time, it obviously makes even better use of the multiple processors and consistently finishes faster.
I don't understand why the laptop consistently uses less CPU time. Presumably a better execution plan is chosen for a uni-processor.
Is there some way to force a plan for a given number of processors?
Hardware:
Laptop has 2.0Ghz Celeron (M) + 1.5Gb memory
Server has 4*3.4Ghz Xeon + 4.0Gb memory
Derek
April 16, 2008 at 5:13 am
Are those timings for Peso's method for both the UPDATE and the SELECT?
What are the times for just the SELECT, if the UPDATEs are already done with a trigger?
You can hint a query with MAXDOP.
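For example (illustrative only, using the #t test table from the scripts in this thread):

SELECT id, st, date
FROM   #t
OPTION (MAXDOP 1)  -- or MAXDOP 2, 4, ... to allow that many schedulers for this one statement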
N 56°04'39.16"
E 12°55'05.25"
April 16, 2008 at 7:23 am
You should try the query that I posted; it is also highly parallel.
If you need a working version of Jeff's query, here is one that I modified but is only about 70% as fast as the original:
set nocount off
set statistics io on
set statistics time on
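-- step 1: copy #t into #MyHead with a row number assigned in (ID, Date) order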
SELECT IDENTITY(INT,1,1) AS RowNum,
ID,ST,Date
INTO #MyHead
FROM #t
ORDER BY ID,DATE
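-- step 2: keep only the rows whose status differs from the immediately preceding row
-- for the same ID (or that start a new ID), and number those changes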
Select IDENTITY(INT,1,1) AS ChangeNum
, Cast(RowNum as int) as RowNum, ID, ST, Date
Into #StChange
From #MyHead H1
Where NOT Exists (Select * From #MyHead H2
Where H1.RowNum = H2.RowNum+1
And H1.St = H2.St
And H1.ID = H2.ID)
Order By RowNum
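-- step 3: pair each status change with the next change for the same ID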
SELECT t1.ID,
t1.ST AS STa,
t1.Date AS DateA,
t2.ST AS STb,
t2.Date AS DateB
FROM #StChange t1
JOIN #StChange t2
ON t1.ChangeNum+1 = t2.ChangeNum
And t1.ID = t2.ID
ORDER BY t1.ID
print 'done.'
Drop table #MyHead
Drop table #StChange
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 16, 2008 at 7:34 am
For what it's worth - the update method for forcing order will fall apart with a parallel plan, so don't change the MAXDOP on the test I gave you. That might explain the difference in time.
A parallel plan will wreck any guarantees as to order during the updates, so you might care to check the accuracy of your results if any of your methods rely on that order. You may find that you're getting very fast but unfortunately very inaccurate results...
For more info on that - take a look at some of the tests Jeff ran here:
http://www.sqlservercentral.com/articles/Advanced+Querying/61716/
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 16, 2008 at 7:39 am
There was an error here...
rbarryyoung (4/15/2008)
Join #t B ON ( a.id=b.id and a.date b.st
What should the line have been?
Derek
April 16, 2008 at 7:42 am
Also - one more thought for you. The "triangular join" processes will tend to work well if ALL of the sequences/groups are small. However, even just one or two outlier groups (with a sizeable chunk of rows in the group) can give you a really bad day. Just be sure that your production data is as balanced as your test data, and that your test is in fact representative of the "real" data.
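To put rough numbers on that (back-of-envelope arithmetic only): a group of n rows drives on the order of n*(n+1)/2 comparisons in a triangular join, so group size matters far more than total row count.

-- illustrative arithmetic: one big group vs one small group
SELECT 100   * 101   / 2 AS comparisons_100_row_group,    -- 5,050
       10000 * 10001 / 2 AS comparisons_10000_row_group   -- 50,005,000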
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 16, 2008 at 9:54 am
Derek Dongray (4/16/2008)
There was an error here...rbarryyoung (4/15/2008)
Join #t B ON ( a.id=b.id and a.date b.st
What should the line have been?
Darn angle bracket filters! (blankety-blank!)
Hope this works:
Join #t B ON ( a.id=b.id and a.date < b.date and a.st <> b.st
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 16, 2008 at 11:30 am
Matt Miller (4/16/2008)
For what it's worth - the update method for forcing order will fall apart with a parallel plan, so don't change the MAXDOP on the test I gave you. Which might explain the difference in time.A parallel plan will wreck any guarantees as to order during the updates, so you might care to check the accuracy of your results if any of your methods use that. You may unfortunately find that you're getting very fast and unfortunately very inaccurate results.......
I'm not sure I understand this.
Our server has the 'Max degree of parallelism' parameter set to 0 (default), which means it will actually use a value of 4 (the number of processors) when looking for a parallel plan. However, from what you say, the update methods that expect a particular order (using a clustered index) may produce inaccurate results if MAXDOP is not 1. So the update methods, while fast and producing the correct number of records, may need careful checking.
Have I got that right?
Derek