April 17, 2007 at 12:13 am
Ooooh... Sorry Peter... didn't mean to make that sound like a challenge... seriously, I've not had the opportunity to work with SQL Server 2005, and the method you used truly intrigues me... I had to do something similar in Oracle, except I had to leave the top 2 dupes if 2 or more dupes per cat existed. I was really hoping you would run your code against the test harness I posted so we can see how SQL Server 2005 stands up to that type of code... it IS supposed to be much better and I'd really like to know.
Also, you are absolutely correct... but which row would you delete in the event of a date/time tie? There's no guarantee that the data in the other columns is also duplicated... still, I understand your point, and the code you posted will certainly resolve that in SQL Server 2005.
The only way my code would actually delete all but the latest dupe when the dates themselves are duplicated is if the RowNum were used as an additional tiebreaker:
--===== Delete all but the latest row for each SomeID according to SomeDate (takes about 11 seconds)
DELETE t1
FROM #BigTest t1,
#BigTest t2
WHERE t1.SomeID = t2.SomeID
AND t1.SomeDate <= t2.SomeDate
AND t1.RowNum < t2.RowNum
... but that only works correctly in SQL 2k if the RowNum has some chronological value (it usually does for me because of the way I load the data for a dupe check)... it won't work correctly for the test table I built for this example.
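If the RowNum can't be trusted to be chronological, the tie can be broken explicitly instead. Something like this untested sketch (just an illustration of the idea, not code from the thread) keeps, per SomeID, the latest-dated row, with RowNum settling any date ties:
--===== Untested sketch: delete any row that has a "later" rival,
--      where "later" means a greater SomeDate, or an equal SomeDate with a greater RowNum
DELETE t1
FROM #BigTest t1
WHERE EXISTS (SELECT 1
              FROM #BigTest t2
              WHERE t2.SomeID = t1.SomeID
                AND (t2.SomeDate > t1.SomeDate
                 OR (t2.SomeDate = t1.SomeDate AND t2.RowNum > t1.RowNum)))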
Like I said... it definitely was not a performance challenge... I'm truly interested in how the new ROW_NUMBER/OVER function performs in SQL Server 2005, and this seemed like the perfect opportunity.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 17, 2007 at 12:21 am
David,
Peter is correct about a potential problem with my code and, I'm sorry to say, I completely forgot about it because of the way I load data for dupe checks. If the ID and the date tie, my snippet of code will not delete all but one of the tied max dupes... I usually load raw data in such a manner that the RowNum provides a unique tiebreaker, so only unique data is left after the dupe delete. Peter's code inherently takes that into consideration...
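In 2005, the fix is just to add RowNum to the ORDER BY of Peter's ROW_NUMBER delete... an untested sketch, assuming the same #BigTest layout:
DELETE t1
FROM (
      SELECT ROW_NUMBER() OVER (PARTITION BY SomeID
                                ORDER BY SomeDate DESC, RowNum DESC) AS RecID -- RowNum breaks date ties
      FROM #BigTest
     ) AS t1
WHERE RecID > 1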
--Jeff Moden
Change is inevitable... Change for the better is not.
April 17, 2007 at 12:22 am
I didn't mean to be offensive either.
I thought it would be proper to point out that the OP wanted only one record left for each group.
The problem description is vague and the OP seems inexperienced, so I thought he might not himself be aware of this scenario of duplicate values over the composite key.
Just trying to be more helpful, I suppose.
N 56°04'39.16"
E 12°55'05.25"
April 17, 2007 at 12:36 am
Not a problem... I didn't take it offensively, and I appreciate your feedback. I still wouldn't mind finding out what the performance of your dupe check is on SQL Server 2005, because that's actually the (logically) better way to do it, and it has more utility in that you can do weird things with it, like delete all but the top 2 rows from each cat.
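For example, something along these lines (untested sketch, with a hypothetical table and Category column just to show the pattern) would keep the latest 2 rows per category:
DELETE t1
FROM (
      SELECT ROW_NUMBER() OVER (PARTITION BY Category
                                ORDER BY SomeDate DESC) AS RecID
      FROM SomeTable -- hypothetical table/column names
     ) AS t1
WHERE RecID > 2 -- keep the 2 latest rows per Category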
Thanks, Peter.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 17, 2007 at 4:22 am
This is the execution plan with your method
|--Hash Match(Inner Join, HASH:([t2].[SomeID])=([t1].[SomeID]), RESIDUAL:([tempdb].[dbo].[#BigTest].[SomeID] as [t1].[SomeID]=[tempdb].[dbo].[#BigTest].[SomeID] as [t2].[SomeID] AND [tempdb].[dbo].[#BigTest].[SomeDate] as [t1].[SomeDate]<[tempdb].[dbo].[#BigTest].[SomeDate] as [t2].[SomeDate]))
     |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#BigTest] AS [t2]))
     |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#BigTest] AS [t1]))
And this is the execution plan with my method
|--Filter(WHERE:([Expr1003]>(1)))
     |--Sequence Project(DEFINE:([Expr1003]=row_number))
          |--Compute Scalar(DEFINE:([Expr1005]=(1)))
               |--Segment
                    |--Sort(ORDER BY:([tempdb].[dbo].[#BigTest].[SomeID] ASC, [tempdb].[dbo].[#BigTest].[SomeDate] DESC))
                         |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#BigTest]))
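(For anyone wanting to reproduce these, text plans like the ones above can be captured with SQL Server's standard SHOWPLAN_TEXT setting; shown here as a sketch, not part of the original test:)
SET SHOWPLAN_TEXT ON
GO
-- the statement below is compiled and its plan returned, but it is NOT executed
DELETE t1
FROM #BigTest t1,
     #BigTest t2
WHERE t1.SomeID = t2.SomeID
  AND t1.SomeDate < t2.SomeDate
GO
SET SHOWPLAN_TEXT OFF
GO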
This is an excerpt of the test data with SomeID = 22
RowNum   SomeID   SomeValue   SomeNumber   SomeDate
335195   22       FD          91,4072      2007-12-12 02:21:28.727
270491   22       RH          50,3793      2005-08-29 20:19:36.407
669219   22       KE           8,4194      2003-06-02 16:23:10.303
Changing the DELETE to a SELECT, your method yields
RowNum   SomeID   SomeValue   SomeNumber   SomeDate
270491   22       RH          50,3793      2005-08-29 20:19:36.407
669219   22       KE           8,4194      2003-06-02 16:23:10.303
669219   22       KE           8,4194      2003-06-02 16:23:10.303
And with my method
RowNum   SomeID   SomeValue   SomeNumber   SomeDate                  RecID
270491   22       RH          50,3793      2005-08-29 20:19:36.407   2
669219   22       KE           8,4194      2003-06-02 16:23:10.303   3
As you can see, your method produces a lot of duplicate rows in the set to be deleted.
In my batch, the total number of records to delete was 249,486 with your code and 212,617 with my method.
That difference may explain why my method is faster...
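A quick way to see the difference is to count the candidate rows each way (a sketch against the same #BigTest table):
--===== Join method: the same row is counted once for every later-dated match it has
SELECT COUNT(*)
FROM #BigTest t1,
     #BigTest t2
WHERE t1.SomeID = t2.SomeID
  AND t1.SomeDate < t2.SomeDate

--===== ROW_NUMBER method: exactly one row per surplus duplicate
SELECT COUNT(*)
FROM (
      SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
      FROM #BigTest
     ) AS t1
WHERE RecID > 1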
N 56°04'39.16"
E 12°55'05.25"
April 17, 2007 at 4:39 am
I've tested the five ways below on a test box (single proc, 1.8 GHz / 1 GB RAM) with 10,000,000 rows per table, and cleared the buffers to begin with...
----CREATE DATABASE DPlanTest
--GO
--USE DPlanTest
--go
--
--SET statistics IO OFF
--GO
--SET NOCOUNT ON
--go
----===== Create and populate a million row test table (takes about 37 seconds)
-- -- Column "SomeDate" has a range of >=01/01/2000 <01/01/2010
-- -- That's ten years worth of dates.
-- SELECT TOP 10000000 -- indeed 10.000.000 rows / table
-- RowNum = IDENTITY(INT,1,1),
-- SomeID = CAST(RAND(CAST(NEWID() AS VARBINARY))*2000000+1 AS INT),
-- SomeValue = CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
-- + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)),
-- SomeNumber = CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY),
-- SomeDate = CAST(RAND(CAST(NEWID() AS VARBINARY))*3653.0+36524.0 AS DATETIME)
-- INTO BigTest
-- FROM Master.dbo.SysColumns sc1,
-- Master.dbo.SysColumns sc2
---- took 28 seconds to load
--
--SELECT *
--INTO BigTest2
--FROM BigTest
--
--SELECT *
--INTO BigTest3
--FROM BigTest
--
--SELECT *
--INTO BigTest4
--FROM BigTest
--
--SELECT *
--INTO BigTest5
--FROM BigTest
--
----===== Add a token primary key to simulate a real table (takes about 4 seconds)
-- ALTER TABLE BigTest
-- ADD PRIMARY KEY CLUSTERED (RowNum)
---- took 4 seconds to alter
-- ALTER TABLE BigTest2
-- ADD PRIMARY KEY CLUSTERED (RowNum)
--go
--CREATE index x_BigTest2 on BigTest2 (SomeDate)
--CREATE index x1_BigTest2 on BigTest2 (SomeID)
---- took 4 seconds
--go
-- ALTER TABLE BigTest3
-- ADD PRIMARY KEY CLUSTERED (RowNum)
--go
--CREATE index x_BigTest3 on BigTest3 (SomeDate)
--CREATE index x1_BigTest3 on BigTest3 (SomeID)
--
---- took 4 seconds
--go
--ALTER TABLE BigTest4
-- ADD PRIMARY KEY CLUSTERED (RowNum)
--go
--CREATE index x_BigTest4 on BigTest4 (SomeDate)
--CREATE index x1_BigTest4 on BigTest4 (SomeID)
--GO
-- ALTER TABLE BigTest5
-- ADD PRIMARY KEY CLUSTERED (RowNum)
--GO
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS IO ON
GO
SET NOCOUNT OFF
GO
PRINT 'Test01'
go
--===== Start a "timer"
DECLARE @StartTime DATETIME
SET @StartTime = GETDATE()
--===== Delete all but the latest row for each SomeID according to SomeDate (takes about 11 seconds)
DELETE t1
FROM BigTest t1,
BigTest t2
WHERE t1.SomeID = t2.SomeID
AND t1.SomeDate < t2.SomeDate
--===== Display the duration of the DELETE
PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
-- DROP TABLE BigTest
GO
PRINT 'Test02'
go
--===== Start a "timer"
DECLARE @StartTime DATETIME
SET @StartTime = GETDATE()
--===== Delete all but the latest row for each SomeID according to SomeDate (takes about 11 seconds)
DELETE t1
FROM BigTest2 t1,
BigTest2 t2
WHERE t1.SomeID = t2.SomeID
AND t1.SomeDate < t2.SomeDate
--===== Display the duration of the DELETE
PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
-- DROP TABLE BigTest2
GO
PRINT 'Test03'
go
--===== Start a "timer"
DECLARE @StartTime DATETIME
SET @StartTime = GETDATE()
--===== Delete all but the latest row for each SomeID according to SomeDate (takes about 11 seconds)
DELETE t1
FROM BigTest3 t1
INNER JOIN (SELECT SomeID, MAX(SomeDate) AS MaxSomeDate
            FROM BigTest3
            GROUP BY SomeID) t2
   ON t1.SomeID = t2.SomeID
  AND t1.SomeDate < t2.MaxSomeDate
--===== Display the duration of the DELETE
PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
-- DROP TABLE BigTest3
GO
PRINT 'Test04'
go
--===== Start a "timer"
DECLARE @StartTime DATETIME
SET @StartTime = GETDATE()
--===== Delete all but the latest row for each SomeID according to SomeDate (takes about 11 seconds)
DELETE t1
FROM (
      SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
      FROM BigTest4
     ) AS t1
WHERE RecID > 1
--===== Display the duration of the DELETE
PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
-- DROP TABLE BigTest4
GO
PRINT 'Test05'
go
--===== Start a "timer"
DECLARE @StartTime DATETIME
SET @StartTime = GETDATE()
--===== Delete all but the latest row for each SomeID according to SomeDate (takes about 11 seconds)
DELETE t1
FROM (
      SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
      FROM BigTest5
     ) AS t1
WHERE RecID > 1
--===== Display the duration of the DELETE
PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
-- DROP TABLE BigTest5
GO
/*
THE RESULTS
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Test01
Table 'BigTest'. Scan count 10, logical reads 24628549, physical reads 0, read-ahead reads 37150, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 4, logical reads 23493048, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(8013465 row(s) affected)
(1 row(s) affected)
00:04:30:420
Test02
Table 'BigTest2'. Scan count 2, logical reads 74911276, physical reads 64640, read-ahead reads 92198, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 24250839, physical reads 20823, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(8013465 row(s) affected)
(1 row(s) affected)
00:12:59:850
Test03
Table 'BigTest3'. Scan count 2, logical reads 74994281, physical reads 67125, read-ahead reads 89947, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 24250839, physical reads 48219, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(8013465 row(s) affected)
(1 row(s) affected)
00:12:12:900
Test04
Table 'BigTest4'. Scan count 1, logical reads 74786052, physical reads 58260, read-ahead reads 48409, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 24250839, physical reads 2759, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(8013465 row(s) affected)
(1 row(s) affected)
00:11:29:370
Test05
Table 'BigTest5'. Scan count 5, logical reads 24839987, physical reads 14885, read-ahead reads 42739, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 4, logical reads 23493048, physical reads 9, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(8013465 row(s) affected)
(1 row(s) affected)
00:03:41:270
*/
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
April 17, 2007 at 9:00 am
A lot of impressive information! Well done.
What does it tell us?
N 56°04'39.16"
E 12°55'05.25"
April 17, 2007 at 10:48 am
In short:
The delete query :
DELETE t1
FROM (
      SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
      FROM BigTest5
     ) AS t1
WHERE RecID > 1
is the fastest one when the operation can be performed with a full table scan (no extra indexes, only the primary key, which is useless for this operation).
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
April 17, 2007 at 5:30 pm
Very cool... thanks for taking the time, ALZDBA!
--Jeff Moden
Change is inevitable... Change for the better is not.