DELETE all records except most recent

  • Ooooh... sorry, Peter... I didn't mean to make that sound like a challenge... seriously, I've not had the opportunity to work with SQL Server 2005, and the method you used truly intrigues me... I had to do something similar in Oracle, except I had to leave the top 2 dupes if 2 or more dupes per cat existed.  I was really hoping you would run your code against the test harness I posted so we could see how SQL Server 2005 stands up to that type of code... it IS supposed to be much better, and I'd really like to know.

    Also, you are absolutely correct... but which row would you delete in the event of a datetime tie?  There's no guarantee that the data in the other columns is also duplicated... still, I understand your point, and the code you posted will certainly resolve that in SQL Server 2005.

    The only way my code would actually delete all but the latest dupe when the dates themselves are duplicated is if the row number were used as an additional tiebreaker.

    --===== Delete all but the latest row for each SomeID according to SomeDate (takes about 11 seconds)
     DELETE t1
       FROM #BigTest t1,
            #BigTest t2
      WHERE t1.SomeID    = t2.SomeID
        AND t1.SomeDate <= t2.SomeDate
        AND t1.RowNum    < t2.RowNum

    ... but that only works correctly in SQL 2k if the RowNum has some chronological value (usually does for me because of the way I load the data for a dupe check)... won't work correctly for the test table I built for this example.
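    For anyone without a SQL Server instance handy, here is a minimal sketch of that same tiebreaker predicate using Python's built-in sqlite3 module. The table and values are made up for illustration, and because SQLite has no `DELETE ... FROM` join syntax, the doomed rows are collected first and then deleted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE BigTest (RowNum INTEGER PRIMARY KEY, SomeID INT, SomeDate TEXT)")
cur.executemany("INSERT INTO BigTest VALUES (?, ?, ?)", [
    (1, 22, '2003-06-02'),
    (2, 22, '2005-08-29'),
    (3, 22, '2005-08-29'),  # duplicate date: RowNum acts as the tiebreaker
    (4, 31, '2007-12-12'),
])

# Same predicate as the T-SQL above: a row is doomed if another row with the
# same SomeID exists that is at least as new AND has a higher RowNum.
doomed = {r[0] for r in cur.execute("""
    SELECT t1.RowNum
      FROM BigTest t1, BigTest t2
     WHERE t1.SomeID    = t2.SomeID
       AND t1.SomeDate <= t2.SomeDate
       AND t1.RowNum    <  t2.RowNum
""")}
cur.executemany("DELETE FROM BigTest WHERE RowNum = ?", [(r,) for r in doomed])

survivors = cur.execute("SELECT RowNum, SomeID FROM BigTest ORDER BY SomeID").fetchall()
print(survivors)  # one survivor per SomeID, even with the tied dates
```

    Even with the tied dates for SomeID 22, exactly one row per ID survives, because RowNum breaks the tie.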

    Like I said... definitely was not a performance challenge... I'm truly interested in how the new ROW_NUMBER/OVER function performs in SQL Server 2005, and this seemed like the perfect opportunity.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • David,

    Peter is correct about a potential problem with my code and, I'm sorry to say, I completely forgot about it because of the way I load data for dupe checks.  If the ID and the DATE tie, my snippet of code will not delete all-but-one of the tied max dupes... I usually load raw data in such a manner that the rownum provides the unique tie breaker so only unique data is left after the dupe delete.  Peter's code inherently takes that into consideration...

    --Jeff Moden



  • I didn't mean to be offensive either.

    I thought it would be proper to point out that the OP wanted only one record left for each group.

    The problem description was vague, and the OP seemed inexperienced, so I suspected he wasn't aware of the scenario where duplicate values occur over the composite key.

    Just trying to be more helpful, I suppose.


    N 56°04'39.16"
    E 12°55'05.25"

  • Not a problem... I didn't take it offensively, and I appreciate your feedback.  I still wouldn't mind finding out how your dupe check performs on SQL Server 2005, because that's actually the (logically) better way to do it, and it has more utility in that you can do weird things with it, like deleting all but the top 2 rows from each cat.

    Thanks, Peter.

    --Jeff Moden



  • This is the execution plan with your method

      |--Hash Match(Inner Join, HASH:([t2].[SomeID])=([t1].[SomeID]), RESIDUAL:([tempdb].[dbo].[#BigTest].[SomeID] as [t1].[SomeID]=[tempdb].[dbo].[#BigTest].[SomeID] as [t2].[SomeID] AND [tempdb].[dbo].[#BigTest].[SomeDate] as [t1].[SomeDate]<[tempdb].[dbo].[#BigTest].[SomeDate] as [t2].[SomeDate]))
           |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#BigTest] AS [t2]))
           |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#BigTest] AS [t1]))

    And this is the execution plan with my method

      |--Filter(WHERE:([Expr1003]>(1)))
           |--Sequence Project(DEFINE:([Expr1003]=row_number))
                |--Compute Scalar(DEFINE:([Expr1005]=(1)))
                     |--Segment
                          |--Sort(ORDER BY:([tempdb].[dbo].[#BigTest].[SomeID] ASC, [tempdb].[dbo].[#BigTest].[SomeDate] DESC))
                               |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#BigTest]))

    This is an excerpt of the test data with SomeID = 22

    RowNum SomeID SomeValue SomeNumber SomeDate
    335195 22     FD        91,4072    2007-12-12 02:21:28.727
    270491 22     RH        50,3793    2005-08-29 20:19:36.407
    669219 22     KE         8,4194    2003-06-02 16:23:10.303

    Changing DELETE to SELECT yields for your method

    RowNum SomeID SomeValue SomeNumber SomeDate
    270491 22     RH        50,3793    2005-08-29 20:19:36.407
    669219 22     KE         8,4194    2003-06-02 16:23:10.303
    669219 22     KE         8,4194    2003-06-02 16:23:10.303

    And with my method

    RowNum SomeID SomeValue SomeNumber SomeDate                RecID
    270491 22     RH        50,3793    2005-08-29 20:19:36.407 2
    669219 22     KE         8,4194    2003-06-02 16:23:10.303 3

    As you can see, your method creates a lot of duplicate rows.

    In my batch, the total number of records to delete was 249,486 with your code, and 212,617 with my method.

    This may help explain the time difference and why my method is faster...
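    That duplicate-row effect can be reproduced on a toy scale. Here is a sketch using Python's sqlite3 module, with the three rows mirroring the SomeID = 22 excerpt above (dates shortened for brevity); the self-join emits one candidate row per (older, newer) pair, so the candidate list is a multiset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE BigTest (RowNum INTEGER PRIMARY KEY, SomeID INT, SomeDate TEXT)")
cur.executemany("INSERT INTO BigTest VALUES (?, ?, ?)", [
    (335195, 22, '2007-12-12'),
    (270491, 22, '2005-08-29'),
    (669219, 22, '2003-06-02'),
])

# The oldest row (669219) matches BOTH newer rows, so it is emitted twice.
candidates = cur.execute("""
    SELECT t1.RowNum
      FROM BigTest t1, BigTest t2
     WHERE t1.SomeID   = t2.SomeID
       AND t1.SomeDate < t2.SomeDate
""").fetchall()

print(len(candidates))       # 3 candidate rows...
print(len(set(candidates)))  # ...for only 2 distinct rows to delete
```

    The DELETE still removes each row only once, but the join has to materialize and process every redundant pair, which is where the extra work goes.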

     



  • I've tested the various methods on a test box (single proc, 1.8 GHz, 1 GB RAM) with 10,000,000 rows per table, clearing the buffers to begin with...

    ----CREATE DATABASE DPlanTest
    --GO
    --USE DPlanTest
    --go
    --
    --SET STATISTICS IO OFF
    --GO
    --SET NOCOUNT ON
    --go
    ----===== Create and populate a ten million row test table (takes about 37 seconds)
    --     -- Column "SomeDate" has a range of >=01/01/2000 <01/01/2010
    --     -- That's ten years worth of dates.
    -- SELECT TOP 10000000 -- indeed 10,000,000 rows / table
    --        RowNum     = IDENTITY(INT,1,1),
    --        SomeID     = CAST(RAND(CAST(NEWID() AS VARBINARY))*2000000+1 AS INT),
    --        SomeValue  = CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
    --                   + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)),
    --        SomeNumber = CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY),
    --        SomeDate   = CAST(RAND(CAST(NEWID() AS VARBINARY))*3653.0+36524.0 AS DATETIME)
    --   INTO BigTest
    --   FROM Master.dbo.SysColumns sc1,
    --        Master.dbo.SysColumns sc2
    ---- took 28 seconds to load
    --
    --SELECT * INTO BigTest2 FROM BigTest
    --SELECT * INTO BigTest3 FROM BigTest
    --SELECT * INTO BigTest4 FROM BigTest
    --SELECT * INTO BigTest5 FROM BigTest
    --
    ----===== Add a token primary key to simulate a real table (takes about 4 seconds)
    -- ALTER TABLE BigTest  ADD PRIMARY KEY CLUSTERED (RowNum)
    ---- took 4 seconds to alter
    -- ALTER TABLE BigTest2 ADD PRIMARY KEY CLUSTERED (RowNum)
    --go
    --CREATE INDEX x_BigTest2  ON BigTest2 (SomeDate)
    --CREATE INDEX x1_BigTest2 ON BigTest2 (SomeID)
    ---- took 4 seconds
    --go
    -- ALTER TABLE BigTest3 ADD PRIMARY KEY CLUSTERED (RowNum)
    --go
    --CREATE INDEX x_BigTest3  ON BigTest3 (SomeDate)
    --CREATE INDEX x1_BigTest3 ON BigTest3 (SomeID)
    ---- took 4 seconds
    --go
    --ALTER TABLE BigTest4 ADD PRIMARY KEY CLUSTERED (RowNum)
    --go
    --CREATE INDEX x_BigTest4  ON BigTest4 (SomeDate)
    --CREATE INDEX x1_BigTest4 ON BigTest4 (SomeID)
    --GO
    -- ALTER TABLE BigTest5 ADD PRIMARY KEY CLUSTERED (RowNum)
    --GO
    DBCC DROPCLEANBUFFERS
    GO
    SET STATISTICS IO ON
    GO
    SET NOCOUNT OFF
    go
    PRINT 'Test01'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM BigTest t1,
           BigTest t2
     WHERE t1.SomeID = t2.SomeID
       AND t1.SomeDate < t2.SomeDate
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest
    GO
    PRINT 'Test02'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM BigTest2 t1,
           BigTest2 t2
     WHERE t1.SomeID = t2.SomeID
       AND t1.SomeDate < t2.SomeDate
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest2
    GO
    PRINT 'Test03'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM BigTest3 t1
     INNER JOIN (SELECT SomeID, MAX(SomeDate) AS MaxSomeDate
                   FROM BigTest3
                  GROUP BY SomeID) t2
        ON t1.SomeID = t2.SomeID
       AND t1.SomeDate < t2.MaxSomeDate
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest3
    GO
    PRINT 'Test04'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM (SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
              FROM BigTest4) AS t1
     WHERE RecID > 1
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest4
    GO
    PRINT 'Test05'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM (SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
              FROM BigTest5) AS t1
     WHERE RecID > 1
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest5
    GO

    /*
    THE RESULTS

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    Test01
    Table 'BigTest'. Scan count 10, logical reads 24628549, physical reads 0, read-ahead reads 37150, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 4, logical reads 23493048, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:04:30:420

    Test02
    Table 'BigTest2'. Scan count 2, logical reads 74911276, physical reads 64640, read-ahead reads 92198, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 2, logical reads 24250839, physical reads 20823, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:12:59:850

    Test03
    Table 'BigTest3'. Scan count 2, logical reads 74994281, physical reads 67125, read-ahead reads 89947, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 2, logical reads 24250839, physical reads 48219, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:12:12:900

    Test04
    Table 'BigTest4'. Scan count 1, logical reads 74786052, physical reads 58260, read-ahead reads 48409, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 2, logical reads 24250839, physical reads 2759, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:11:29:370

    Test05
    Table 'BigTest5'. Scan count 5, logical reads 24839987, physical reads 14885, read-ahead reads 42739, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 4, logical reads 23493048, physical reads 9, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:03:41:270
    */

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • A lot of impressive information! Well done.

    What does it say?

     



  • In short:

    The delete query:

    DELETE t1
      FROM (SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
              FROM BigTest5) AS t1
     WHERE RecID > 1

    is the fastest when the operation can be carried out with a plain table scan (no extra indexes beyond the primary key, which is useless for this operation).
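    SQLite (3.25+) supports the same window function, so the winning pattern can be sketched with Python's sqlite3 module. Since SQLite cannot delete from a derived table the way T-SQL can, this sketch routes the row numbers through an IN subquery instead; a RowNum tiebreaker is also added to the ORDER BY (an addition, echoing the tie discussion earlier in the thread) so the survivor is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE BigTest5 (RowNum INTEGER PRIMARY KEY, SomeID INT, SomeDate TEXT)")
cur.executemany("INSERT INTO BigTest5 VALUES (?, ?, ?)", [
    (1, 22, '2003-06-02'),
    (2, 22, '2005-08-29'),
    (3, 22, '2005-08-29'),  # tied latest date for SomeID 22
    (4, 31, '2007-12-12'),
])

# Number the rows newest-first within each SomeID and delete everything
# past the first; RowNum DESC breaks date ties deterministically.
cur.execute("""
    DELETE FROM BigTest5
     WHERE RowNum IN (
           SELECT RowNum
             FROM (SELECT RowNum,
                          ROW_NUMBER() OVER (PARTITION BY SomeID
                                             ORDER BY SomeDate DESC, RowNum DESC) AS RecID
                     FROM BigTest5)
            WHERE RecID > 1)
""")

survivors = cur.execute("SELECT RowNum, SomeID FROM BigTest5 ORDER BY SomeID").fetchall()
print(survivors)  # exactly one row left per SomeID
```

    One pass over the table, one sort, and no self-join, which matches why the execution plan above shows only a single clustered index scan.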

    Johan


  • Very cool... thanks for taking the time, ALZDBA!

    --Jeff Moden


