DELETE all records except most recent

  • Ooooh... sorry, Peter... I didn't mean to make that sound like a challenge... seriously, I've not had the opportunity to work with SQL Server 2005, and the method you used truly intrigues me... I had to do something similar in Oracle, except I had to leave the top 2 dupes if 2 or more dupes per cat existed.  I was really hoping you would run your code against the test harness I posted so we could see how SQL Server 2005 stands up to that type of code... it IS supposed to be much better, and I'd really like to know.

    Also, you are absolutely correct... but which row would you delete in the event of a datetime tie?  There's no guarantee that the data in the other columns is also duplicated... still, I understand your point, and the code you posted will certainly resolve that in SQL Server 2005.

    The only way my code would actually delete all but the latest dupe when the dates themselves are duplicated is if the row number were used as an additional tiebreaker.

    --===== Delete all but the latest row for each SomeID according to SomeDate (takes about 11 seconds)
     DELETE t1
       FROM #BigTest t1,
            #BigTest t2
      WHERE t1.SomeID    = t2.SomeID
        AND t1.SomeDate <= t2.SomeDate
        AND t1.RowNum    < t2.RowNum

    ... but that only works correctly in SQL 2k if the RowNum has some chronological value (usually does for me because of the way I load the data for a dupe check)... won't work correctly for the test table I built for this example.
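    For anyone without a SQL Server instance handy, here is a minimal sketch of that same tiebreaker predicate using Python's built-in sqlite3 module. The table and values are made up for illustration, and because SQLite has no `DELETE ... FROM` join syntax, the doomed rows are collected first and then deleted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE BigTest (RowNum INTEGER PRIMARY KEY, SomeID INT, SomeDate TEXT)")
cur.executemany("INSERT INTO BigTest VALUES (?, ?, ?)", [
    (1, 22, '2003-06-02'),
    (2, 22, '2005-08-29'),
    (3, 22, '2005-08-29'),  # duplicate date: RowNum acts as the tiebreaker
    (4, 31, '2007-12-12'),
])

# Same predicate as the T-SQL above: a row is doomed if another row with the
# same SomeID exists that is at least as new AND has a higher RowNum.
doomed = {r[0] for r in cur.execute("""
    SELECT t1.RowNum
      FROM BigTest t1, BigTest t2
     WHERE t1.SomeID    = t2.SomeID
       AND t1.SomeDate <= t2.SomeDate
       AND t1.RowNum    <  t2.RowNum
""")}
cur.executemany("DELETE FROM BigTest WHERE RowNum = ?", [(r,) for r in doomed])

survivors = cur.execute("SELECT RowNum, SomeID FROM BigTest ORDER BY SomeID").fetchall()
print(survivors)  # one survivor per SomeID, even with the tied dates
```

    Even with the tied dates for SomeID 22, exactly one row per ID survives, because RowNum breaks the tie.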

    Like I said... definitely was not a performance challenge... I'm truly interested in how the new ROW_NUMBER/OVER function performs in SQL Server 2005, and this seemed like the perfect opportunity.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • David,

    Peter is correct about a potential problem with my code and, I'm sorry to say, I completely forgot about it because of the way I load data for dupe checks.  If the ID and the DATE tie, my snippet of code will not delete all-but-one of the tied max dupes... I usually load raw data in such a manner that the rownum provides the unique tie breaker so only unique data is left after the dupe delete.  Peter's code inherently takes that into consideration...

    --Jeff Moden



  • I didn't mean to be offensive either.

    I thought it would be proper to point out that the OP wanted only one record left for each group.

    The problem description was vague, and the OP seemed inexperienced, so I suspected he wasn't aware of the scenario where duplicate values occur over the composite key.

    Just trying to be more helpful, I suppose.


    N 56°04'39.16"
    E 12°55'05.25"

  • Not a problem... I didn't take it offensively, and I appreciate your feedback.  I still wouldn't mind finding out how your dupe check performs on SQL Server 2005, because that's actually the (logically) better way to do it, and it has more utility in that you can do weird things with it, like deleting all but the top 2 rows from each cat.

    Thanks, Peter.

    --Jeff Moden



  • This is the execution plan with your method

      |--Hash Match(Inner Join, HASH:([t2].[SomeID])=([t1].[SomeID]), RESIDUAL:([tempdb].[dbo].[#BigTest].[SomeID] as [t1].[SomeID]=[tempdb].[dbo].[#BigTest].[SomeID] as [t2].[SomeID] AND [tempdb].[dbo].[#BigTest].[SomeDate] as [t1].[SomeDate]<[tempdb].[dbo].[#BigTest].[SomeDate] as [t2].[SomeDate]))
           |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#BigTest] AS [t2]))
           |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#BigTest] AS [t1]))

    And this is the execution plan with my method

      |--Filter(WHERE:([Expr1003]>(1)))
           |--Sequence Project(DEFINE:([Expr1003]=row_number))
                |--Compute Scalar(DEFINE:([Expr1005]=(1)))
                     |--Segment
                          |--Sort(ORDER BY:([tempdb].[dbo].[#BigTest].[SomeID] ASC, [tempdb].[dbo].[#BigTest].[SomeDate] DESC))
                               |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#BigTest]))

    This is an excerpt of the test data with SomeID = 22

    RowNum SomeID SomeValue SomeNumber SomeDate
    335195 22     FD        91,4072    2007-12-12 02:21:28.727
    270491 22     RH        50,3793    2005-08-29 20:19:36.407
    669219 22     KE         8,4194    2003-06-02 16:23:10.303

    Changing DELETE to SELECT yields for your method

    RowNum SomeID SomeValue SomeNumber SomeDate
    270491 22     RH        50,3793    2005-08-29 20:19:36.407
    669219 22     KE         8,4194    2003-06-02 16:23:10.303
    669219 22     KE         8,4194    2003-06-02 16:23:10.303

    And with my method

    RowNum SomeID SomeValue SomeNumber SomeDate                RecID
    270491 22     RH        50,3793    2005-08-29 20:19:36.407 2
    669219 22     KE         8,4194    2003-06-02 16:23:10.303 3

    As you can see, your method creates a lot of duplicate rows.

    In my batch, the total number of records to delete was 249,486 with your code, and 212,617 with my method.

    This may help explain the time difference and why my method is faster...
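    That duplicate-row effect can be reproduced on a toy scale. Here is a sketch using Python's sqlite3 module, with the three rows mirroring the SomeID = 22 excerpt above (dates shortened for brevity); the self-join emits one candidate row per (older, newer) pair, so the candidate list is a multiset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE BigTest (RowNum INTEGER PRIMARY KEY, SomeID INT, SomeDate TEXT)")
cur.executemany("INSERT INTO BigTest VALUES (?, ?, ?)", [
    (335195, 22, '2007-12-12'),
    (270491, 22, '2005-08-29'),
    (669219, 22, '2003-06-02'),
])

# The oldest row (669219) matches BOTH newer rows, so it is emitted twice.
candidates = cur.execute("""
    SELECT t1.RowNum
      FROM BigTest t1, BigTest t2
     WHERE t1.SomeID   = t2.SomeID
       AND t1.SomeDate < t2.SomeDate
""").fetchall()

print(len(candidates))       # 3 candidate rows...
print(len(set(candidates)))  # ...for only 2 distinct rows to delete
```

    The DELETE still removes each row only once, but the join has to materialize and process every redundant pair, which is where the extra work goes.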

     



  • I've tested the various methods on a test box (single proc, 1.8 GHz, 1 GB RAM) with 10,000,000 rows per table, clearing the buffers to begin with...

    ----CREATE DATABASE DPlanTest
    --GO
    --USE DPlanTest
    --go
    --
    --SET STATISTICS IO OFF
    --GO
    --SET NOCOUNT ON
    --go
    ----===== Create and populate a ten million row test table (takes about 37 seconds)
    --     -- Column "SomeDate" has a range of >=01/01/2000 <01/01/2010
    --     -- That's ten years worth of dates.
    -- SELECT TOP 10000000 -- indeed 10,000,000 rows / table
    --        RowNum     = IDENTITY(INT,1,1),
    --        SomeID     = CAST(RAND(CAST(NEWID() AS VARBINARY))*2000000+1 AS INT),
    --        SomeValue  = CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
    --                   + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)),
    --        SomeNumber = CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY),
    --        SomeDate   = CAST(RAND(CAST(NEWID() AS VARBINARY))*3653.0+36524.0 AS DATETIME)
    --   INTO BigTest
    --   FROM Master.dbo.SysColumns sc1,
    --        Master.dbo.SysColumns sc2
    ---- took 28 seconds to load
    --
    --SELECT * INTO BigTest2 FROM BigTest
    --SELECT * INTO BigTest3 FROM BigTest
    --SELECT * INTO BigTest4 FROM BigTest
    --SELECT * INTO BigTest5 FROM BigTest
    --
    ----===== Add a token primary key to simulate a real table (takes about 4 seconds)
    -- ALTER TABLE BigTest  ADD PRIMARY KEY CLUSTERED (RowNum)
    ---- took 4 seconds to alter
    -- ALTER TABLE BigTest2 ADD PRIMARY KEY CLUSTERED (RowNum)
    --go
    --CREATE INDEX x_BigTest2  ON BigTest2 (SomeDate)
    --CREATE INDEX x1_BigTest2 ON BigTest2 (SomeID)
    ---- took 4 seconds
    --go
    -- ALTER TABLE BigTest3 ADD PRIMARY KEY CLUSTERED (RowNum)
    --go
    --CREATE INDEX x_BigTest3  ON BigTest3 (SomeDate)
    --CREATE INDEX x1_BigTest3 ON BigTest3 (SomeID)
    ---- took 4 seconds
    --go
    --ALTER TABLE BigTest4 ADD PRIMARY KEY CLUSTERED (RowNum)
    --go
    --CREATE INDEX x_BigTest4  ON BigTest4 (SomeDate)
    --CREATE INDEX x1_BigTest4 ON BigTest4 (SomeID)
    --GO
    -- ALTER TABLE BigTest5 ADD PRIMARY KEY CLUSTERED (RowNum)
    --GO
    DBCC DROPCLEANBUFFERS
    GO
    SET STATISTICS IO ON
    GO
    SET NOCOUNT OFF
    go
    PRINT 'Test01'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM BigTest t1,
           BigTest t2
     WHERE t1.SomeID = t2.SomeID
       AND t1.SomeDate < t2.SomeDate
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest
    GO
    PRINT 'Test02'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM BigTest2 t1,
           BigTest2 t2
     WHERE t1.SomeID = t2.SomeID
       AND t1.SomeDate < t2.SomeDate
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest2
    GO
    PRINT 'Test03'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM BigTest3 t1
     INNER JOIN (SELECT SomeID, MAX(SomeDate) AS MaxSomeDate
                   FROM BigTest3
                  GROUP BY SomeID) t2
        ON t1.SomeID = t2.SomeID
       AND t1.SomeDate < t2.MaxSomeDate
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest3
    GO
    PRINT 'Test04'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM (SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
              FROM BigTest4) AS t1
     WHERE RecID > 1
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest4
    GO
    PRINT 'Test05'
    go
    --===== Start a "timer"
    DECLARE @StartTime DATETIME
    SET @StartTime = GETDATE()
    --===== Delete all but the latest row for each SomeID according to SomeDate
    DELETE t1
      FROM (SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
              FROM BigTest5) AS t1
     WHERE RecID > 1
    --===== Display the duration of the DELETE
    PRINT CONVERT(VARCHAR(30),GETDATE()-@StartTime,114)
    -- DROP TABLE BigTest5
    GO

    /*
    THE RESULTS

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    Test01
    Table 'BigTest'. Scan count 10, logical reads 24628549, physical reads 0, read-ahead reads 37150, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 4, logical reads 23493048, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:04:30:420

    Test02
    Table 'BigTest2'. Scan count 2, logical reads 74911276, physical reads 64640, read-ahead reads 92198, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 2, logical reads 24250839, physical reads 20823, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:12:59:850

    Test03
    Table 'BigTest3'. Scan count 2, logical reads 74994281, physical reads 67125, read-ahead reads 89947, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 2, logical reads 24250839, physical reads 48219, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:12:12:900

    Test04
    Table 'BigTest4'. Scan count 1, logical reads 74786052, physical reads 58260, read-ahead reads 48409, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 2, logical reads 24250839, physical reads 2759, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:11:29:370

    Test05
    Table 'BigTest5'. Scan count 5, logical reads 24839987, physical reads 14885, read-ahead reads 42739, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 4, logical reads 23493048, physical reads 9, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (8013465 row(s) affected)
    (1 row(s) affected)
    00:03:41:270
    */

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • A lot of impressive information! Well done.

    What does it say?

     



  • In short:

    The delete query:

    DELETE t1
      FROM (SELECT ROW_NUMBER() OVER (PARTITION BY SomeID ORDER BY SomeDate DESC) AS RecID
              FROM BigTest5) AS t1
     WHERE RecID > 1

    is the fastest when the operation can be carried out with a plain table scan (no extra indexes beyond the primary key, which is useless for this operation).
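    SQLite (3.25+) supports the same window function, so the winning pattern can be sketched with Python's sqlite3 module. Since SQLite cannot delete from a derived table the way T-SQL can, this sketch routes the row numbers through an IN subquery instead; a RowNum tiebreaker is also added to the ORDER BY (an addition, echoing the tie discussion earlier in the thread) so the survivor is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE BigTest5 (RowNum INTEGER PRIMARY KEY, SomeID INT, SomeDate TEXT)")
cur.executemany("INSERT INTO BigTest5 VALUES (?, ?, ?)", [
    (1, 22, '2003-06-02'),
    (2, 22, '2005-08-29'),
    (3, 22, '2005-08-29'),  # tied latest date for SomeID 22
    (4, 31, '2007-12-12'),
])

# Number the rows newest-first within each SomeID and delete everything
# past the first; RowNum DESC breaks date ties deterministically.
cur.execute("""
    DELETE FROM BigTest5
     WHERE RowNum IN (
           SELECT RowNum
             FROM (SELECT RowNum,
                          ROW_NUMBER() OVER (PARTITION BY SomeID
                                             ORDER BY SomeDate DESC, RowNum DESC) AS RecID
                     FROM BigTest5)
            WHERE RecID > 1)
""")

survivors = cur.execute("SELECT RowNum, SomeID FROM BigTest5 ORDER BY SomeID").fetchall()
print(survivors)  # exactly one row left per SomeID
```

    One pass over the table, one sort, and no self-join, which matches why the execution plan above shows only a single clustered index scan.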

    Johan


  • Very cool... thanks for taking the time, ALZDBA!

    --Jeff Moden


