Deleting duplicate records

  • I have a table, tmPunchtimeSummary, which contains a sum of each employee's hours per day. The table contains some duplicates.

    CREATE TABLE [tmPunchtimeSummary]

    (

     [iTmPunchTimeSummaryId] [int] IDENTITY (1, 1) NOT NULL ,

     [sCalldate] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [sEmployeeId] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [dTotalHrs] [decimal](18, 4) NULL

    ) ON [PRIMARY]

    INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)

    VALUES('20060610', '1234', 4.5)

    INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)

    VALUES('20060610', '1234', 4.5)

    INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)

    VALUES('20060610', '2468', 8.0)

    INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)

    VALUES('20060610', '1357', 9.0)

    INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)

    VALUES('20060610', '2345', 8.5)

    INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)

    VALUES('20060610', '2345', 8.5)

    How can I write a delete statement to delete only the duplicates, which in this case would be the 1st and 5th records?

    Thanks,

    Ninel

  • Hi,

    Use this one:

    First, insert the non-duplicate records into a temp table:

    1) Select distinct * into #t1 from <table_name>

    2) Truncate your table:

    Truncate table <table_name>

    3) Insert the non-duplicate records from the temp table back into your truncated table:

    Insert into <table_name> select * from #t1

    Regards,

    Amit Gupta.

     

     

     

  • >>1) Select distinct * into #t1 from <table_name>

    That won't work - there's an identity column, so SELECT DISTINCT * changes nothing: every row is already distinct because of its unique identity value.
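
    To make that approach work, the non-identity columns would have to be listed explicitly - something along these lines (untested sketch, and note that the rows get brand new identity values on the way back in):

     SELECT DISTINCT sCalldate, sEmployeeId, dTotalHrs
       INTO #t1
       FROM tmPunchtimeSummary

     TRUNCATE TABLE tmPunchtimeSummary

     INSERT INTO tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
     SELECT sCalldate, sEmployeeId, dTotalHrs
       FROM #t1

     DROP TABLE #t1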

     

    If there are 2 records, with the same date and employeeid, but different number of hours, which record would you keep ?

     

  • It doesn't matter. They are identical except for the Id field.

  • Join to a derived table that gives the MIN() ID (or MAX(), whatever you want) per dupe:

    SELECT t.iTmPunchTimeSummaryId, t.sCalldate, t.sEmployeeId, t.dTotalHrs

    FROM tmPunchtimeSummary As t

    INNER JOIN

    (

      SELECT MIN(iTmPunchTimeSummaryId) As RetainedID

      FROM tmPunchtimeSummary

      GROUP BY sCalldate, sEmployeeId, dTotalHrs

    ) dt

      ON (dt.RetainedID = t.iTmPunchTimeSummaryId)
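
    The same derived table can drive the delete itself - roughly like this (untested):

     DELETE t
       FROM tmPunchtimeSummary AS t
       LEFT JOIN
            (
             SELECT MIN(iTmPunchTimeSummaryId) AS RetainedID
               FROM tmPunchtimeSummary
              GROUP BY sCalldate, sEmployeeId, dTotalHrs
            ) dt
         ON dt.RetainedID = t.iTmPunchTimeSummaryId
      WHERE dt.RetainedID IS NULL  -- every row that is not the retained MIN() per group gets deleted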

     

     

    You can use SET ROWCOUNT to delete duplicates. For example, to delete the first record in your case:

    set rowcount 1

    delete from tmPunchtimeSummary where sEmployeeId = '1234'

    This will delete only one record.
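
    Just remember to reset it afterwards, or every later statement in the session is also capped - the full pattern would look something like this (sketch):

    set rowcount 1

    delete from tmPunchtimeSummary where sEmployeeId = '1234'

    set rowcount 0  -- reset so subsequent statements are not limited to one row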

    Thanks,

    SR

  • My preferred method uses some grouping, and direct deletion:

    DELETE FROM tmPunchtimeSummary

    WHERE iTmPunchTimeSummaryId NOT IN

        (SELECT MAX(itmPunchtimeSummaryId)

         FROM tmPunchtimeSummary

         GROUP BY sCalldate, sEmployeeId, dTotalHrs

         )

    This will handle cases where there is more than one dup row, and will always delete the earlier-in-sequence rows, keeping the latest row. To keep the first instance instead, make the MAX a MIN.

    This also has the advantage of never leaving the table with no rows in it, which will occur with the temp table copy out and back in.

  • DELETE FROM

     p1

    FROM

     tmPunchtimeSummary  p1

    INNER JOIN

     (

      SELECT MAX(itmPunchtimeSummaryId) as as itmPunchTimeSummaryID, sCalldate, sEmployeeId, dTotalHrs FROM tmPunchtimeSummary GROUP BY  sCalldate, sEmployeeId, dTotalHrs HAVING COUNT(*) > 1

     ) p2

     ON(

      p1.sCalldate = p2.sCalldate

      AND

      p1.sEmployeeId = p2.sEmployeeId AND

      p1.itmPunchTimeSummaryID <> p2.itmPunchTimeSummaryID)

  • No one has asked the obvious question: "How did the duplicates get there in the first place?"

    Usually, when duplicates show up, it is due to some upstream error and/or a design flaw that failed to recognize and deal with the potential for duplicates.

    Too often, we are lulled into a sense of security by the assurance that an identity column provides the requisite "uniqueness" for a table. This case is an example where the requisite "uniqueness" is obviously underserved by the identity column.

    The correct (although often most painful) solution to the problem of "duplicates" is not in covering them over in some downstream process, but, rather, in preventing them at the source.

  • Absolutely right. If your system has (genuine) duplicates in it, then the system is broken.

    Guessing based on the schema, I'd say you need a unique constraint on date, employee (sketched after the list below). You also need to ensure

    (a) That the repeated values are real duplicates, i.e represent the same fact or thing. For example, if the data contains repeated values for date,employee but with different values for totalhours, that might suggest that the procedure which generates the data is not aggregating to a sufficiently low level of granularity (sufficiently small number of grouping columns). In that case, the totalhours would need to be summed by date,employee rather than the duplicates deleted.

    (b) that the deduping procedure is run after the source of the error is fixed so that no more corrupt data can be introduced.

    (c) that you couldn't use an (indexed) view instead of maintaining the denormalised data yourself.
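
    A sketch of that constraint (the constraint name is just an example, and it assumes the existing dupes have been cleaned out first):

    ALTER TABLE tmPunchtimeSummary
      ADD CONSTRAINT UQ_tmPunchtimeSummary_Calldate_Employee
          UNIQUE (sCalldate, sEmployeeId)

    And if (a) turns out to be the real situation - repeated date/employee pairs with different hours - the fix is to re-aggregate rather than delete, something like:

    SELECT sCalldate, sEmployeeId, SUM(dTotalHrs) AS dTotalHrs
      FROM tmPunchtimeSummary
     GROUP BY sCalldate, sEmployeeId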

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • I absolutely agree with Richard and Tim... if you have dupes in a table, something is usually wrong.  However, it might not be your fault because of data you receive from a 3rd party that doesn't really have a handle on things.  You could, however, easily prevent the dupes by adding a unique index with the IGNORE_DUP_KEY option to keep the dupes out of the table.
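
    On SQL Server 2000 that would look something like this (just a sketch - the index name is made up, and 2005 onwards spells the option WITH (IGNORE_DUP_KEY = ON)):

     CREATE UNIQUE NONCLUSTERED INDEX IX_tmPunchtimeSummary_NoDupes
         ON tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
       WITH IGNORE_DUP_KEY  -- duplicate inserts are discarded with a warning instead of raising an error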

    That, however, doesn't work for making sure you have the latest data... instead, that keeps the earliest data.  I've found that letting the dupes in and then deleting all but the latest dupe (as indicated by an autonumbering ID or a date column) is, many times, faster than doing an "upsert" (merge, insert/update, whatever you want to call it).

    With that in mind, there are some pretty good suggestions and code for deleting dupes here... couldn't resist doing some testing... here are the average results on 100,000 rows for 10 runs of each sample... note that the code randomly generates the test data, which is why the average number of rows, although close, changes a bit between tests.

    The last test is for if no dupes are found...

    Test Condition                          Method       Average Rows Deleted   Average Time to Delete (sec)
    No PK or Indexes                        PW           5,334                  5.300
    No PK or Indexes                        Nillsond     5,353                  3.406
    No PK or Indexes                        Amit         5,320                  4.106
    No PK or Indexes                        Jeff         5,392                  2.703
    Clustered PK on ID                      PW           5,320                  5.330
    Clustered PK on ID                      Nillsond     5,306                  3.643
    Clustered PK on ID                      Amit         5,338                  4.233
    Clustered PK on ID                      Jeff         5,343                  3.016
    NonClustered PK on ID                   PW           5,337                  4.826
    NonClustered PK on ID                   Nillsond     5,321                  3.326
    NonClustered PK on ID                   Amit         5,359                  4.983
    NonClustered PK on ID                   Jeff         5,337                  2.766
    NonClustered PK + Index                 PW           5,350                  3.703
    NonClustered PK + Index                 Nillsond     5,380                  3.063
    NonClustered PK + Index                 Amit         5,352                  5.000
    NonClustered PK + Index                 Jeff         5,352                  2.250
    NonClustered PK + Index and no dupes    PW           0                      2.986
    NonClustered PK + Index and no dupes    Nillsond     0                      1.090
    NonClustered PK + Index and no dupes    Amit         0                      5.043
    NonClustered PK + Index and no dupes    Jeff         0                      0.733

    Oh, yeah... forgot my suggestion for deleting dupes ...

    --===== Jeff's Delete

     DELETE FROM t1

       FROM TmPunchtimeSummary  t1,

            tmPunchtimeSummary  t2

      WHERE t1.sCalldate    =  t2.sCalldate

        AND t1.sEmployeeId  = t2.sEmployeeId

        AND t1.dTotalHrs    =  t2.dTotalHrs

        AND t1.iTmPunchTimeSummaryID < t2.iTmPunchTimeSummaryID 

    Although Amit had a great idea, the thing I don't like about it is that, as PW pointed out, you cannot use SELECT DISTINCT * because of the IDENTITY field in the original table.  If you exclude the IDENTITY column, you lose what the original identity values were.  That may be OK or it might not... in any case, a pot wad of data got moved back and forth using Amit's method whether there were dupes or not.  And, finally, it cannot be used as an "upsert" of the latest information unless you sort by chronology in descending order, which would add even more time, not to mention that the data in the table is, however briefly, temporarily unavailable.
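
    If the original identity values did need to be preserved, the copy-back could keep the lowest ID per group and use IDENTITY_INSERT - roughly this (untested sketch):

     SELECT MIN(iTmPunchTimeSummaryId) AS iTmPunchTimeSummaryId,
            sCalldate, sEmployeeId, dTotalHrs
       INTO #t1
       FROM tmPunchtimeSummary
      GROUP BY sCalldate, sEmployeeId, dTotalHrs

     TRUNCATE TABLE tmPunchtimeSummary

     SET IDENTITY_INSERT tmPunchtimeSummary ON

     INSERT INTO tmPunchtimeSummary
            (iTmPunchTimeSummaryId, sCalldate, sEmployeeId, dTotalHrs)
     SELECT iTmPunchTimeSummaryId, sCalldate, sEmployeeId, dTotalHrs
       FROM #t1

     SET IDENTITY_INSERT tmPunchtimeSummary OFF

     DROP TABLE #t1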

    For those interested, here's the test code I used... note that it had to be slightly modified to run Amit's tests to get the average row counts...

    --===== Declare local variables

    DECLARE @DeletedRowCount INT

    DECLARE @StartTime       DATETIME

    DECLARE @LoopCounter     INT

    DECLARE @MaxLoops        INT

    DECLARE @Duration        INT --total milliseconds, accumulated across runs

    --===== Preset the variables and setting for the test

        SET @DeletedRowCount = 0

        SET @LoopCounter     = 1

        SET @MaxLoops        = 10

        SET NOCOUNT ON

    --===== Run the test in a loop ten times

      WHILE @LoopCounter <= @MaxLoops

      BEGIN

            --===== If the test table exists, drop it

                 IF OBJECT_ID('TestTable') IS NOT NULL

                    DROP TABLE TestTable

           

            --===== Create the test table

             CREATE TABLE TestTable

                    (

                     iTestTableId INT IDENTITY(1,1),

    --                  iTestTableId INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

    --                  iTestTableId INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,

                     sCalldate   VARCHAR(20),

                     sEmployeeId VARCHAR(20),

                     dTotalHrs   DECIMAL(18,4)

                    )

           

            --===== Populate the test table with 100k rows of random data

             INSERT INTO TestTable

                    (sCalldate, sEmployeeId, dTotalHrs)

             SELECT TOP 100000

                    sCalldate   = '20060610',

                    sEmployeeId = CAST(RAND(CAST(NEWID() AS VARBINARY))*8999+1000 AS INT),

                    dTotalHrs   = STR(RAND(CAST(NEWID() AS VARBINARY))*10,3,1)

               FROM Master.dbo.SysComments sc1,

                    Master.dbo.SysComments sc2

            --===== Create an index for the WHERE clauses (commented out for most tests)

    --         CREATE INDEX TestIndex ON TestTable (sCalldate,sEmployeeId,dTotalHrs)

           

            --===== Clear any cached data that may exist

               DBCC DROPCLEANBUFFERS

               DBCC FREEPROCCACHE

           

            --===== Start the performance timer

                SET @StartTime = GETDATE()

            --===== Put code to be tested here

     

            --===== Capture the number of deleted rows and the duration

                SET @DeletedRowCount = @DeletedRowCount + @@ROWCOUNT --comment out this line to test Amit's

                SET @Duration = ISNULL(@Duration,0) + DATEDIFF(ms,@StartTime,GETDATE()) --accumulate so an average can be reported

       

        SET @LoopCounter = @LoopCounter + 1

        END --End of test loop

    --===== Test complete, report duration

      PRINT REPLICATE('=',78)

      PRINT STR(@DeletedRowCount/@MaxLoops) + ' Average rows deleted'

      PRINT STR(@Duration/1000.0/@MaxLoops,10,3) + ' Average time to delete'

    And here's the different test code after a little reformatting and, in Nillsond's case, a bit of repair (pretty cool... even though he didn't test it {had 2 "AS" in it}, he almost got it right)...

    --===== PW's delete

     DELETE FROM TestTable

      WHERE iTestTableId NOT IN

            (

             SELECT MAX(iTestTableId)

               FROM TestTable

              GROUP BY  sCalldate,

                        sEmployeeId,

                        dTotalHrs

            )

    --===== Nillsond's delete (after a bit of repair)

     DELETE FROM p1

       FROM TestTable p1

      INNER JOIN

            (

             SELECT MAX(iTestTableID) AS iTestTableID,

                    sCalldate,

                    sEmployeeId,

                    dTotalHrs

               FROM TestTable

              GROUP BY  sCalldate,

                        sEmployeeId,

                        dTotalHrs HAVING COUNT(*) > 1

            ) p2

         ON p1.sCalldate   = p2.sCalldate

        AND p1.sEmployeeId = p2.sEmployeeId

        AND p1.dTotalHrs   = p2.dTotalHrs --Missing from original code

        AND p1.iTestTableID <> p2.iTestTableID

    --===== Jeff's Delete

     DELETE FROM t1

       FROM TestTable t1,

            TestTable t2

      WHERE t1.sCalldate    =  t2.sCalldate

        AND t1.sEmployeeId  = t2.sEmployeeId

        AND t1.dTotalHrs    =  t2.dTotalHrs

        AND t1.iTestTableId < t2.iTestTableId

    Amit's test had to be run a little differently so far as how the average record count was derived...

    --===== Amit's Delete

     SELECT DISTINCT sCalldate,sEmployeeId,dTotalHrs

       INTO #MyTemp

      FROM TestTable

       SET @DeletedRowCount = @DeletedRowCount + (100000-@@ROWCOUNT)

    TRUNCATE TABLE TestTable

     INSERT INTO TestTable

            (sCalldate,sEmployeeId,dTotalHrs)

     SELECT sCalldate,sEmployeeId,dTotalHrs

       FROM #MyTemp

       DROP TABLE #MyTemp

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Forgot to mention... I changed the name of the table to TestTable to keep anyone from accidentally wiping out a production table... just in case...

    --Jeff Moden



  • >(pretty cool... even though he didn't test it {had 2 "AS" in it}, he almost got it right)...

    Even cooler when no-one realises you haven't tested it!

    BTW, MAX() might be more efficient than the theta-join if there were many repeats of the same values - say octuplicates rather than strict duplicates.

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • If you have that many, you need to fire the 3rd party butt first out of a cannon into a stone wall

    I did do the test, though... 120 k records consisting of 15000 unique records duplicated 8 times...

    PW's came in in the high 11-second range, and the other two code solutions came in in the low twelves...

    On 512 k records consisting of 2000 unique records duplicated 256 times, PW's came in at 50 seconds and the other two came in at about 58 seconds.  Of course, I'm thinking kill the vendor with that many dupes.

    --Jeff Moden


