June 14, 2006 at 10:37 am
I have a table tmPunchtimeSummary which contains a sum of employee's hours per day. The table contains some duplicates.
CREATE TABLE [tmPunchtimeSummary]
(
[iTmPunchTimeSummaryId] [int] IDENTITY (1, 1) NOT NULL ,
[sCalldate] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[sEmployeeId] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[dTotalHrs] [decimal](18, 4) NULL
) ON [PRIMARY]
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '1234', 4.5)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '1234', 4.5)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '2468', 8.0)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '1357', 9.0)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '2345', 8.5)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '2345', 8.5)
How can I write a DELETE statement to remove only the duplicates, which in this case would be the 1st and 5th records?
Thanks,
Ninel
June 14, 2006 at 10:44 am
Hi,
Use this one..
First, insert the non-duplicate records into a temp table (a combined sketch for the posted table follows the steps):
1) Select distinct * into #t1 from <table_name>
2) Truncate your table:
Truncate table <table_name>
3) Insert the non-duplicate records from the temp table back into your truncated table:
Insert into <table_name> select * from #t1
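Put together for the posted table, the whole thing looks something like this (a sketch only; note it leaves the identity column out of the DISTINCT, so the original identity values are not preserved):
SELECT DISTINCT sCalldate, sEmployeeId, dTotalHrs
INTO #t1
FROM tmPunchtimeSummary

TRUNCATE TABLE tmPunchtimeSummary

INSERT INTO tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
SELECT sCalldate, sEmployeeId, dTotalHrs
FROM #t1

DROP TABLE #t1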
Regards,
Amit Gupta.
June 14, 2006 at 10:46 am
>>1) Select distinct * into #t1 from <table_name>
That won't work - there's an identity column, so SELECT DISTINCT * changes nothing, since every record is already distinct thanks to the ascending identity value.
If there are 2 records, with the same date and employeeid, but different number of hours, which record would you keep ?
June 14, 2006 at 10:49 am
It doesn't matter. They are identical except for the Id field.
June 14, 2006 at 10:53 am
Join to a derived table that gives the MIN() ID (or MAX(), whatever you want) per dupe:
SELECT t.iTmPunchTimeSummaryId, t.sCalldate, t.sEmployeeId, t.dTotalHrs
FROM tmPunchtimeSummary As t
INNER JOIN
(
SELECT MIN(iTmPunchTimeSummaryId) As RetainedID
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
) dt
ON (dt.RetainedID = t.iTmPunchTimeSummaryId)
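To turn that into the actual delete, one option (just a sketch along the same lines, not the tested code later in this thread) is to keep the derived table and remove every row it does not retain:
DELETE t
FROM tmPunchtimeSummary As t
LEFT JOIN
(
SELECT MIN(iTmPunchTimeSummaryId) As RetainedID
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
) dt
ON (dt.RetainedID = t.iTmPunchTimeSummaryId)
WHERE dt.RetainedID IS NULL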
June 14, 2006 at 11:30 am
You can use SET ROWCOUNT to delete duplicates. For example, to delete the first record in your case:
set rowcount 1
delete from tmPunchtimeSummary where sEmployeeId='1234'
This will delete only one record.
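Spelled out a little more (a sketch using the values from the original post; the final SET ROWCOUNT 0 puts things back to normal so later statements aren't limited):
SET ROWCOUNT 1
DELETE FROM tmPunchtimeSummary
WHERE sCalldate = '20060610'
AND sEmployeeId = '1234'
AND dTotalHrs = 4.5
SET ROWCOUNT 0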
Thanks,
SR
June 14, 2006 at 11:55 am
My preferred method uses some grouping, and direct deletion:
DELETE FROM tmPunchtimeSummary
WHERE iTmPunchTimeSummaryId NOT IN
(SELECT MAX(itmPunchtimeSummaryId)
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
)
This will handle cases where there is more than one dup row, and will always delete the earlier-in-sequence rows, keeping the latest row. To keep the first instance instead, make the MAX a MIN.
This also has the advantage of never leaving the table with no rows in it, which will occur with the temp table copy out and back in.
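For completeness, the keep-the-first variant is the same statement with MIN in place of MAX:
DELETE FROM tmPunchtimeSummary
WHERE iTmPunchTimeSummaryId NOT IN
(SELECT MIN(itmPunchtimeSummaryId)
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
)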
June 14, 2006 at 12:21 pm
DELETE FROM p1
FROM tmPunchtimeSummary p1
INNER JOIN
(
SELECT MAX(itmPunchtimeSummaryId) as as itmPunchTimeSummaryID,
sCalldate, sEmployeeId, dTotalHrs
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
HAVING COUNT(*) > 1
) p2
ON (
p1.sCalldate = p2.sCalldate
AND p1.sEmployeeId = p2.sEmployeeId
AND p1.itmPunchTimeSummaryID <> p2.itmPunchTimeSummaryID
)
June 17, 2006 at 9:55 am
No one has asked the obvious question: "How did the duplicates get there in the first place?"
Usually, when duplicates show up, it is due to some upstream error and/or a design flaw that failed to recognize and deal with the potential for duplicates.
Too often, we are lulled into a sense of security by the assurance that an identity column provides the requisite "uniqueness" for a table. This case is an example where the requisite "uniqueness" is clearly not provided by the identity column.
The correct (although often most painful) solution to the problem of "duplicates" is not covering them over in some downstream process, but, rather, preventing them at the source.
June 17, 2006 at 7:14 pm
Absolutely right. If your system has (genuine) duplicates in it, then the system is broken.
Guessing based on the schema, I'd say you need a unique constraint on date, employee (a sketch follows after this list). You also need to ensure:
(a) That the repeated values are real duplicates, i.e. represent the same fact or thing. For example, if the data contains repeated values for date, employee but with different values for totalhours, that might suggest that the procedure which generates the data is not aggregating coarsely enough (i.e. is using too many grouping columns). In that case, the totalhours would need to be summed by date, employee rather than the duplicates deleted.
(b) That the deduping procedure is run after the source of the error is fixed, so that no more corrupt data can be introduced.
(c) That you couldn't use an (indexed) view instead of maintaining the denormalised data yourself.
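A sketch of the constraint mentioned above (the constraint name is made up, and it can only be added once the existing duplicates have been removed):
ALTER TABLE tmPunchtimeSummary
ADD CONSTRAINT UQ_tmPunchtimeSummary_Date_Employee UNIQUE (sCalldate, sEmployeeId)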
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
June 17, 2006 at 9:29 pm
I absolutely agree with Richard and Tim... if you have dupes in a table, something is usually wrong. However, it might not be your fault if the data comes from a 3rd party that doesn't really have a handle on things. You could, however, easily prevent the dupes by adding a unique key with the IGNORE_DUP_KEY option to keep them out of the table.
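For example, something along these lines (a sketch; the index name is made up, and this is the older-style option syntax that SQL Server 2000 and 2005 both accept):
CREATE UNIQUE NONCLUSTERED INDEX UIX_tmPunchtimeSummary_NoDupes
ON tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
WITH IGNORE_DUP_KEY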
That, however, doesn't work for making sure you have the latest data... instead, that keeps the earliest data. I've found that letting the dupes in and then deleting all but the latest dupe (as indicated by an autonumbering ID or a date column) is, many times, faster than doing an "upsert" (merge, insert/update, whatever you want to call it).
With that in mind, there are some pretty good suggestions and code for deleting dupes here... couldn't resist doing some testing... here are the average results on 100,000 rows for 10 runs of each sample... note that the code randomly generates the test data, which is why the average number of rows, although close, changes a bit between tests.
The last test shows what happens when no dupes are found...
Average Rows Deleted
Test Condition                     |    PW | Nillsond |  Amit |  Jeff
No PK or indexes                   | 5,334 |    5,353 | 5,320 | 5,392
Clustered PK on ID                 | 5,320 |    5,306 | 5,338 | 5,343
NonClustered PK on ID              | 5,337 |    5,321 | 5,359 | 5,337
NonClustered PK + index            | 5,350 |    5,380 | 5,352 | 5,352
NonClustered PK + index, no dupes  |     0 |        0 |     0 |     0

Average Time to Delete (seconds)
Test Condition                     |    PW | Nillsond |  Amit |  Jeff
No PK or indexes                   | 5.300 |    3.406 | 4.106 | 2.703
Clustered PK on ID                 | 5.330 |    3.643 | 4.233 | 3.016
NonClustered PK on ID              | 4.826 |    3.326 | 4.983 | 2.766
NonClustered PK + index            | 3.703 |    3.063 | 5.000 | 2.250
NonClustered PK + index, no dupes  | 2.986 |    1.090 | 5.043 | 0.733
Oh, yeah... forgot my suggestion for deleting dupes ...
--===== Jeff's Delete
DELETE FROM t1
FROM TmPunchtimeSummary t1,
tmPunchtimeSummary t2
WHERE t1.sCalldate = t2.sCalldate
AND t1.sEmployeeId = t2.sEmployeeId
AND t1.dTotalHrs = t2.dTotalHrs
AND t1.iTmPunchTimeSummaryID < t2.iTmPunchTimeSummaryID
Although Amit had a great idea, the thing I don't like about it is that, as PW pointed out, you cannot use SELECT DISTINCT * because of the IDENTITY field in the original table. If you exclude the IDENTITY column, you lose what the original identity value was. That may be OK or it might not... in any case, a pot wad of data got moved back and forth using Amit's method whether there were dupes or not. And, finally, it cannot be used as an "upsert" of the latest information unless you sort by chronology in descending order, which would add even more time, not to mention that the data in the table is, however briefly, unavailable.
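If keeping the original identity values did matter, one untested variation on the copy-out-and-back approach would be to carry the retained ID along and put it back with IDENTITY_INSERT turned on:
--===== Untested sketch: keep one ID per dupe group and preserve it on re-insert
SELECT MIN(iTmPunchTimeSummaryId) AS iTmPunchTimeSummaryId,
sCalldate, sEmployeeId, dTotalHrs
INTO #Keepers
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs

TRUNCATE TABLE tmPunchtimeSummary

SET IDENTITY_INSERT tmPunchtimeSummary ON
INSERT INTO tmPunchtimeSummary
(iTmPunchTimeSummaryId, sCalldate, sEmployeeId, dTotalHrs)
SELECT iTmPunchTimeSummaryId, sCalldate, sEmployeeId, dTotalHrs
FROM #Keepers
SET IDENTITY_INSERT tmPunchtimeSummary OFF

DROP TABLE #Keepers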
For those interested, here's the test code I used... note that it had to be slightly modified to run Amit's tests to get the average row counts...
--===== Declare local variables
DECLARE @DeletedRowCount INT
DECLARE @StartTime DATETIME
DECLARE @LoopCounter INT
DECLARE @MaxLoops INT
DECLARE @DurationMs INT --Total delete time in milliseconds, summed across runs
--===== Preset the variables and settings for the test
SET @DeletedRowCount = 0
SET @DurationMs = 0
SET @LoopCounter = 1
SET @MaxLoops = 10
SET NOCOUNT ON
--===== Run the test in a loop ten times
WHILE @LoopCounter <= @MaxLoops
BEGIN
--===== If the test table exists, drop it
IF OBJECT_ID('TestTable') IS NOT NULL
DROP TABLE TestTable
--===== Create the test table
CREATE TABLE TestTable
(
iTestTableId INT IDENTITY(1,1),
-- iTestTableId INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
-- iTestTableId INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
sCalldate VARCHAR(20),
sEmployeeId VARCHAR(20),
dTotalHrs DECIMAL(18,4)
)
--===== Populate the test table with 100k rows of random data
INSERT INTO TestTable
(sCalldate, sEmployeeId, dTotalHrs)
SELECT TOP 100000
sCalldate = '20060610',
sEmployeeId = CAST(RAND(CAST(NEWID() AS VARBINARY))*8999+1000 AS INT),
dTotalHrs = STR(RAND(CAST(NEWID() AS VARBINARY))*10,3,1)
FROM Master.dbo.SysComments sc1,
Master.dbo.SysComments sc2
--===== Create an index for the WHERE clauses (commented out for most tests)
-- CREATE INDEX TestIndex ON TestTable (sCalldate,sEmployeeId,dTotalHrs)
--===== Clear any cached data that may exist
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
--===== Start the performance timer
SET @StartTime = GETDATE()
--===== Put code to be tested here
--===== Capture the number of deleted rows and the duration
SET @DeletedRowCount = @DeletedRowCount + @@ROWCOUNT --comment out this line to test Amit's
SET @DurationMs = @DurationMs + DATEDIFF(ms, @StartTime, GETDATE())
SET @LoopCounter = @LoopCounter + 1
END --End of test loop
--===== Test complete, report duration
PRINT REPLICATE('=',78)
PRINT STR(@DeletedRowCount/@MaxLoops) + ' Average rows deleted'
PRINT STR(@DurationMs/1000.0/@MaxLoops,10,3) + ' Average time to delete'
And here are the different test codes after a little reformatting and, in Nillsond's case, a bit of repair (pretty cool... even though he didn't test it {it had 2 "AS" in it}, he almost got it right)...
--===== PW's delete
DELETE FROM TestTable
WHERE iTestTableId NOT IN
(
SELECT MAX(iTestTableId)
FROM TestTable
GROUP BY sCalldate,
sEmployeeId,
dTotalHrs
)
--===== Nillsond's delete (after a bit of repair)
DELETE FROM p1
FROM TestTable p1
INNER JOIN
(
SELECT MAX(iTestTableID) AS iTestTableID,
sCalldate,
sEmployeeId,
dTotalHrs
FROM TestTable
GROUP BY sCalldate,
sEmployeeId,
dTotalHrs HAVING COUNT(*) > 1
) p2
ON p1.sCalldate = p2.sCalldate
AND p1.sEmployeeId = p2.sEmployeeId
AND p1.dTotalHrs = p2.dTotalHrs --Missing from original code
AND p1.iTestTableID <> p2.iTestTableID
--===== Jeff's Delete
DELETE FROM t1
FROM TestTable t1,
TestTable t2
WHERE t1.sCalldate = t2.sCalldate
AND t1.sEmployeeId = t2.sEmployeeId
AND t1.dTotalHrs = t2.dTotalHrs
AND t1.iTestTableId < t2.iTestTableId
Amit's test had to be run a little differently so far as how the average record count was derived...
--===== Amit's Delete
SELECT DISTINCT sCalldate,sEmployeeId,dTotalHrs
INTO #MyTemp
FROM TestTable
SET @DeletedRowCount = @DeletedRowCount + (100000-@@ROWCOUNT)
TRUNCATE TABLE TestTable
INSERT INTO TestTable
(sCalldate,sEmployeeId,dTotalHrs)
SELECT sCalldate,sEmployeeId,dTotalHrs
FROM #MyTemp
DROP TABLE #MyTemp
--Jeff Moden
Change is inevitable... Change for the better is not.
June 18, 2006 at 7:08 am
Forgot to mention... I changed the name of the table to TestTable to keep anyone from accidentally wiping out a production table... just in case...
--Jeff Moden
Change is inevitable... Change for the better is not.
June 18, 2006 at 9:11 am
>(pretty cool... even though he didn't test it {had 2 "AS" in it}, he almost got it right)...
Even cooler when no-one realises you haven't tested it!
BTW, MAX() might be more efficient than the theta-join if there were many repeats of the same values - say octuplicates rather than strict duplicates.
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
June 18, 2006 at 1:27 pm
If you have that many, you need to fire the 3rd party butt first out of a cannon into a stone wall
I did do the test, though... 120 k records consisting of 15000 unique records duplicated 8 times...
PW's came in the high 11-second range; the other two code solutions came in the low twelves...
On 512 k records consisting of 2000 unique records duplicated 256 times, PW's came in at 50 seconds, the other two came in at about 58 seconds. Of course, I'm thinking kill the vendor with that many dupes
--Jeff Moden
Change is inevitable... Change for the better is not.