June 14, 2006 at 10:37 am
I have a table tmPunchtimeSummary which contains a sum of employee's hours per day. The table contains some duplicates.
CREATE TABLE [tmPunchtimeSummary]
(
[iTmPunchTimeSummaryId] [int] IDENTITY (1, 1) NOT NULL ,
[sCalldate] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[sEmployeeId] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[dTotalHrs] [decimal](18, 4) NULL
) ON [PRIMARY]
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '1234', 4.5)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '1234', 4.5)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '2468', 8.0)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '1357', 9.0)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '2345', 8.5)
INSERT tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
VALUES('20060610', '2345', 8.5)
How can I write a DELETE statement to remove only the duplicates, which in this case would be the 1st and 5th records?
Thanks,
Ninel
June 14, 2006 at 10:44 am
Hi,
Use this one..
First, insert the non-duplicate records into a temp table (a combined sketch for the posted table follows the steps):
1) Select distinct * into #t1 from <table_name>
2) Truncate your table:
Truncate table <table_name>
3) Insert the non-duplicate records from the temp table back into your truncated table:
Insert into <table_name> select * from #t1
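Put together for the posted table, the whole thing looks something like this (a sketch only; note it leaves the identity column out of the DISTINCT, so the original identity values are not preserved):
SELECT DISTINCT sCalldate, sEmployeeId, dTotalHrs
INTO #t1
FROM tmPunchtimeSummary

TRUNCATE TABLE tmPunchtimeSummary

INSERT INTO tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
SELECT sCalldate, sEmployeeId, dTotalHrs
FROM #t1

DROP TABLE #t1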
Regards,
Amit Gupta.
June 14, 2006 at 10:46 am
>>1) Select distinct * into #t1 from <table_name>
That won't work - there's an identity column, so SELECT DISTINCT * changes nothing, since every record is already distinct thanks to the ascending identity value.
If there are 2 records, with the same date and employeeid, but different number of hours, which record would you keep ?
June 14, 2006 at 10:49 am
It doesn't matter. They are identical except for the Id field.
June 14, 2006 at 10:53 am
Join to a derived table that gives the MIN() ID (or MAX(), whatever you want) per dupe:
SELECT t.iTmPunchTimeSummaryId, t.sCalldate, t.sEmployeeId, t.dTotalHrs
FROM tmPunchtimeSummary As t
INNER JOIN
(
SELECT MIN(iTmPunchTimeSummaryId) As RetainedID
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
) dt
ON (dt.RetainedID = t.iTmPunchTimeSummaryId)
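To turn that into the actual delete, one option (just a sketch along the same lines, not the tested code later in this thread) is to keep the derived table and remove every row it does not retain:
DELETE t
FROM tmPunchtimeSummary As t
LEFT JOIN
(
SELECT MIN(iTmPunchTimeSummaryId) As RetainedID
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
) dt
ON (dt.RetainedID = t.iTmPunchTimeSummaryId)
WHERE dt.RetainedID IS NULL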
June 14, 2006 at 11:30 am
You can use SET ROWCOUNT to delete duplicates. For example, to delete the first record in your case:
set rowcount 1
delete from tmPunchtimeSummary where sEmployeeId='1234'
This will delete only one record.
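Spelled out a little more (a sketch using the values from the original post; the final SET ROWCOUNT 0 puts things back to normal so later statements aren't limited):
SET ROWCOUNT 1
DELETE FROM tmPunchtimeSummary
WHERE sCalldate = '20060610'
AND sEmployeeId = '1234'
AND dTotalHrs = 4.5
SET ROWCOUNT 0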
Thanks,
SR
June 14, 2006 at 11:55 am
My preferred method uses some grouping, and direct deletion:
DELETE FROM tmPunchtimeSummary
WHERE iTmPunchTimeSummaryId NOT IN
(SELECT MAX(itmPunchtimeSummaryId)
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
)
This will handle cases where there is more than one dup row, and will always delete the earlier-in-sequence rows, keeping the latest row. To keep the first instance instead, make the MAX a MIN.
This also has the advantage of never leaving the table with no rows in it, which will occur with the temp table copy out and back in.
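For completeness, the keep-the-first variant is the same statement with MIN in place of MAX:
DELETE FROM tmPunchtimeSummary
WHERE iTmPunchTimeSummaryId NOT IN
(SELECT MIN(itmPunchtimeSummaryId)
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
)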
June 14, 2006 at 12:21 pm
DELETE FROM p1
FROM tmPunchtimeSummary p1
INNER JOIN
(
SELECT MAX(itmPunchtimeSummaryId) as as itmPunchTimeSummaryID,
sCalldate, sEmployeeId, dTotalHrs
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs
HAVING COUNT(*) > 1
) p2
ON (
p1.sCalldate = p2.sCalldate
AND p1.sEmployeeId = p2.sEmployeeId
AND p1.itmPunchTimeSummaryID <> p2.itmPunchTimeSummaryID
)
June 17, 2006 at 9:55 am
No one has asked the obvious question: "How did the duplicates get there in the first place?"
Usually, when duplicates show up, it is due to some upstream error and/or a design flaw that failed to recognize and deal with the potential for duplicates.
Too often, we are lulled into a sense of security by the assurance that an identity column provides the requisite "uniqueness" for a table. This case is an example where the requisite "uniqueness" is clearly not provided by the identity column.
The correct (although often most painful) solution to the problem of "duplicates" is not covering them over in some downstream process, but, rather, preventing them at the source.
June 17, 2006 at 7:14 pm
Absolutely right. If your system has (genuine) duplicates in it, then the system is broken.
Guessing based on the schema, I'd say you need a unique constraint on date, employee (a sketch follows after this list). You also need to ensure:
(a) That the repeated values are real duplicates, i.e. represent the same fact or thing. For example, if the data contains repeated values for date, employee but with different values for totalhours, that might suggest that the procedure which generates the data is not aggregating coarsely enough (i.e. is using too many grouping columns). In that case, the totalhours would need to be summed by date, employee rather than the duplicates deleted.
(b) That the deduping procedure is run after the source of the error is fixed, so that no more corrupt data can be introduced.
(c) That you couldn't use an (indexed) view instead of maintaining the denormalised data yourself.
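A sketch of the constraint mentioned above (the constraint name is made up, and it can only be added once the existing duplicates have been removed):
ALTER TABLE tmPunchtimeSummary
ADD CONSTRAINT UQ_tmPunchtimeSummary_Date_Employee UNIQUE (sCalldate, sEmployeeId)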
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
June 17, 2006 at 9:29 pm
I absolutely agree with Richard and Tim... if you have dupes in a table, something is usually wrong. However, it might not be your fault if the data comes from a 3rd party that doesn't really have a handle on things. You could, however, easily prevent the dupes by adding a unique key with the IGNORE_DUP_KEY option to keep them out of the table.
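For example, something along these lines (a sketch; the index name is made up, and this is the older-style option syntax that SQL Server 2000 and 2005 both accept):
CREATE UNIQUE NONCLUSTERED INDEX UIX_tmPunchtimeSummary_NoDupes
ON tmPunchtimeSummary (sCalldate, sEmployeeId, dTotalHrs)
WITH IGNORE_DUP_KEY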
That, however, doesn't work for making sure you have the latest data... instead, that keeps the earliest data. I've found that letting the dupes in and then deleting all but the latest dupe (as indicated by an autonumbering ID or a date column) is, many times, faster than doing an "upsert" (merge, insert/update, whatever you want to call it).
With that in mind, there are some pretty good suggestions and code for deleting dupes here... couldn't resist doing some testing... here are the average results on 100,000 rows for 10 runs of each sample... note that the code randomly generates the test data, which is why the average number of rows, although close, changes a bit between tests.
The last test shows what happens when no dupes are found...
Average Rows Deleted
Test Condition                     |    PW | Nillsond |  Amit |  Jeff
No PK or indexes                   | 5,334 |    5,353 | 5,320 | 5,392
Clustered PK on ID                 | 5,320 |    5,306 | 5,338 | 5,343
NonClustered PK on ID              | 5,337 |    5,321 | 5,359 | 5,337
NonClustered PK + index            | 5,350 |    5,380 | 5,352 | 5,352
NonClustered PK + index, no dupes  |     0 |        0 |     0 |     0

Average Time to Delete (seconds)
Test Condition                     |    PW | Nillsond |  Amit |  Jeff
No PK or indexes                   | 5.300 |    3.406 | 4.106 | 2.703
Clustered PK on ID                 | 5.330 |    3.643 | 4.233 | 3.016
NonClustered PK on ID              | 4.826 |    3.326 | 4.983 | 2.766
NonClustered PK + index            | 3.703 |    3.063 | 5.000 | 2.250
NonClustered PK + index, no dupes  | 2.986 |    1.090 | 5.043 | 0.733
Oh, yeah... forgot my suggestion for deleting dupes ...
--===== Jeff's Delete
DELETE FROM t1
FROM TmPunchtimeSummary t1,
tmPunchtimeSummary t2
WHERE t1.sCalldate = t2.sCalldate
AND t1.sEmployeeId = t2.sEmployeeId
AND t1.dTotalHrs = t2.dTotalHrs
AND t1.iTmPunchTimeSummaryID < t2.iTmPunchTimeSummaryID
Although Amit had a great idea, the thing I don't like about it is that, as PW pointed out, you cannot use SELECT DISTINCT * because of the IDENTITY field in the original table. If you exclude the IDENTITY column, you lose what the original identity value was. That may be OK or it might not... in any case, a pot wad of data got moved back and forth using Amit's method whether there were dupes or not. And, finally, it cannot be used as an "upsert" of the latest information unless you sort by chronology in descending order, which would add even more time, not to mention that the data in the table is, however briefly, unavailable.
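If keeping the original identity values did matter, one untested variation on the copy-out-and-back approach would be to carry the retained ID along and put it back with IDENTITY_INSERT turned on:
--===== Untested sketch: keep one ID per dupe group and preserve it on re-insert
SELECT MIN(iTmPunchTimeSummaryId) AS iTmPunchTimeSummaryId,
sCalldate, sEmployeeId, dTotalHrs
INTO #Keepers
FROM tmPunchtimeSummary
GROUP BY sCalldate, sEmployeeId, dTotalHrs

TRUNCATE TABLE tmPunchtimeSummary

SET IDENTITY_INSERT tmPunchtimeSummary ON
INSERT INTO tmPunchtimeSummary
(iTmPunchTimeSummaryId, sCalldate, sEmployeeId, dTotalHrs)
SELECT iTmPunchTimeSummaryId, sCalldate, sEmployeeId, dTotalHrs
FROM #Keepers
SET IDENTITY_INSERT tmPunchtimeSummary OFF

DROP TABLE #Keepers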
For those interested, here's the test code I used... note that it had to be slightly modified to run Amit's tests to get the average row counts...
--===== Declare local variables
DECLARE @DeletedRowCount INT
DECLARE @StartTime DATETIME
DECLARE @LoopCounter INT
DECLARE @MaxLoops INT
DECLARE @DurationMs INT --Total delete time in milliseconds, summed across runs
--===== Preset the variables and settings for the test
SET @DeletedRowCount = 0
SET @DurationMs = 0
SET @LoopCounter = 1
SET @MaxLoops = 10
SET NOCOUNT ON
--===== Run the test in a loop ten times
WHILE @LoopCounter <= @MaxLoops
BEGIN
--===== If the test table exists, drop it
IF OBJECT_ID('TestTable') IS NOT NULL
DROP TABLE TestTable
--===== Create the test table
CREATE TABLE TestTable
(
iTestTableId INT IDENTITY(1,1),
-- iTestTableId INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
-- iTestTableId INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
sCalldate VARCHAR(20),
sEmployeeId VARCHAR(20),
dTotalHrs DECIMAL(18,4)
)
--===== Populate the test table with 100k rows of random data
INSERT INTO TestTable
(sCalldate, sEmployeeId, dTotalHrs)
SELECT TOP 100000
sCalldate = '20060610',
sEmployeeId = CAST(RAND(CAST(NEWID() AS VARBINARY))*8999+1000 AS INT),
dTotalHrs = STR(RAND(CAST(NEWID() AS VARBINARY))*10,3,1)
FROM Master.dbo.SysComments sc1,
Master.dbo.SysComments sc2
--===== Create an index for the WHERE clauses (commented out for most tests)
-- CREATE INDEX TestIndex ON TestTable (sCalldate,sEmployeeId,dTotalHrs)
--===== Clear any cached data that may exist
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
--===== Start the performance timer
SET @StartTime = GETDATE()
--===== Put code to be tested here
--===== Capture the number of deleted rows and the duration
SET @DeletedRowCount = @DeletedRowCount + @@ROWCOUNT --comment out this line to test Amit's
SET @DurationMs = @DurationMs + DATEDIFF(ms, @StartTime, GETDATE())
SET @LoopCounter = @LoopCounter + 1
END --End of test loop
--===== Test complete, report duration
PRINT REPLICATE('=',78)
PRINT STR(@DeletedRowCount/@MaxLoops) + ' Average rows deleted'
PRINT STR(@DurationMs/1000.0/@MaxLoops,10,3) + ' Average time to delete'
And here are the different test codes after a little reformatting and, in Nillsond's case, a bit of repair (pretty cool... even though he didn't test it {it had 2 "AS" in it}, he almost got it right)...
--===== PW's delete
DELETE FROM TestTable
WHERE iTestTableId NOT IN
(
SELECT MAX(iTestTableId)
FROM TestTable
GROUP BY sCalldate,
sEmployeeId,
dTotalHrs
)
--===== Nillsond's delete (after a bit of repair)
DELETE FROM p1
FROM TestTable p1
INNER JOIN
(
SELECT MAX(iTestTableID) AS iTestTableID,
sCalldate,
sEmployeeId,
dTotalHrs
FROM TestTable
GROUP BY sCalldate,
sEmployeeId,
dTotalHrs HAVING COUNT(*) > 1
) p2
ON p1.sCalldate = p2.sCalldate
AND p1.sEmployeeId = p2.sEmployeeId
AND p1.dTotalHrs = p2.dTotalHrs --Missing from original code
AND p1.iTestTableID <> p2.iTestTableID
--===== Jeff's Delete
DELETE FROM t1
FROM TestTable t1,
TestTable t2
WHERE t1.sCalldate = t2.sCalldate
AND t1.sEmployeeId = t2.sEmployeeId
AND t1.dTotalHrs = t2.dTotalHrs
AND t1.iTestTableId < t2.iTestTableId
Amit's test had to be run a little differently so far as how the average record count was derived...
--===== Amit's Delete
SELECT DISTINCT sCalldate,sEmployeeId,dTotalHrs
INTO #MyTemp
FROM TestTable
SET @DeletedRowCount = @DeletedRowCount + (100000-@@ROWCOUNT)
TRUNCATE TABLE TestTable
INSERT INTO TestTable
(sCalldate,sEmployeeId,dTotalHrs)
SELECT sCalldate,sEmployeeId,dTotalHrs
FROM #MyTemp
DROP TABLE #MyTemp
--Jeff Moden
Change is inevitable... Change for the better is not.
June 18, 2006 at 7:08 am
Forgot to mention... I changed the name of the table to TestTable to keep anyone from accidentally wiping out a production table... just in case...
--Jeff Moden
Change is inevitable... Change for the better is not.
June 18, 2006 at 9:11 am
>(pretty cool... even though he didn't test it {had 2 "AS" in it}, he almost got it right)...
Even cooler when no-one realises you haven't tested it!
BTW, MAX() might be more efficient than the theta-join if there were many repeats of the same values - say octuplicates rather than strict duplicates.
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
June 18, 2006 at 1:27 pm
If you have that many, you need to fire the 3rd party butt first out of a cannon into a stone wall
I did do the test, though... 120 k records consisting of 15000 unique records duplicated 8 times...
PW's came in the high 11-second range; the other two code solutions came in the low twelves...
On 512 k records consisting of 2000 unique records duplicated 256 times, PW's came in at 50 seconds, the other two came in at about 58 seconds. Of course, I'm thinking kill the vendor with that many dupes
--Jeff Moden
Change is inevitable... Change for the better is not.