July 13, 2003 at 12:00 am
Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/ccubley/findinganddeletingduplicatedata.asp
Chris Cubley
www.queryplan.com
July 25, 2003 at 2:11 am
Neatly encapsulates several techniques in one place. I will save it in my bits of useful code.
Graham Farrow
DBA
Mastercare Central Division
July 25, 2003 at 3:45 am
Useful topic to cover but isn't the final solution over-complex?
Couldn't you just use an inequality of PaymentID's on a self-join as follows..?
DELETE Payment
WHERE PaymentID IN
(SELECT P1.PaymentID
FROM Payment P1
INNER JOIN Payment P2
ON P1.CustomerNumber = p2.CustomerNumber
AND
p1.PostedDatetime = p2.PostedDatetime
AND
p1.PaymentAmt = p2.PaymentAmt
AND
p1.PaymentID > p2.PaymentID)
Mark
July 25, 2003 at 5:49 am
This is not as complex as using temporary tables and scrubbing processes and then moving to live tables.
That was my first atempt.
July 25, 2003 at 7:49 am
Excellent article ! This is something I plan to use on one of my tables later today.
July 26, 2003 at 5:53 pm
Good Article! Thanks.
July 29, 2003 at 10:18 am
I think you can use this approach too
--table structure
CREATE TABLE [a4] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[title] [varchar] (50) ,
[name] [varchar] (50) NULL ,
CONSTRAINT [PK_a4] PRIMARY KEY CLUSTERED
(
[id]
) ON [PRIMARY]
) ON [PRIMARY]
GO
--to have distinct id for the doubles
--the first id's
select distinct a.id, a.title, a.name from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id < b.id
--the last id's
select distinct a.id, a.title, a.name from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id > b.id
--to delete the first id's doubles and get unique rows in the original table
delete from a4 where id in (
select distinct a.id from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id < b.id )
--to delete the last id's doubles and get unique rows in the original table
delete from a4 where id in (
select distinct a.id from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id > b.id )
July 29, 2003 at 1:18 pm
Just to follow up, I followed all of the sample code and everything worked great the first time. The article was well written and the logic was clear in describing why the steps were necessary. Keep up the good work !!!
July 26, 2004 at 11:53 am
This is a pretty good technique, but there are problems with it. Using derived tables in a query can be very powerful, but they can also cause excessive use of tempdb (similar to using # tables). Often, there are other ways of doing the same thing for cheaper.
So, if you don't have to use them, don't. The query below is functionally equivalent to the delete statement in the article, but it costs almost half.
delete p1
From Payment p1
JOIN Payment p2
on p1.CustomerNumber = p2.CustomerNumber and
p1.PostedDateTime = p2.PostedDateTime and
p1.PaymentAmt = p2.PaymentAmt
where p1.PaymentID < p2.PaymentID
Signature is NULL
July 26, 2004 at 11:55 am
Here's the full script:
if object_ID('tempdb..#Payment') is not null drop table #Payment
CREATE TABLE #Payment(
PaymentID int identity Not nUll Primary Key,
CustomerNumber int NOT NULL,
PostedDatetime datetime NOT NULL,
PaymentAmt money NOT NULL,
PaymentMemo varchar(200) NOT NULL,
CCRefNum char(10) NOT NULL
)
Insert #Payment Values (1, '01/01/2004', 1, '', '')
Insert #Payment Values (1, '01/01/2004', 1, '', '')
Insert #Payment Values (2, '01/02/2004', 2, '', '')
Insert #Payment Values (2, '01/02/2004', 2, '', '')
Insert #Payment Values (2, '01/02/2004', 2, '', '')
Insert #Payment Values (3, '01/03/2004', 3, '', '')
Insert #Payment Values (3, '01/03/2004', 3, '', '')
Insert #Payment Values (3, '01/03/2004', 3, '', '')
Insert #Payment Values (3, '01/03/2004', 3, '', '')
delete p1
From #Payment p1
JOIN #Payment p2
on p1.CustomerNumber = p2.CustomerNumber and
p1.PostedDateTime = p2.PostedDateTime and
p1.PaymentAmt = p2.PaymentAmt
where p1.PaymentID < p2.PaymentID
DELETE FROM
p1
FROM
#Payment p1
INNER JOIN
(
SELECT
MAX(PaymentID) AS PaymentID,
CustomerNumber,
PostedDatetime,
PaymentAmt
FROM
#Payment
GROUP BY
CustomerNumber,
PostedDatetime,
PaymentAmt
HAVING
COUNT(*) > 1
  p2
ON(
p1.CustomerNumber = p2.CustomerNumber
AND
p1.PostedDatetime = p2.PostedDatetime
AND
p1.PaymentAmt = p2.PaymentAmt
AND
p1.PaymentID <> p2.PaymentID
 
Signature is NULL
October 31, 2007 at 3:56 pm
Thank you to Calvin Lawsen, Posted 7/26/2004 6:55 PM. I tried and it worked first pass.
However, I had to modify a bit to make the process delete the records with the earliest date and time stamp, leaving only the most recent record. However, upon closer scrutiny, I noticed that the second part of your code segment, i.e. beginning with "DELETE FROM
p1......" didn't seem necessary?! The first code segment worked just fine.
Am I missing something or was this an oversight?
October 31, 2007 at 3:59 pm
To Calvin Lawson - I should hav added the modified code as follows:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE SP_TEST01
AS
BEGIN
SET NOCOUNT ON;
if object_ID('dbo.TBLTEST01') is not null drop table TBLTEST01
CREATE TABLE TBLTEST01(
ID int identity Not NUll Primary Key,
CustNum int NOT NULL,
PostedDate datetime NOT NULL,
PaymentAmt money NOT NULL,
PaymentMemo varchar(200) NOT NULL,
CCRefNum char(10) NOT NULL
)
Insert TBLTEST01 Values (1, '01/10/2004', 1, '', '')
Insert TBLTEST01 Values (1, '01/11/2004', 1, 'Keep', '')
Insert TBLTEST01 Values (2, '01/10/2004', 2, '', '')
Insert TBLTEST01 Values (2, '01/11/2004', 2, '', '')
Insert TBLTEST01 Values (2, '01/12/2004', 2, 'Keep', '')
Insert TBLTEST01 Values (3, '01/10/2004', 3, '', '')
Insert TBLTEST01 Values (3, '01/11/2004', 3, '', '')
Insert TBLTEST01 Values (3, '01/12/2004', 3, '', '')
Insert TBLTEST01 Values (3, '01/13/2004', 3, 'Keep', '')
delete p1
From TBLTEST01 p1
JOIN TBLTEST01 p2
on p1.CustNum = p2.CustNum and
p1.PostedDate < p2.PostedDate and
p1.PaymentAmt = p2.PaymentAmt
where p1.ID < p2.ID
END
Thanks again
March 13, 2009 at 4:12 am
Hi,
Great Post. This will definately help me. I tried manipulating your code to work for me and i get an "Msg 207, Level 16, State 1, Line 27 Invalid column name 'iID'." when i want o delete the duplicates and retail one copy of the duplicates. The error is with p2.iid. The column is there and i am not sure why i am getting this error.
Can you please help?
--deleteing duplicated rows with keeping 1 copy of the duplicated row.
DELETE FROM
p1
FROM
VehicleHistory p1
INNER JOIN
(
SELECT
MAX(iID) AS ColumnID,
ivehicleid,
dtdatetime,
iOdometer
FROM
VehicleHistory
GROUP BY
ivehicleid,
dtdatetime,
iOdometer
HAVING
COUNT(*) > 1) p2
ON
(p1.ivehicleid = p2.ivehicleid
AND
p1.dtdatetime = p2.dtdatetime
AND
p1.iOdometer = p2.iOdometer
and
p1.iID <> p2.iID)
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply