October 15, 2009 at 11:09 am
Hi experts,
I have a question that usually come up in the sql community but with a different twist (i think). I have a table that I am trying to eliminate duplicate data from. The only twist here is that the lngCompanyID, shtCompanyID and dtmModified must all be the same in different rows before it is considered duplicate. If 2 of the columns are the same and 1 is different, it won't be duplicate.
Is there a different way to go about this other than the regular eliminating duplicate "having count / rownumber() > 1" routine?
Below is my test data
CREATE TABLE tDupData
(
lngCompanyID INT,
shtCompanyID INT,
strCompanyName VARCHAR(20),
strAddress VARCHAR(10),
dtmModified DATETIME
)
--Create test data
INSERT INTO tDupData VALUES (1,10,'CompanyOne','Address1','01/15/2003')
INSERT INTO tDupData VALUES (2,20,'CompanyTwo','Address2','01/15/2003')
INSERT INTO tDupData VALUES (3,30,'CompanyThree','Address3','01/15/2003')
INSERT INTO tDupData VALUES (2,40,'CompanyTwo','Address','01/16/2003')
INSERT INTO tDupData VALUES (3,50,'CompanyThree','Address','01/16/2003')
-- Duplicate Data
INSERT INTO tDupData VALUES (1,10,'CompanyOne','Address1','01/15/2003')
GO
Thanks for reading
October 15, 2009 at 12:19 pm
Try this:
;with numbered as(SELECT rowno=row_number() over
(partition by lngCompanyID,shtCompanyID,dtmModified order by lngCompanyID)
, lngCompanyID,shtCompanyID,dtmModified,strAddress from #tDupData)
select * from numbered
Resulting in:
rownolngCompanyIDshtCompanyIDdtmModified strAddress
11 10 2003-01-15 00:00:00.000Address1
21 10 2003-01-15 00:00:00.000Address1
12 20 2003-01-15 00:00:00.000Address2
12 40 2003-01-16 00:00:00.000Address
13 30 2003-01-15 00:00:00.000Address3
13 50 2003-01-16 00:00:00.000Address
If after checking you can replace select * from numbered with a DELETE FROM #TdupData WHERE Rowno > 1 .. Check results again before moving to production.
October 15, 2009 at 3:04 pm
Right on the money. I really appreciate your input. So just pretty much the same thing but include all the 3 columns in the "partition by". Sweeet. You are the best !!!
October 15, 2009 at 3:33 pm
olsentimothy03
So just pretty much the same thing
No not at all close to the same thing. Using a group by with the HAVING clause
SELECT lngCompanyID,shtCompanyID,dtmModified, COUNT(*) AS '#Dup Entries' FROM #tDupData
GROUP BY lngCompanyID,shtCompanyID,dtmModified
HAVING COUNT(*) > 1;
Returned:
lngCompanyIDshtCompanyIDdtmModified #Dup Entries
1 102003-01-15 00:00:00.000 2
Here is where the difference makes life easy. With above result, how do indicate which single row to delete? Easy enough to delete both rows, but as your table was construced not a easy task to code to delete just one row, and think of the difficulty, if you had said 100 or 150 duplicate rows with some individnual rows being duplicated more than once.
October 15, 2009 at 4:49 pm
Understood. I have definitely learnt a lot from you today. Thanks again
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply