Find Duplicate Records in Table and Update one of them.

Question

Find Duplicate Records in Table and Update one of them.

kstarkes

Old Hand

Points: 316
More actions
May 6, 2016 at 11:39 am

#400122

Trying to figure a way to update the 2nd of 2 Duplicate Records
REQUESTID OTHERSYSTEMID
951 454215
952 454215
I would like to first search for all duplicate OTHERSYSTEMID'S in the table and then append '_D' to the end of the 2nd Duplicate Record for All duplicates found.
ie. REQUESTID OTHERSYSTEMID
951 454215
952 454215_D

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 1

have you considered using ROW_NUMBER() ?

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Lowell SSC Guru Points: 323518 More actions · Answer 2

a really nice feature is you an update on a cte affects the base table, so if you generate an identifier for dupes, you update everything that is a dupe.

WITH MyCTE

AS

(

SELECT ROW_NUMBER() OVER (PARTITION BY REQUESTID,OTHERSYSTEMID ORDER BY REQUESTID,OTHERSYSTEMID) AS RW,*

FROM MyTable

)

--UPDATE MyTarget SET OTHERSYSTEMID = OTHERSYSTEMID + '_D' --what you asked

--UPDATE MyTarget SET OTHERSYSTEMID = OTHERSYSTEMID + '_D' + CONVERT(varchar,RW) --D2,D3,D4 etc

SELECT OTHERSYSTEMID , OTHERSYSTEMID + '_D' + CONVERT(varchar,RW) As NewID,* FROM MyTable AS MyTarget

WHERE RW >1 --is it ok that dupes 2 thru a million all have _D? maybe it shoudl count the dupes?

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Con Alexis Old Hand Points: 331 More actions · Answer 3

Depending on number of rows, and indexes etc... but I would usually do a "GROUP BY" to identify dupes together with MIN to identify original record, then joining back onto main table (ignoring the original record)

e.g.

DECLARE @Tbl TABLE ( REQUESTID int, OTHERSYSTEMID varchar(50) )

INSERT INTO @Tbl ( REQUESTID, OTHERSYSTEMID ) VALUES ( 951, '454215'), (952, '454215' ), ( 953, '454215' )

;

UPDATE Dat SET Dat.OTHERSYSTEMID = Dat.OTHERSYSTEMID + '_D'

FROM ( SELECT MIN( REQUESTID ) AS MIN_REQUESTID, OTHERSYSTEMID FROM @Tbl GROUP BY OTHERSYSTEMID HAVING COUNT(*) > 1 ) Dupes

INNER JOIN @Tbl Dat ON Dat.OTHERSYSTEMID = Dupes.OTHERSYSTEMID AND Dat.REQUESTID != Dupes.MIN_REQUESTID

SELECT * FROM @Tbl

drew.allen SSC Guru Points: 76996 More actions · Answer 4

Con Alexis (5/8/2016)
Depending on number of rows, and indexes etc... but I would usually do a "GROUP BY" to identify dupes together with MIN to identify original record, then joining back onto main table (ignoring the original record)
e.g.
DECLARE @Tbl TABLE ( REQUESTID int, OTHERSYSTEMID varchar(50) )
INSERT INTO @Tbl ( REQUESTID, OTHERSYSTEMID ) VALUES ( 951, '454215'), (952, '454215' ), ( 953, '454215' )
;
UPDATE Dat SET Dat.OTHERSYSTEMID = Dat.OTHERSYSTEMID + '_D'
FROM ( SELECT MIN( REQUESTID ) AS MIN_REQUESTID, OTHERSYSTEMID FROM @Tbl GROUP BY OTHERSYSTEMID HAVING COUNT(*) > 1 ) Dupes
INNER JOIN @Tbl Dat ON Dat.OTHERSYSTEMID = Dupes.OTHERSYSTEMID AND Dat.REQUESTID != Dupes.MIN_REQUESTID
SELECT * FROM @Tbl

It is usually much more efficient to use a CTE with a ROW_NUMBER(), because you only need to read the records once instead of twice (once for the GROUP BY and once for the actual update). ROW_NUMBER() also has the advantage that it's more flexible in that you can specify any order within the partition.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

prafuljemail SSC Veteran Points: 231 More actions · Answer 5

update t

set t.id1=999

from t b join

(

select id1,max(id) as id from t

group by id1 ) a on a.id=b.id

select * from t

Lowell SSC Guru Points: 323518 More actions · Answer 6

prafuljemail (5/9/2016)
update t
set t.id1=999
from t b join
(
select id1,max(id) as id from t
group by id1 ) a on a.id=b.id
select * from t

The group by is fine if you KNOW there is only one duplicate record per OtherSystemID.

if there is more than one duplicate per ID, group by logic will only fix one of them, you'd have to rerun it repetatively for N-1 times.

the row number version does not suffer from that issue.

try any of the above solutions in the situation where the data is like this:

REQUESTID OTHERSYSTEMID

951 454215

952 454215

953 454215

954 454215

955 454215

961 454216

962 454216

963 454217

964 454217

965 454217

;WITH MyCTE([REQUESTID],[OTHERSYSTEMID],[])

AS

(

SELECT '951','454215' UNION ALL

SELECT '952','454215' UNION ALL

SELECT '953','454215' UNION ALL

SELECT '954','454215' UNION ALL

SELECT '955','454215' UNION ALL

SELECT '961','454216' UNION ALL

SELECT '962','454216' UNION ALL

SELECT '963','454217' UNION ALL

SELECT '964','454217' UNION ALL

SELECT '965','454217'

)

SELECT * FROM MyCTE;

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

drew.allen SSC Guru Points: 76996 More actions · Answer 7

Lowell (5/9/2016)
prafuljemail (5/9/2016)
update t
set t.id1=999
from t b join
(
select id1,max(id) as id from t
group by id1 ) a on a.id=b.id
select * from t
The group by is fine if you KNOW there is only one duplicate record per OtherSystemID.
if there is more than one duplicate per ID, group by logic will only fix one of them, you'd have to rerun it repetatively for N-1 times.
the row number version does not suffer from that issue.
try any of the above solutions in the situation where the data is like this:
REQUESTID OTHERSYSTEMID
951 454215
952 454215
953 454215
954 454215
955 454215
961 454216
962 454216
963 454217
964 454217
965 454217
;WITH MyCTE([REQUESTID],[OTHERSYSTEMID],[])
AS
(
SELECT '951','454215' UNION ALL
SELECT '952','454215' UNION ALL
SELECT '953','454215' UNION ALL
SELECT '954','454215' UNION ALL
SELECT '955','454215' UNION ALL
SELECT '961','454216' UNION ALL
SELECT '962','454216' UNION ALL
SELECT '963','454217' UNION ALL
SELECT '964','454217' UNION ALL
SELECT '965','454217'
)
SELECT * FROM MyCTE;

I disagree. The GROUP BY causes approximately twice as many reads as the ROW_NUMBER in the general case, and in this specific case, he is creating duplicates on id 999 that he is not accounting for, so he will run into problems as he finds more duplicates.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

prafuljemail SSC Veteran Points: 231 More actions · Answer 8

hey, do you get solution or not.

i have updated my query with example and sample data on winmilestone com forum database queries

kstarkes Old Hand Points: 316 More actions · Answer 9

@drew,Lowel,prafuljemail,Con

Thanks guys,

The Group By Clause will definitely work in this case because I know there will only be 1 Duplicate OTHERSYSTEMID Per RequestID at any give time.

I'll test the solution(s) given above on a test DB to verify.

nialllinden SSC Enthusiast Points: 158 More actions · Answer 10

Id use rownumber.

its so flexible and powerful for when you have duplicate records

try the following.

CREATE TABLE ##X

(COL1 NVARCHAR(10),COL2 NVARCHAR(10))

INSERT INTO ##X(COL1,COL2)

SELECT

'951' AS COL1,'454215' AS COL2

UNION ALL

SELECT '952','454215' UNION ALL

SELECT '953','454215' UNION ALL

SELECT '954','454215' UNION ALL

SELECT '955','454215' UNION ALL

SELECT '961','454216' UNION ALL

SELECT '962','454216' UNION ALL

SELECT '963','454217' UNION ALL

SELECT '964','454217' UNION ALL

SELECT '965','454217'

UPDATE A

SET COL2=COL2+'_D'

FROM (

SELECT *,ROW_NUMBER () OVER (PARTITION BY COL2 ORDER BY COL2) NUM FROM ##X) AS a

WHERE NUM>1

SELECT * FROM ##X