Removing duplicate rows with IDs in two columns

Question

Removing duplicate rows with IDs in two columns

dbursey

SSC Eights!

Points: 831
More actions
August 13, 2010 at 1:57 pm

#227738

Hello - I have a program which scans the database looking for possible duplicates by matching according to various criteria (first name=first name and last name=last name and first 10 char's of address match, etc., etc.) - anyway, my problem is that, of course, each row of info appears twice - once with the first occurance of "John Smith" and then again with the possible match of "John Smith".
I am having a mental block on how to pick only one of these rows to show in the report. I have recreated a sample table below so you can feel my pain and show me the light, because I have a feeling it's not as knotted as I have worked it up to be today. 🙂
Thank you!
Donna
***************START UPDATE*****************
I have a work-around posted below, but I feel sure there is a more elegant way to do this, so I'm still looking for input, if you've got it. Thanks again!
****************END UPDATE******************
create table #dup_problem
(constituent_id varchar(20),
first_name varchar(50),
last_name varchar(100),
match_id varchar(20),
match_first varchar(50),
match_last varchar(100)
)
insert into #dup_problem values('0001','John','Smith','0010','John','Smith')
insert into #dup_problem values('0010','John','Smith','0001','John','Smith')
insert into #dup_problem values('0002','Mary','McDougall','0020','Mary','McDougall')
insert into #dup_problem values('0020','Mary','McDougall','0002','Mary','McDougall')
insert into #dup_problem values('0003','Guido','Iago','0030','Guido','Iago')
insert into #dup_problem values('0030','Guido','Iago','0003','Guido','Iago')
insert into #dup_problem values('0004','Juniper','Tree','0040','Juniper','Tree')
insert into #dup_problem values('0040','Juniper','Tree','0004','Juniper','Tree')
******************WORK AROUND******************
SELECT distinct P1.*
FROM #dup_problem P1
join #dup_problem P2
on P2.constituent_id = P1.match_id
where P1.constituent_id > P2.constituent_id

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

andrewd.smith SSCrazy Eights Points: 8444 More actions · Answer 1

Is your #dup_problem table the output that you are trying to generate (without the "duplicate" rows of course), or does it match the structure of your source data table?

If the latter is the case, then something similar to your workaround should also work on the source data table, with the join clause including an inequality on the constituent_id (? primary key) column.

For example:

SELECT D1.constituent_id, D1.first_name, D1.last_name,

D2.constituent_id AS match_id, D2.first_name AS match_first, D2.last_name AS match_last

FROM SourceData D1 INNER JOIN SourceData D2

ON (D1.constituent_id < D2.constituent_id)

AND (D1.first_name = D2.first_name)

AND (D1.last_name = D1.last_name)

Edit: If you're stuck with the "duplicate" data in the #dup_problem table, then your "workaround" looks ok to me. When I added a clustered index on the constituent_id column, the query plan used a clustered index scan in combination with a clustered index seek, which I doubt can be much improved on.

dbursey SSC Eights! Points: 831 More actions · Answer 2

Yeah, the sample I provided is representative of the output table I end up with - not a lot I can do with the way it's populated (I will do some brainstorming to verify that, though).

Thank you so much for checking it out - I really appreciate that input. I'm always looking for ways to optimize/improve my code, so this was great.

Donna

Paul White SSC Guru Points: 150467 More actions · Answer 3

Some other ideas:

SELECT *

FROM (

SELECT *,

rn = ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY constituent_id DESC)

FROM #dup_problem P1

) SQ

WHERE SQ.rn = 1;

SELECT P.*

FROM #dup_problem P

JOIN (

SELECT DP.first_name,

DP.last_name,

max_id = MAX(DP.constituent_id)

FROM #dup_problem DP

GROUP BY

DP.first_name,

DP.last_name

) G

ON G.first_name = P.first_name

AND G.last_name = P.last_name

AND G.max_id = P.constituent_id;

SELECT *

FROM (

SELECT *,

max_id = MAX(P.constituent_id) OVER (PARTITION BY first_name, last_name)

FROM #dup_problem P

) SQ

WHERE SQ.constituent_id = SQ.max_id;

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

steve-893342 SSCarpal Tunnel Points: 4439 More actions · Answer 4

In terms of the self-determining pairs of duplicates supplied in the example dataset, this works quite nicely

WITH cte AS

(

SELECT constituent_id, first_name, last_name

FROM #dup_problem

)

SELECT MAX(constituent_id) AS constituent_id, first_name, last_name, MIN(constituent_id) AS match_id, first_name AS match_first, last_name AS match_last

FROM cte

GROUP BY first_name, last_name

dbursey SSC Eights! Points: 831 More actions · Answer 5

Hey, some nice variations on a theme! 🙂 One of the many things I love about programming. Thanks, guys!