August 13, 2010 at 1:57 pm
Hello - I have a program which scans the database looking for possible duplicates by matching according to various criteria (first name=first name and last name=last name and first 10 char's of address match, etc., etc.) - anyway, my problem is that, of course, each row of info appears twice - once with the first occurance of "John Smith" and then again with the possible match of "John Smith".
I am having a mental block on how to pick only one of these rows to show in the report. I have recreated a sample table below so you can feel my pain and show me the light, because I have a feeling it's not as knotted as I have worked it up to be today. 🙂
Thank you!
Donna
***************START UPDATE*****************
I have a work-around posted below, but I feel sure there is a more elegant way to do this, so I'm still looking for input, if you've got it. Thanks again!
****************END UPDATE******************
create table #dup_problem
(constituent_id varchar(20),
first_name varchar(50),
last_name varchar(100),
match_id varchar(20),
match_first varchar(50),
match_last varchar(100)
)
insert into #dup_problem values('0001','John','Smith','0010','John','Smith')
insert into #dup_problem values('0010','John','Smith','0001','John','Smith')
insert into #dup_problem values('0002','Mary','McDougall','0020','Mary','McDougall')
insert into #dup_problem values('0020','Mary','McDougall','0002','Mary','McDougall')
insert into #dup_problem values('0003','Guido','Iago','0030','Guido','Iago')
insert into #dup_problem values('0030','Guido','Iago','0003','Guido','Iago')
insert into #dup_problem values('0004','Juniper','Tree','0040','Juniper','Tree')
insert into #dup_problem values('0040','Juniper','Tree','0004','Juniper','Tree')
******************WORK AROUND******************
SELECT distinct P1.*
FROM #dup_problem P1
join #dup_problem P2
on P2.constituent_id = P1.match_id
where P1.constituent_id > P2.constituent_id
August 13, 2010 at 4:15 pm
Is your #dup_problem table the output that you are trying to generate (without the "duplicate" rows of course), or does it match the structure of your source data table?
If the latter is the case, then something similar to your workaround should also work on the source data table, with the join clause including an inequality on the constituent_id (? primary key) column.
For example:
SELECT D1.constituent_id, D1.first_name, D1.last_name,
D2.constituent_id AS match_id, D2.first_name AS match_first, D2.last_name AS match_last
FROM SourceData D1 INNER JOIN SourceData D2
ON (D1.constituent_id < D2.constituent_id)
AND (D1.first_name = D2.first_name)
AND (D1.last_name = D1.last_name)
Edit: If you're stuck with the "duplicate" data in the #dup_problem table, then your "workaround" looks ok to me. When I added a clustered index on the constituent_id column, the query plan used a clustered index scan in combination with a clustered index seek, which I doubt can be much improved on.
August 13, 2010 at 6:36 pm
Yeah, the sample I provided is representative of the output table I end up with - not a lot I can do with the way it's populated (I will do some brainstorming to verify that, though).
Thank you so much for checking it out - I really appreciate that input. I'm always looking for ways to optimize/improve my code, so this was great.
Donna
August 14, 2010 at 6:23 am
Some other ideas:
SELECT *
FROM (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY constituent_id DESC)
FROM #dup_problem P1
) SQ
WHERE SQ.rn = 1;
SELECT P.*
FROM #dup_problem P
JOIN (
SELECT DP.first_name,
DP.last_name,
max_id = MAX(DP.constituent_id)
FROM #dup_problem DP
GROUP BY
DP.first_name,
DP.last_name
) G
ON G.first_name = P.first_name
AND G.last_name = P.last_name
AND G.max_id = P.constituent_id;
SELECT *
FROM (
SELECT *,
max_id = MAX(P.constituent_id) OVER (PARTITION BY first_name, last_name)
FROM #dup_problem P
) SQ
WHERE SQ.constituent_id = SQ.max_id;
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
August 14, 2010 at 1:12 pm
In terms of the self-determining pairs of duplicates supplied in the example dataset, this works quite nicely
WITH cte AS
(
SELECT constituent_id, first_name, last_name
FROM #dup_problem
)
SELECT MAX(constituent_id) AS constituent_id, first_name, last_name, MIN(constituent_id) AS match_id, first_name AS match_first, last_name AS match_last
FROM cte
GROUP BY first_name, last_name
August 16, 2010 at 10:05 am
Hey, some nice variations on a theme! 🙂 One of the many things I love about programming. Thanks, guys!
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply