June 7, 2007 at 11:57 am
Hello,
I am working with SQL 2000, and I am faced with a project that requires me to find duplicate entries. To begin, I will give an example of what I need to end up with, and hopefully someone can tell me where to start. I have been asked to find all the duplicate entries within a table of 18,099,597 rows of people and give the percentage of duplicates within that table. So duplicate people need to be found, along with the percentage. These people of course have identifying fields such as passport numbers, gender, date of birth, etc. Where should I begin?
June 7, 2007 at 1:16 pm
I'd begin by deciding what columns have to match in order to consider two rows a duplicate of each other. Also ensure you have a candidate key in the table which doesn't include those columns that are in your must-match list. Then you can perform a self-join on the table specifying your matching criteria as the join criteria and a not-equal against the candidate key.
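Something like this sketch, assuming the table and columns from later in the thread (people_info, Last_Name, First_Name, Date_of_Birth) and a hypothetical surrogate key column I'll call Row_Id that is not part of the match criteria:

```sql
-- Sketch only: Row_Id is an assumed candidate key not in the match list.
-- Using < instead of <> avoids listing each duplicate pair twice.
SELECT a.Row_Id, b.Row_Id
FROM people_info a
JOIN people_info b
  ON  a.Last_Name     = b.Last_Name
  AND a.First_Name    = b.First_Name
  AND a.Date_of_Birth = b.Date_of_Birth
  AND a.Row_Id < b.Row_Id
```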
June 7, 2007 at 1:27 pm
I've come up with this query so far, and it tells me some of the duplicates, but not the percentage. I read that I just have to divide the number of rows I get back from the query by the number of original rows in the table, then multiply by 100. Can anyone confirm this?
select Last_Name, First_name, Date_of_Birth, Id_Num, count(*) AS NumOfTimes
from people_info
group by Last_name, First_Name, Date_of_Birth, Id_Num
having count(*) > 1
order by count(*) desc, Last_Name, First_Name
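For the percentage, one way on SQL 2000 (no windowing functions) is to wrap the grouping in a derived table and divide by the total row count. Rough sketch against the same table:

```sql
-- Percentage of rows that belong to a duplicated group.
-- Use SUM(d.NumOfTimes - 1) instead if you only want to count the "extra" copies.
SELECT 100.0 * SUM(d.NumOfTimes)
       / (SELECT COUNT(*) FROM people_info) AS Pct_Duplicate
FROM (
    SELECT Last_Name, First_Name, Date_of_Birth, Id_Num,
           COUNT(*) AS NumOfTimes
    FROM people_info
    GROUP BY Last_Name, First_Name, Date_of_Birth, Id_Num
    HAVING COUNT(*) > 1
) d
```

Note the 100.0 literal: dividing two integers in T-SQL does integer division, so you'd otherwise get 0.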
June 7, 2007 at 1:38 pm
Close... I'd check just for dupe ID_Num's first... the rest of the columns are ancillary in nature...
Or, just take out the ID_Num to see if the same person has been registered twice...
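In other words, something like this (assuming Id_Num is supposed to be unique per person):

```sql
-- Dupe Id_Num's only; the other columns are ancillary.
SELECT Id_Num, COUNT(*) AS NumOfTimes
FROM people_info
GROUP BY Id_Num
HAVING COUNT(*) > 1
```

The converse check (same person registered twice under different Id_Nums) is the same query grouped by Last_Name, First_Name, Date_of_Birth with Id_Num left out.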
--Jeff Moden
Change is inevitable... Change for the better is not.
June 8, 2007 at 8:35 am
I agree; the fewer fields you check, the less likely you are to exclude someone due to a spelling (or name change...) error. If you feel the ID_num is a reliable field, then just do your grouping on it, get your count, and yes, divide by the total number of records for a percentage.
Now for the real kicker. This will tell you the number of exact dupes. What about inexact dupes (an ID_num mis-entered...)? I call these fuzzy or inexact dupes, and they are much harder to find and much harder to quantify. I should have an article out soon on how to use the Fuzzy Grouping task in SSIS to discover fuzzy dupes.
But you could also join the table back to itself on a SOUNDEX of the first name, last name... This will likely overstate your dupes, so it's not my favorite.
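A sketch of that SOUNDEX self-join, tightened with an exact match on date of birth to cut down the false positives (same column names assumed as above):

```sql
-- Fuzzy-ish match: names that sound alike plus the same date of birth.
-- SOUNDEX is available in SQL 2000; expect plenty of false positives.
SELECT a.Id_Num, b.Id_Num
FROM people_info a
JOIN people_info b
  ON  SOUNDEX(a.Last_Name)  = SOUNDEX(b.Last_Name)
  AND SOUNDEX(a.First_Name) = SOUNDEX(b.First_Name)
  AND a.Date_of_Birth       = b.Date_of_Birth
  AND a.Id_Num < b.Id_Num
```

On 18 million rows this join will be slow unless you materialize the SOUNDEX values into an indexed column first.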
June 12, 2007 at 7:17 am
Try this:
SELECT NULL
FROM