October 8, 2015 at 7:55 am
Does anyone have any useful SQL Queries that might be used to identify lists of potential duplicate records in a table?
For example I have Client Database that includes a table dbo.Clients. This table contains various columns which could be used to identify possible duplicate records, such as Surname | Forenames | DateOfBirth | NINumber | PostalCode etc. . The data contained in these columns is not always exactly the same due to differences caused by user data entry; so some records may have missing data from some of the columns and there could be spelling differences too. Like the following examples:
1 | Smith | John Raymond | NULL | NI990946B | SW12 8TQ
2 | Smith | John | 06/03/1967 | NULL | SW12 8TQ
3 | Smith | Jon Raymond | 06/03/1967 | NI 99 09 46 B | SW12 8TQ
The problem is that whilst it is easy for a human being to review these 3 entries and conclude that they are most likely the same Client entered in to the database 3 times; I cannot find a reliable way of identifying them using a SQL Query.
I've considered using some sort of concatenation to a new column, minus white space and then using a "WHERE column_name LIKE pattern" query, but so far I can't get anything to work well enough. Fuzzy Logic maybe?
Any ideas or help gratefully received. Ideally the results would produce a grid something like this for the example above:
ID | Surname | Forenames | DuplicateID | DupSurname | DupForenames
1 | Smith | John Raymond | 2 | Smith | John
1 | Smith | John Raymond | 3 | Smith | Jon Raymond
9 | Brown | Peter David | 343 | Brown | Pete D
next batch of duplicates etc etc . . . .
October 8, 2015 at 8:02 am
How about:
SELECT COUNT(*), Column_to_Check1, Column_to_Check2,
FROM Table
GROUP BY Column_to_Check1, Column_to_Check2,
HAVING COUNT(*) > 1
That will provide a list of duplicates based upon the columns provided.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
October 8, 2015 at 8:15 am
Unfortunately that only works for exact duplicates and so is not a solution to the problem as I explained it.
Thanks anyway.
October 8, 2015 at 8:38 am
The problem as you've described it is very unlikely to be solved using queries. The only thing you could look to would be something like SOUNDEX. That will help for 'John' vs. 'Jon', but won't help for 'John Raymond' vs. 'John'. For that, you need a mark I eyeball attached to a functioning brain. This kind of problem comes up a lot. It's a big reason why good data normalization, correct data types, enforced referential integrity, is all so important to getting good data out of a database. If you have LastName, FirstName and MiddleName as fields, it makes it much easier to employ something like SOUNDEX to find the places where it's just spelling differences.
Data cleansing is a huge task. It's funny, a lot of companies that went with NoSQL, unstructured data stores are now suddenly finding that their data is so filthy it's basically unusable. Structured storage is here to stay.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 8, 2015 at 8:54 am
There was a good article written on Soundex http://www.sqlservercentral.com/articles/Programming/101836%5B/url%5D. You may be able to leverage this to a certain degree.
One technique that I've seen used successfully is to take the Soundex rules, and instead of following them, use the wildcard character as a replacements.
As an example, the first Soundex rule is retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
So, Michael John would become M%ch%%l J%%n
And so forth with the rest of the rules.
I'm in agreement with Grant though. This is not a trivial undertaking. If there is any advice I have, it would be to break it into smaller chunks. Remove the obvious ones, and then work through the harder ones.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
October 8, 2015 at 9:14 am
One possibility is to cross join the table to itself (you may want to tone that down a bit if the table is large) and give the matching columns a weight, and then pick those with the highest weight
e.g. something like this
SELECT t1.surname, t2.surname.....,
CASE WHEN t1.surname = t2.surname THEN 1 ELSE 0 END +
CASE WHEN t1.forenames = t2.forenames THEN 1 ELSE 0 END
CASE WHEN t1.dateofbirth = t2.dateofbirth THEN 1 ELSE 0 END AS MatchWeight
FROM table t1 CROSS JOIN table t2
WHERE t1.primarkykey <> t2.primarykey
ORDER BY MatchWeight DESC
I only used 1 and 0 as example weights, so you could give exact matches a higher weight and give soundex matches something lower
October 8, 2015 at 9:18 am
Interesting idea. Thanks.
October 8, 2015 at 9:19 am
My thanks to both Grant and Michael for these comments and the advice.
I agree completely with the points made regarding the importance of a good structured database, unfortunately this doesn't help when you are looking after a database you did not design.
I will try working with soundex and will report back anything that looks remotely like success.
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply