March 11, 2016 at 11:26 am
Hi,
I'm having some difficulties figuring out how to do this and would greatly appreciate any assistance anyone can offer.
We're in the process of trying to clean up some of our old data. We have a Customer table that contains some obvious duplicates and some not-so-obvious duplicates. For example, there might be a record with a CustomerName of "Bobs Bar and Grill" and another record with a CustomerName of "Bob'sBar and Gril". The Address might also be identical, and perhaps one digit was mistyped for the Phone field, but they are likely duplicates that we would want to identify.
CREATE TABLE #TempCustomers (CustomerPk INT, CustomerName VARCHAR(100), Address VARCHAR(100), Phone VARCHAR(100))
INSERT INTO #TempCustomers (CustomerPk, CustomerName, Address, Phone) VALUES (1, 'Bobs Bar and Grill', '123 East Street','183-384-3847')
INSERT INTO #TempCustomers (CustomerPk, CustomerName, Address, Phone) VALUES (2, 'Bob''sBar and Gril', '123 East Street','183-334-3847')
INSERT INTO #TempCustomers (CustomerPk, CustomerName, Address, Phone) VALUES (4, 'Bill''s Bikes', '485 West Avenue','776-183-3811')
--What I want to see
SELECT
CustomerPk1 = 1,
CustomerName1 = 'Bobs Bar and Grill',
Address1 = '123 East Street',
Phone1 = '183-384-3847',
CustomerPk2 = 2,
CustomerName2 = 'Bob''sBar and Gril',
Address2 = '123 East Street',
Phone2 = '183-334-3847',
PercentageLikelihoodOfDuplicate = .93 --fudging this number
UNION
SELECT
CustomerPk2 = 2,
CustomerName2 = 'Bob''sBar and Gril',
Address2 = '123 East Street',
Phone2 = '183-334-3847',
CustomerPk1 = 3,
CustomerName1 = 'Bill''s Bikes',
Address1 = '485 West Avenue',
Phone1 = '776-183-3811',
PercentageLikelihoodOfDuplicate = .08 --fudging this number
DROP TABLE #TempCustomers
What I'm hoping to do is find a SQL function that will compare several column values for each record in the table and identify the likelihood of duplicates. Then we can filter out those with a low probability and send those with the higher probability to our users to correct.
Any ideas on how this might be accomplished?
March 12, 2016 at 2:27 pm
In my opinion, pure T-SQL is not a good tool for this. T-SQL expects exact and precise specification, and how do you specify "sort of matches". And even if you do come up with something, the result is probably going to be a very nasty and complex (and slow!) query.
One thing you might want to look into if you want to do it in T-SQL is the SOUNDEX function. It is designed to mostly catch phonetic near-matches (intended to be used when people say their name by telephone and you don't know the exact spelling and perhaps even misheard), and is allegedly not even very good at that, but it might be one step in the right direction for you.
However, I also recommend looking into SSIS. In an SSIS package, you can use components that do what's known as "fuzzy matching". I must admit that I don't know much about these components except that they exist, but I would definitely investigate this if I were you.
March 18, 2016 at 3:04 pm
Hugo,
Thanks for the response! I marked it as the correct solution because I know I've seen SSIS used for this purpose, but I actually ended up dumping the data into Excel and using the free Fuzzy Lookup add-in because it seemed like a quicker and easier solution for a time-sensitive issue.
March 19, 2016 at 1:11 pm
tarr94 (3/18/2016)
Hugo,Thanks for the response! I marked it as the correct solution because I know I've seen SSIS used for this purpose, but I actually ended up dumping the data into Excel and using the free Fuzzy Lookup add-in because it seemed like a quicker and easier solution for a time-sensitive issue.
I'm curious... Where did you get the add-on from?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 21, 2016 at 10:16 am
Here's the link: http://www.microsoft.com/en-us/download/details.aspx?id=15011
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply