October 28, 2014 at 10:56 pm
Hi Gurus
I have been asked to compare the address fields (three nvarchar(100) columns) of a customer database (around 10,000 records) and find any duplicates. If it were a character-by-character match, I could have just GROUPed and gotten the result.
But I'm expected to produce a list of similar addresses, where the people who entered them may have used slightly different spellings, more or fewer characters, or a "." here and there.
Is there any way to create a query for this?
October 29, 2014 at 1:34 pm
SOUNDEX is one option, albeit not a great one. You might be able to get clever with it to get close enough. You could also pipe the results through a Fuzzy Grouping transformation in SSIS.
It really depends on how accurate the matching needs to be. You can probably get 75% of the way there with minimal effort, but as you try to account for more edge cases, it gets exponentially more complicated.
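A minimal sketch of the SOUNDEX route, assuming a hypothetical Customers table with CustomerID and Address1 columns (both names are placeholders, not from the original post):

```sql
-- Pair up rows whose address line produces the same SOUNDEX code.
-- Customers(CustomerID, Address1) is an assumed schema for illustration.
SELECT  a.CustomerID,
        a.Address1,
        b.CustomerID AS MatchID,
        b.Address1   AS MatchAddress
FROM    Customers AS a
JOIN    Customers AS b
        ON  SOUNDEX(a.Address1) = SOUNDEX(b.Address1)
        AND a.CustomerID < b.CustomerID   -- each pair once, no self-matches
ORDER BY a.CustomerID;
```

Adding `AND DIFFERENCE(a.Address1, b.Address1) = 4` to the join would tighten the matches, at the cost of missing more genuine duplicates.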
October 29, 2014 at 2:01 pm
SOUNDEX may be more trouble than it's worth.
SELECT SOUNDEX('First'), SOUNDEX('Firsyth') returns the same code (F623) for both. Yeah, "Firsyth" is really a street name around here!
Try to standardize the addresses.
In other words, change "Ave.", "Avenue", "Ave", and so forth to "Ave"
Change "1st", "First", etc. to "First".
That should get you to 80%. Like the previous poster said, the edge cases are going to kill you.
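The standardization above can be sketched with nested REPLACEs, assuming the same hypothetical Customers(CustomerID, Address1) table; the token list is illustrative, not exhaustive:

```sql
-- Normalize a few common variants before comparing; extend as needed.
SELECT  CustomerID,
        REPLACE(REPLACE(REPLACE(REPLACE(
            UPPER(Address1),
            'AVENUE', 'AVE'),   -- "Avenue" -> "Ave"
            'AVE.',   'AVE'),   -- "Ave."   -> "Ave"
            '1ST',    'FIRST'), -- "1st"    -> "First"
            '.',      '')       -- drop stray periods
        AS StdAddress
FROM    Customers;
```

In practice a lookup table of (variant, standard) pairs is easier to maintain than a pyramid of REPLACE calls.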
You may want to think about subscribing to one of the many GIS web services. Send it the address; it will standardize it and return longitude, latitude, and any number of other data points. You could then make a better comparison.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
November 5, 2014 at 6:06 am
+1 for Michael's suggestion: pay to have the list cleaned 🙂
SOUNDEX is a good start, but it is designed to work with Anglo-Saxon pronunciations.
You could try pattern matching by removing vowels and numbers; I have had some success with this.
You may also need to replace common terms like "Ave.", "St.", "Mr.", and "Dr." to standardise the data.
SSIS Fuzzy Lookup, or the third-party components from Konesans or Pragmatic Works, might also help.
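The vowel-and-number stripping idea could be sketched like this (TRANSLATE needs SQL Server 2017 or later; on older versions, nest REPLACE calls instead; Customers(CustomerID, Address1) is an assumed schema):

```sql
-- Build a crude match key: uppercase, then vowels and digits removed.
SELECT  CustomerID,
        Address1,
        REPLACE(
            TRANSLATE(UPPER(Address1),
                      'AEIOU0123456789',     -- characters to drop
                      '###############'),    -- mapped to a placeholder
            '#', '') AS MatchKey
FROM    Customers;
```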
November 5, 2014 at 6:14 am
Hire two temps for a week & give them 5,000 addresses each to go through 🙂
November 5, 2014 at 6:16 am
I have to say that Michael's suggestion is probably the best one. Having clean addresses is the best way to find and remove duplicates. Come to think of it, if you do subscribe to a GIS service, you could validate and standardize the data before it's written to the table.
If you don't go with that, you could try removing all spaces, punctuation and standard nouns (Ave, Avenue, Street, etc.) and then comparing the results. You should expect to have some misses, but it'll get you part-way there.
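A sketch of that comparison, grouping on a normalized key (same assumed Customers(CustomerID, Address1) table; the REPLACE list would need to grow to cover "Ave"/"Avenue" and friends):

```sql
-- Strip spaces and common punctuation, then look for keys shared
-- by more than one row.
WITH Keyed AS
(
    SELECT  CustomerID,
            Address1,
            REPLACE(REPLACE(REPLACE(
                UPPER(Address1), ' ', ''), '.', ''), ',', '') AS MatchKey
    FROM    Customers
)
SELECT  MatchKey, COUNT(*) AS Dupes
FROM    Keyed
GROUP BY MatchKey
HAVING  COUNT(*) > 1;
```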