Compare two varchar for similarities

Question

Compare two varchar for similarities

meelan

SSC Eights!

Points: 997
More actions
October 28, 2014 at 10:56 pm

#312089

Hi Gurus
I am asked to compare the address fields (three columns of nvarchar(100) ) of a customer database (around 10,000 records) and find any duplicates. If it is a character by character match, I could have just GROUPed and get the result.
But, I am expected to produce a list with similar addresses which the guys who entered may have use slightly different spelling or more or less characters, or a "." here and there.
is there any way to create a query for this?

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Xedni SSCertifiable Points: 6951 More actions · Answer 1

SOUNDEX is one option, albeit not a great one. you might be able to get clever with it to get you close enough. You could pipe the results through a fuzzy grouping stage in SSIS.

It sort of depends on how accurate you need to be with the matching. You can probably get 75% of the way there with minimal effort. As you try to account for more edge cases, it's going to get exponentially more complicated.

Executive Junior Cowboy Developer, Esq.[/url]

Michael L John One Orange Chip Points: 27056 More actions · Answer 2

Soundex may be more trouble.

SELECT SOUNDEX('First'), Soundex('Firsyth'). Yeah, that's really a street name around here!

Try to standardize the addresses.

In other words, change "Ave.", "Avenue", "Ave", and so forth to "Ave"

Change "1st", "First", etc to "First"

That should get you to 80%. Like the previous poster said, the edge cases are going to kill you.

You may want to think about subscribing to one of the many GIS web services. Send the address, it will standardize it, and return you longitude, latitude, and any number of other data points. You could then make a better comparison.

Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/

aaron.reese SSChampion Points: 13498 More actions · Answer 3

+1 for Michaels suggestion: Pay to have the list cleaned:-)

SOUNDEX is a good start but is designed to work with Anglo-Saxon pronuciations

You could try pattern matching by removing vowels and number: I have had some success with this

You may also need to replace common terms like Ave. St. Mr. Dr. to standardise the data.

SSIS fuzzy lookup, or the proprietary components from Konesans or pragmatic works might also help.

Gazareth One Orange Chip Points: 27737 More actions · Answer 4

Hire two temps for a week & give them 5,000 addresses each to go through 🙂

Ed Wagner SSC Guru Points: 287026 More actions · Answer 5

I have to say that Michael's suggesting is probably the best one. Having clean addresses is the best way to find and remove duplicates. Come to think of it, if you do subscribe to a GIS service, you could validate and standardize the data before it's written to the table.

If you don't go with that, you could try removing all spaces, punctuation and standard nouns (Ave, Avenue, Street, etc.) and then comparing the results. You should expect to have some misses, but it'll get you part-way there.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions