Compare two varchar for similarities

  • Hi Gurus

    I am asked to compare the address fields (three columns of nvarchar(100) ) of a customer database (around 10,000 records) and find any duplicates. If it is a character by character match, I could have just GROUPed and get the result.

    But, I am expected to produce a list with similar addresses which the guys who entered may have use slightly different spelling or more or less characters, or a "." here and there.

    is there any way to create a query for this?

  • SOUNDEX is one option, albeit not a great one. you might be able to get clever with it to get you close enough. You could pipe the results through a fuzzy grouping stage in SSIS.

    It sort of depends on how accurate you need to be with the matching. You can probably get 75% of the way there with minimal effort. As you try to account for more edge cases, it's going to get exponentially more complicated.

    Executive Junior Cowboy Developer, Esq.[/url]

  • Soundex may be more trouble.

    SELECT SOUNDEX('First'), Soundex('Firsyth'). Yeah, that's really a street name around here!

    Try to standardize the addresses.

    In other words, change "Ave.", "Avenue", "Ave", and so forth to "Ave"

    Change "1st", "First", etc to "First"

    That should get you to 80%. Like the previous poster said, the edge cases are going to kill you.

    You may want to think about subscribing to one of the many GIS web services. Send the address, it will standardize it, and return you longitude, latitude, and any number of other data points. You could then make a better comparison.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • +1 for Michaels suggestion: Pay to have the list cleaned:-)

    SOUNDEX is a good start but is designed to work with Anglo-Saxon pronuciations

    You could try pattern matching by removing vowels and number: I have had some success with this

    You may also need to replace common terms like Ave. St. Mr. Dr. to standardise the data.

    SSIS fuzzy lookup, or the proprietary components from Konesans or pragmatic works might also help.

  • Hire two temps for a week & give them 5,000 addresses each to go through 🙂

  • I have to say that Michael's suggesting is probably the best one. Having clean addresses is the best way to find and remove duplicates. Come to think of it, if you do subscribe to a GIS service, you could validate and standardize the data before it's written to the table.

    If you don't go with that, you could try removing all spaces, punctuation and standard nouns (Ave, Avenue, Street, etc.) and then comparing the results. You should expect to have some misses, but it'll get you part-way there.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply