SOUNDEX and DIFFERENCE

  • Issue:

    SELECT SOUNDEX('SMITHWICK')--Returns S532

    SELECT SOUNDEX('SIMONS')--Returns S552

    As you can see S532 & S552 only have one character difference. Therefore the difference should be 3

    SELECT DIFFERENCE('SIMONS','SMITHWICK')--Returns 4 WHAT!?

    SELECT DIFFERENCE('SMITHWICK','SIMONS')--Returns 3 YAY!

    Can anyone tell me why?

  • I can't help. I ran your code and got the same puzzling results. Then I looked through books online and it reads as if the results should be identical regardless of which string appears first. Report it to Microsoft?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Joe Celko (7/12/2010)


    Both of these are VERY weak. Look up

    Levenshtein distance

    Metaphone

    Daitch-Mokotoff

    WagnerFischer

    and always check out

    http://en.wikipedia.org/wiki/Soundex%5B/quote%5D

    +1 - Soundex is, for all business intents and purposes I've seen, useless.

    Double Metaphone is actually pretty decent for "sound-based" matching, and there's an article here (read the discussion as well, as always) that has pure T-SQL code that does it, which I've used; there should also be quite a bit of CLR code that can do it. Always check results between implementations, of course. Note that Double Metaphone generates Primary Value and Secondary Value for String A.

    Jaro-Winkler appears to be pretty decent for "differently typed" matching; again, there's an article here with pure T-SQL. Note that Jaro-Winkler has to match string A against string B.

    I've found that real usage involves splitting up whatever you're matching (name, address) into distinct pieces, and figuring out which piece means what, then matching like meaning pieces together.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply