Using SOUNDEX

  • Hi People

    I have been requested by our client to remove similar businesses from a table in our database...

    so far i have this code which seems pretty good:

    SELECT sp_id,coname FROM procurement_sp WHERE coname

    IN

    (select a.coname FROM Procurement_SP a join

    procurement_sp b

    ON left(a.coname,5) = left(b.coname,5)

    GROUP

    BY a.coname HAVING count(b.coname) > 1) and Enabled = 1 ORDER BY coname

    returns this

    1507 24 SEVEN LOGISTICS

    1344 24 SEVEN LOGUSTICS

    1463 Abaphumelela Trading

    1806 Abaphumeleli Construction and General Services

    1027 ABSA Bank Ltd

    581 ABSA Brokers

    1854 ABSA Fleet Services

    1855 ABSA Vehcile Management Solutions

    1506 ADVIDATA TRADING 600 CC

    544 Advidata Trading 70cc

    1259 Africa Civil and Building Construction

    1289 Africa Legend Indigo

    605 African Genisis Construction

    12 African Plumbing JV

    1511 AFRICA'S UNIVERSAL ENTERPRISES

    558 rows

    from original data:

    1 3PI Water Services

    2 A + B Electrical

    3 A van Niekerk Construction

    4 A.D.Z. Construction

    5 ABS Construction

    6 AC Industrial Sales & Services

    7 ACM Civil / Africon Plumbing J/V

    8 ACM Water Project J/V

    9 Adal Projects

    10 Adams Contractor

    11 ADZ Construction

    12 African Plumbing JV

    13 Africon Plumbing

    14 Afriscan Construction (Pty) Ltd

    15 Afro Structure

    2879 rows

    but i really want to use soundex because i need more accurate matches that doesnt just match the first 5 chars...

    DOES ANYONE KNOW HOW TO USE SOUNDEX/DIFFERENCE PROPERLY?

    chris

    cbmorton FAT Gmale

  • The problem is that soundex / difference don't cut it.

    It doesn't work as well as one would like.

    As for how to use it, it's pretty well documented in BOL.

    If memory serves me, there's also some 'improved' algorithms out there (no links, sorry), though they work marginally better.

    Have to admit it was a couple of years ago that I tried to make use of the native functions as well as the improved code in an attempt to use on names and adresses...

    The bottom line then, unfortuantely, was that it proved to be unusable, more or less.

    /Kenneth

  • We've only ever used Soundex/Difference for things that are then spot checked by a human. Sort of a "shrinking of the haystack so the needle is found much faster". In other words, they can help you get to where you want to go, but you'll still want a navigator.

  • I guess I should answer your second question as well. If you want to do it, I'd use Difference in your case. You'd need to loop through your table, pulling out the value to compare, then do a difference with the rest of the records, looking for 4s (and possibly 3s) as the return value. Obviously, this could take some time, and you'd have some duplication (due to the switching of the two parameters as you move through the table) which you'd have to handle, but it might give you a decent starting point.

    Note that because of the way Soundex works, and therefore Difference as well, you're not going to get good matches on things like:

    Almagamated First Acme Widgets

    Widgets, First Acme Amalgamated

    even though you and I can tell that they are the same. Luckily, temps aren't usually terribly expensive.

  • thanks guys.

    i eventually came up with this but i think i am going to stick with my first bit of code:

    SELECT

    sp_id,coname FROM procurement_sp WHERE coname

    IN

    (select a.coname FROM Procurement_SP a join

    procurement_sp b

    ON left(a.coname,5) = left(b.coname,5)

    WHERE

    a.coname in (select a.coname FROM Procurement_SP a join

    procurement_sp b

    ON DIFFERENCE(a.coname,b.coname) > 3

    GROUP

    BY a.coname HAVING count(b.coname) > 1)

    GROUP

    BY a.coname HAVING count(b.coname) > 1)

    ORDER

    BY coname

     

    which returned 935 rows

  • Hi -- We make our business sorting out names (direct marketing).  Soundex just simply won't get you where you need to go without too much colateral damage.

    Do a Google on "metaphone" or "metafone" or "double metaphone".  This is a much more complex (accurate?) hashing of words according to phonic rules used by the major spell-checker tools.  There are several minor variants on the hashing, so once you pick one, stay with that one.

    The one I have settled on is based off original C++ code and algorithm by Lawrence Philips, Published in the C/C++ Users Journal:

    http://www.cuj.com/articles/2000/0006/0006d/0006d.htm?topic=articles  Original Metaphone presented in article in "Computer Language" in 1990.  It has also been ported to T-SQL as a UDF by a number of people - you should find several listed in your search (I did mine years ago, so I don't have particularly fresh URLs).  Note: For real production purposes, you should build a DLL - it's a lot faster if you have millions of records to sift through.

    Once you have the metafones of the names, you can use your difference() tests and such to decide when it is a dupe.

  • You can also find some more information on SSC regarding metaphone, just go to the SSC search page and search on metaphone.  I checked all three boxes for the search and there are articules, scripts, and discussion threads you can check out.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply