September 12, 2007 at 8:00 am
Hi People
I have been requested by our client to remove similar businesses from a table in our database...
so far i have this code which seems pretty good:
SELECT sp_id,coname FROM procurement_sp WHERE coname
IN
(select a.coname FROM Procurement_SP a join
procurement_sp b
ON left(a.coname,5) = left(b.coname,5)
GROUP
BY a.coname HAVING count(b.coname) > 1) and Enabled = 1 ORDER BY coname
returns this
1507 24 SEVEN LOGISTICS
1344 24 SEVEN LOGUSTICS
1463 Abaphumelela Trading
1806 Abaphumeleli Construction and General Services
1027 ABSA Bank Ltd
581 ABSA Brokers
1854 ABSA Fleet Services
1855 ABSA Vehcile Management Solutions
1506 ADVIDATA TRADING 600 CC
544 Advidata Trading 70cc
1259 Africa Civil and Building Construction
1289 Africa Legend Indigo
605 African Genisis Construction
12 African Plumbing JV
1511 AFRICA'S UNIVERSAL ENTERPRISES
558 rows
from original data:
1 3PI Water Services
2 A + B Electrical
3 A van Niekerk Construction
4 A.D.Z. Construction
5 ABS Construction
6 AC Industrial Sales & Services
7 ACM Civil / Africon Plumbing J/V
8 ACM Water Project J/V
9 Adal Projects
10 Adams Contractor
11 ADZ Construction
12 African Plumbing JV
13 Africon Plumbing
14 Afriscan Construction (Pty) Ltd
15 Afro Structure
2879 rows
but i really want to use soundex because i need more accurate matches that doesnt just match the first 5 chars...
DOES ANYONE KNOW HOW TO USE SOUNDEX/DIFFERENCE PROPERLY?
chris
cbmorton FAT Gmale
September 12, 2007 at 8:50 am
The problem is that soundex / difference don't cut it.
It doesn't work as well as one would like.
As for how to use it, it's pretty well documented in BOL.
If memory serves me, there's also some 'improved' algorithms out there (no links, sorry), though they work marginally better.
Have to admit it was a couple of years ago that I tried to make use of the native functions as well as the improved code in an attempt to use on names and adresses...
The bottom line then, unfortuantely, was that it proved to be unusable, more or less.
/Kenneth
September 12, 2007 at 12:13 pm
We've only ever used Soundex/Difference for things that are then spot checked by a human. Sort of a "shrinking of the haystack so the needle is found much faster". In other words, they can help you get to where you want to go, but you'll still want a navigator.
September 12, 2007 at 12:22 pm
I guess I should answer your second question as well. If you want to do it, I'd use Difference in your case. You'd need to loop through your table, pulling out the value to compare, then do a difference with the rest of the records, looking for 4s (and possibly 3s) as the return value. Obviously, this could take some time, and you'd have some duplication (due to the switching of the two parameters as you move through the table) which you'd have to handle, but it might give you a decent starting point.
Note that because of the way Soundex works, and therefore Difference as well, you're not going to get good matches on things like:
Almagamated First Acme Widgets
Widgets, First Acme Amalgamated
even though you and I can tell that they are the same. Luckily, temps aren't usually terribly expensive.
September 13, 2007 at 12:18 am
thanks guys.
i eventually came up with this but i think i am going to stick with my first bit of code:
SELECT
sp_id,coname FROM procurement_sp WHERE coname
IN
(select a.coname FROM Procurement_SP a join
procurement_sp b
ON left(a.coname,5) = left(b.coname,5)
WHERE
a.coname in (select a.coname FROM Procurement_SP a join
procurement_sp b
ON DIFFERENCE(a.coname,b.coname) > 3
GROUP
BY a.coname HAVING count(b.coname) > 1)
GROUP
BY a.coname HAVING count(b.coname) > 1)
ORDER
BY coname
which returned 935 rows
September 13, 2007 at 5:52 am
the following links may be useful for this exercise....
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=55210&SearchTerms=soundex,algo
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574&SearchTerms=soundex,algo
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=5238&SearchTerms=soundex,algo
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=66781&SearchTerms=soundex,algo
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=84806&SearchTerms=soundex,algo
September 13, 2007 at 7:40 am
Hi -- We make our business sorting out names (direct marketing). Soundex just simply won't get you where you need to go without too much colateral damage.
Do a Google on "metaphone" or "metafone" or "double metaphone". This is a much more complex (accurate?) hashing of words according to phonic rules used by the major spell-checker tools. There are several minor variants on the hashing, so once you pick one, stay with that one.
The one I have settled on is based off original C++ code and algorithm by Lawrence Philips, Published in the C/C++ Users Journal:
http://www.cuj.com/articles/2000/0006/0006d/0006d.htm?topic=articles Original Metaphone presented in article in "Computer Language" in 1990. It has also been ported to T-SQL as a UDF by a number of people - you should find several listed in your search (I did mine years ago, so I don't have particularly fresh URLs). Note: For real production purposes, you should build a DLL - it's a lot faster if you have millions of records to sift through.
Once you have the metafones of the names, you can use your difference() tests and such to decide when it is a dupe.
September 13, 2007 at 1:36 pm
You can also find some more information on SSC regarding metaphone, just go to the SSC search page and search on metaphone. I checked all three boxes for the search and there are articules, scripts, and discussion threads you can check out.
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply