February 20, 2012 at 6:50 am
Hi,
I've used the SOUNDEX function many times, but it seems quite limited, especially here where we often work in more than one language. I did find a solution that deploys CLR versions of Double Metaphone and Levenshtein, which looks quite good. I was curious what other approaches or algorithms people have used for this purpose?
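For example, SOUNDEX misses names that sound alike and collapses names that don't; these are the codes SQL Server produces:

-- Two names that sound alike but get different codes (no match):
SELECT SOUNDEX('Catherine') AS c1, SOUNDEX('Kathryn') AS c2;  -- C365 vs K365

-- Two distinct names that collapse to the same code (a false match):
SELECT SOUNDEX('Robert') AS c1, SOUNDEX('Rupert') AS c2;      -- both R163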
February 20, 2012 at 7:40 pm
I wish I could offer something helpful here, but I suspect the reason you haven't had a response is that the very nature of something like SOUNDEX is that it gets CPU-intensive with any sizable data volume, and my guess is that almost any other method, CLR or not, is ultimately going to have capacity concerns to deal with as well. Fuzzy matching is an inherently expensive task, and once you add in additional languages, things get a LOT more complex very quickly. Given the lack of any other reply so far, I'd imagine the solution approach you found may well be as good as it gets.
It may well be that the number of folks worldwide who have attempted this is so small that they're hard to even find, and folks that involved may well be too busy to find time for a forum post. Please do, if you can, post about your experiences with what you found; chances are that eventually, if not in the short term, someone will find it useful, at least as a guidepost on the topic.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
February 21, 2012 at 12:01 pm
There's a fundamental difference between using SOUNDEX and using something like Levenshtein edit distance. I agree SOUNDEX is quite limited and other methods are much better, but those methods work very differently and have very different performance ramifications.
Suppose you are trying to do fuzzy matching between two sets of 10 rows. SOUNDEX takes 20 calculations, one for each row of each set, after which you compare the generated codes. Levenshtein takes 100 calculations, one for each combination of rows, because the strings are compared directly instead of through a generated (SOUNDEX) value. A single SOUNDEX code can represent many variations of a name/word, but the result is all or nothing: match or no match. A Levenshtein edit distance, on the other hand, tells you how closely two names/words match, so you can weigh match quality instead of settling for all or nothing.
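To make that concrete, here's a rough sketch. SetA, SetB, and the CLR scalar function dbo.Levenshtein are made-up names standing in for whatever you actually deploy:

-- SOUNDEX: one calculation per row (10 + 10 = 20), then an equality join on the codes.
SELECT a.name, b.name
FROM SetA AS a
JOIN SetB AS b
  ON SOUNDEX(a.name) = SOUNDEX(b.name);  -- all-or-nothing match

-- Levenshtein: one calculation per pair (10 x 10 = 100), but you get a score back.
SELECT a.name, b.name,
       dbo.Levenshtein(a.name, b.name) AS edit_distance
FROM SetA AS a
CROSS JOIN SetB AS b
ORDER BY edit_distance;  -- rank by match quality, or filter with a threshold

The SOUNDEX join scales linearly in calculations but tells you nothing about match quality; the cross join scales with the product of the row counts but gives you a distance you can rank or threshold on.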
I tend to do my fuzzy matching by handling as many rows as possible with standard T-SQL string functions and then resorting to CPU-intensive functions (such as Levenshtein edit distance) for the minimum number of rows.
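In rough outline it looks like this (table, column, and function names are illustrative, not my actual schema):

-- Pass 1: cheap T-SQL string functions handle the easy matches.
INSERT INTO Matched (a_id, b_id)
SELECT a.id, b.id
FROM SetA AS a
JOIN SetB AS b
  ON a.name = b.name  -- exact match
  OR REPLACE(a.name, '-', ' ') = REPLACE(b.name, '-', ' ');  -- trivial variants

-- Pass 2: the expensive function only ever sees what's left.
SELECT a.id, b.id,
       dbo.Levenshtein(a.name, b.name) AS edit_distance
FROM SetA AS a
CROSS JOIN SetB AS b
WHERE NOT EXISTS (SELECT 1 FROM Matched AS m WHERE m.a_id = a.id)
  AND NOT EXISTS (SELECT 1 FROM Matched AS m WHERE m.b_id = b.id)
  AND LEFT(a.name, 1) = LEFT(b.name, 1);  -- cheap pre-filter to shrink the cross join

The first-letter pre-filter is just one example of a cheap condition to cut down the pairings; it will miss typos in the first character, so pick something that suits your data.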
February 21, 2012 at 6:06 pm
Sounds to me like a solid approach, with a bit of the old 80-20 rule in play: you use string functions for the "easy stuff" (or at least the easier stuff), and pass the harder matches on to a more comprehensive function for only the apparently relevant rows. I suspect that kind of approach still has a roughly quadratic CPU time growth curve, since the rows passed along get compared pairwise, but as you're in the trenches performing this task, you'll have a far better handle on what your environment can do (as built) than any guesses I might make. I'd also be curious about your method for choosing whether a given match attempt is sent to the string functions or the more comprehensive one; ideally, that decision stays in the linear category with respect to data volume.
Thanks for posting - chances are that at some point, this will be helpful information for anyone who needs to work with word/pattern-matching algorithms.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)