October 13, 2009 at 6:25 am
I googled an open source function that computes a set of metrics, each returning a similarity coefficient for a pair of strings. Comparing its results with those produced by a package that uses the Fuzzy Lookup transformation, I found that the NeedlemanWunch and SmithWaterman metrics are the closest matches, but with a relatively limited amount of test data I still can't decide whether to use either. My question is: does SQL Server use any specific metric in the Fuzzy Lookup and Fuzzy Grouping transformations?
The metrics available in the mentioned function are the following:
BlockDistance
CosineSimilarity
DiceSimilarity
EuclideanDistance
JaccardSimilarity
MatchingCoefficient
OverlapCoefficient
ChapmanMeanLength
QGramsDistance
Levenstein
MongeElkan
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
NeedlemanWunch
Jaro
JaroWinkler
ChapmanLengthDeviation
Thanks,
Samer.
October 13, 2009 at 9:59 am
Levenshtein is the basic edit distance function, in which the distance is simply the minimum total cost of the edit operations that transform string1 into string2. The edit operations are as follows:
Copy character from string1 over to string2 (cost 0)
Delete a character in string1 (cost 1)
Insert a character in string2 (cost 1)
Substitute one character for another (cost 1)
D(i,j) = min( D(i-1,j-1) + d(si,tj),   // substitute/copy
              D(i-1,j) + 1,            // insert
              D(i,j-1) + 1 )           // delete
Here d(a,b) is the substitution cost function: d(a,b) = 0 if a = b, and 1 otherwise.
There are many extensions to the Levenshtein distance function. Typically these alter the cost function d, but further extensions are possible; for instance, the Needleman-Wunsch distance, to which Levenshtein is equivalent when the gap cost is 1. For the terms "sam chapman" and "sam john chapman", the Levenshtein distance is 5, the value in the bottom-right cell of the distance matrix. This score indicates that 5 unit-cost edit operations are required to match the strings (for example, inserting the characters "john ", although a number of other routes can be traversed instead).
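As a rough illustration of the recurrence above, here is a minimal Python sketch of the unit-cost Levenshtein distance. The function name `levenshtein` is my own choice, and this is not taken from the open source function mentioned in this thread or from SQL Server; it only shows the textbook dynamic-programming approach, keeping two rows of the matrix instead of the whole table.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance between s and t (textbook dynamic programming)."""
    # prev[j] holds D(i-1, j); the first row is D(0, j) = j.
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        # curr[0] is D(i, 0) = i.
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1          # d(si, tj)
            curr.append(min(prev[j - 1] + cost,  # substitute/copy
                            prev[j] + 1,         # D(i-1, j) + 1
                            curr[j - 1] + 1))    # D(i, j-1) + 1
        prev = curr
    # The last cell corresponds to the bottom-right cell of the full matrix.
    return prev[-1]


print(levenshtein("sam chapman", "sam john chapman"))  # 5
```

The printed value matches the distance quoted above for "sam chapman" and "sam john chapman".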
October 14, 2009 at 4:03 am
I compared Fuzzy Lookup similarity results on 20,000 rows of names against the similarity scores returned by the following metrics: Levenstein, ChapmanLengthDeviation, Jaro, NeedlemanWunch, SmithWaterman and QGramsDistance. The Levenstein metric proved to be the best match, with a maximum difference of 0.25 in the similarity coefficient. I will settle on it for the time being, but I am still not sure whether SQL Server Fuzzy Lookup is based on that metric, possibly in a modified form.
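For anyone wanting to repeat this kind of comparison, here is a small Python sketch. It rests on assumptions: it turns the edit distance into a similarity coefficient as 1 - distance / max(length), which is one common normalisation and not necessarily the one used by the open source function or by Fuzzy Lookup, and the names `levenshtein_similarity` and `max_abs_difference` are hypothetical helpers of my own. The reference scores in the usage lines are placeholders, not real Fuzzy Lookup output.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance (same textbook recurrence as discussed earlier)."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def levenshtein_similarity(s: str, t: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(s, t) / longest


def max_abs_difference(scores_a, scores_b) -> float:
    """Largest absolute gap between two equally long lists of scores."""
    return max(abs(a - b) for a, b in zip(scores_a, scores_b))


# Placeholder data: the pairs and reference scores are made up for illustration.
pairs = [("sam chapman", "sam john chapman"), ("samer", "samer")]
reference_scores = [0.75, 1.0]  # hypothetical scores from another tool
metric_scores = [levenshtein_similarity(a, b) for a, b in pairs]
print(max_abs_difference(metric_scores, reference_scores))
```

Running the same two helpers over each candidate metric and picking the smallest maximum gap is one way to rank the metrics against the Fuzzy Lookup output.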
December 9, 2011 at 5:58 pm
Can you please provide the link to the open source similarity metric function?
Thank you in advance.