Similarity Metrics

smecharrafie

Ten Centuries

Points: 1135
October 13, 2009 at 6:25 am

#215450

I googled an open source function that computes a set of metrics, each returning a similarity coefficient for a pair of strings. Comparing its results with those produced by a package that uses the Fuzzy Lookup transformation, I found that the NeedlemanWunch and SmithWaterman metrics are the closest matches, but because my test data is relatively limited I still can't decide whether to use either of them. My question is: does SQL Server use a specific metric in the Fuzzy Lookup / Fuzzy Grouping transformations?
The metrics available in the mentioned function are the following:
BlockDistance
CosineSimilarity
DiceSimilarity
EuclideanDistance
JaccardSimilarity
MatchingCoefficient
OverlapCoefficient
ChapmanMeanLength
QGramsDistance
Levenstein
MongeElkan
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
NeedlemanWunch
Jaro
JaroWinkler
ChapmanLengthDeviation
Thanks,
Samer.


raronyahmed1 · Grasshopper · Points: 16 · Answer 1

Levenshtein is the basic edit distance function: the distance is simply the minimum total cost of the edit operations needed to transform string1 into string2. The edit operations are as follows:

Copy character from string1 over to string2 (cost 0)

Delete a character in string1 (cost 1)

Insert a character in string2 (cost 1)

Substitute one character for another (cost 1)

D(i,j) = min( D(i-1,j-1) + d(si,tj),   //subst/copy
              D(i-1,j) + 1,             //insert
              D(i,j-1) + 1 )            //delete

Here d(a,b) is the substitution cost function: d(a,b) = 0 if a = b, and 1 otherwise.

There are many extensions to the Levenshtein distance function. Typically these alter the d(a,b) cost function, but further extensions can be made, for instance the Needleman-Wunsch distance, to which Levenshtein is equivalent when the gap cost is 1. When the Levenshtein distance is calculated for the terms "sam chapman" and "sam john chapman", the final distance is given by the bottom-right cell of the distance matrix, i.e. 5. This score indicates that edit operations with a total cost of only 5 are required to match the strings (for example, insertion of the five characters "john ", although a number of other routes through the matrix give the same cost).
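The recurrence above can be sketched in a few lines of Python. This is only a minimal illustration of the textbook algorithm with unit costs, not the code of the open source function discussed in this thread; the name `levenshtein` is chosen here for the example.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum total cost of unit-cost edits that turn s into t."""
    # prev[j] holds D(i-1, j). Row 0 is D(0, j) = j (j insertions).
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        # D(i, 0) = i (i deletions).
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1  # d(si, tj)
            curr.append(min(
                prev[j - 1] + cost,  # substitute or copy
                prev[j] + 1,         # step from D(i-1, j)
                curr[j - 1] + 1,     # step from D(i, j-1)
            ))
        prev = curr
    # The last entry of the last row is the bottom-right cell of the matrix.
    return prev[-1]


print(levenshtein("sam chapman", "sam john chapman"))  # 5
```

Only two rows of the matrix are kept at a time, which is enough when just the final distance is needed.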


smecharrafie · Ten Centuries · Points: 1135 · Answer 2

I compared the Fuzzy Lookup similarity results on 20,000 rows of names with the similarity scores returned by the following metrics: Levenstein, ChapmanLengthDeviation, Jaro, NeedlemanWunch, SmithWaterman and QGramsDistance. The Levenstein metric proved to be the closest match, with a maximum difference of 0.25 in the similarity coefficient. I will settle for this for the time being, but I am still not sure whether SQL Server Fuzzy Lookup is based on that metric, possibly in a modified form.
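For anyone who wants to repeat this kind of comparison, here is a rough Python sketch. It assumes the similarity coefficient is the edit distance normalised by the longer string's length, which is one common convention; whether the open source function or Fuzzy Lookup does exactly this is an assumption. The names `similarity` and `max_abs_difference` are made up for the example.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance between s and t."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(s, t) / longest


def max_abs_difference(scores_a, scores_b) -> float:
    """Largest absolute gap between two equally long lists of scores."""
    return max(abs(a - b) for a, b in zip(scores_a, scores_b))


print(similarity("sam chapman", "sam john chapman"))  # 0.6875
```

Feeding `max_abs_difference` the Fuzzy Lookup scores and the metric's scores for the same rows gives a single worst-case figure of the kind quoted above.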

jihadalrawi · Grasshopper · Points: 17 · Answer 3

Can you please provide the link to the open source similarity metric function?

Thank you in advance.