Fuzzy Search

Question

Fuzzy Search

Bill Talada

SSChampion

Points: 11954
More actions
August 2, 2017 at 10:20 pm

#402388

Comments posted to this topic are about the item Fuzzy Search

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply

Rod Weir Old Hand Points: 353 More actions · Answer 1

Love it. Brilliant work. So simple and effective. I just tried this on one of my databases, and it's a solid performer. I'm thinking of updating the algorithm to account for common transpositions. ie = ei and vice-versa..

David.Poole SSC Guru Points: 76000 More actions · Answer 2

I've tried this on MySQL with similar results.

As a homage to Jeff Modem I created a tally table with as least as many number as the longest string to be compared.

MySQL uses LOCATE rather than CHARINDEX but the principal is the same.

SELECT COUNT( DISTINCT LOCATE(SUBSTRING(@name1,tally_id,2),@name2)) FROM tally_table WHERE tally_id<=LENGTH(@name1);

I then ran the following to make sure that I understood what was going on
SELECT LOCATE(SUBSTRING(@name1,tally_id,2),@name2),SUBSTRING(@name2,tally_id,2),@name1, @name2 FROM tally_table WHERE tally_id<=LENGTH(@name1);

This revealed a gotcha to consider. Non-alphabetic characters crop up in pairs and count towards the score. This can increase the score without improving the match thus suggesting that some string cleansing would be required in the source data or some considerations to be added to the function. The combinations that can inflate the score that I can see are as follows

Punctuation & space

Letter & space

Space &letter

Letter & non-letter (such as in hyphenated names)

Non letter & letter

Double space or indeed any double non-letter

As a quick bodge to get around this I tried changing the score to a fraction of the 2 character matches to 3 character matches.
SELECT CASE trigram_position_match WHEN 0 THEN 0 ELSE trigram_match_count/bigram_match_count END AS Score FROM ( SELECT COUNT(DISTINCT LOCATE(SUBSTRING(@name1,tally_id,2),@name2)) AS bigram_match_count, COUNT(DISTINCT LOCATE(SUBSTRING(@name1,tally_id,3),@name2)) AS trigram_match_count, MAX(LOCATE(SUBSTRING(@name1,tally_id,3),@name2)) AS trigram_position_match FROM tally_table WHERE tally_id<=LENGTH(@name1) ) AS D;

By having the score as a 3 letter match / 2 letter match the score range is from 0 to 1 with 1 being a perfect match.

LinkedIn Profile

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

Tried this method years ago - it's not very accurate. Here's an alternative:

ChrisM@Work - Wednesday, November 19, 2008 5:43 AM
Token matching works quite well... [font="Courier New"]DROP TABLE #Table CREATE TABLE #Table (TheValues VARCHAR (20)) INSERT INTO #Table (TheValues) SELECT '0010' UNION ALL SELECT '11242' UNION ALL SELECT '0011246' UNION ALL SELECT '0011264' UNION ALL SELECT '11284' UNION ALL SELECT '11345' UNION ALL SELECT '585' UNION ALL SELECT '5852' DECLARE @mcode VARCHAR(20) SET @mcode= '5857275' SET @mcode= '00112476' SELECT TOP 1 c.TheValues, (COUNT(*) * 100) / (LEN(@mcode)-2.00) AS MatchLevel FROM [Numbers] n INNER JOIN #Table c ON CHARINDEX(SUBSTRING(c.TheValues, n.number, 3), @mcode) > 0 WHERE n.number <= LEN(@mcode) AND LEN(SUBSTRING(c.TheValues, n.number, 3)) = 3 GROUP BY c.TheValues ORDER BY COUNT(*) DESC [/font] Results: TheValues TokenMatches ---------- ------------ 585 2 0011246 4 Cheers ChrisM

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Matt Algate SSC Enthusiast Points: 170 More actions · Answer 4

I like the "brute-force" approach to this - just whip through the data and count matches.

For organization matches I have a table of common abbreviations (Co., Company, LLC, Inc., Brothers, And Sons etc.) and industry-specific words (Construction, Contractors, Builders, etc.) that I toss when looking for matches.

ONE caveat to the function here - I would add a SUBTRACTION to account for extra characters.
Example:
[James Madison] --> [James Madison] *SHOULD SCORE HIGHER THAN: [James Madison, JR] --> [James Madison]

lauri.pietarinen SSC Veteran Points: 236 More actions · Answer 5

There has actually been some research on this approach, see for instance:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6750&rep=rep1&type=pdf

And Microsoft SSIS uses a similar approach in it's "fuzzy matching" component, see:
https://www.microsoft.com/en-us/research/project/data-cleaning/

Bill Talada SSChampion Points: 11954 More actions · Answer 6

lauri.pietarinen - Thursday, August 3, 2017 5:18 AM
There has actually been some research on this approach, see for instance:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6750&rep=rep1&type=pdf
And Microsoft SSIS uses a similar approach in it's "fuzzy matching" component, see:
https://www.microsoft.com/en-us/research/project/data-cleaning/

From reading those links, I believe those approaches use old procedural logical methodologies to solve the problem. I've tried many "intelligent" approaches and they are inferior. The approach in this article uses what I would call a holistic approach. Notice there is virtually no process to this function. I'm forced to loop because of the language but with the right construct there would be no need for looping and you would just "get" a score.

lauri.pietarinen SSC Veteran Points: 236 More actions · Answer 7

Well, not really procedural, since all those "side tables" can be created using "vanilla" SQL. Your approach may run into problems if you are comparing thousands of values with each other. (1000x1000 = 1M, 1M x 1 sec = a very long time!). So basically the idea is exactly the same, but some "preprocessing" is done in order to use SQL join power to process lot's of rows efficiently.

see:
https://www.dropbox.com/s/v6rn2n4e81sp63a/big_fuzzy.sql?dl=0

and
https://www.dropbox.com/s/7i84qf8p8m13g80/Fuzzy%20Matching%20with%20SQL%201.pptx?dl=0

That being said your solution is certainly very simple and elegant for certain situations!

Lauri

Sergiy SSC Guru Points: 110208 More actions · Answer 8

Did you try it for pairs like
"UCLA" <--> "University of California Los Angeles" ?
What kind of score does it return?

And which pair is a better match:
[James Madison, JR] --> [James Madison, SR]
or
[James Madison, JR] --> [James Madison Junior]
?

_____________
Code for TallyGenerator

Bill Talada SSChampion Points: 11954 More actions · Answer 9

Sergiy - Thursday, August 3, 2017 6:09 AM
Did you try it for pairs like
"UCLA" <--> "University of California Los Angeles" ?
What kind of score does it return?
And which pair is a better match:
[James Madison, JR] --> [James Madison, SR]
or
[James Madison, JR] --> [James Madison Junior]
?

In the first parameter include all known variations so any of them can be found in the second parameter.

SELECT dbo.FuzzyMatchString('UCLA','University of California Los Angeles'); -- returns 0 SELECT dbo.FuzzyMatchString('UCLA University of California Los Angeles','University of California Los Angeles'); -- returns 36

SELECT dbo.FuzzyMatchString('James Madison, JR','James Madison, Junior'); -- returns 15 SELECT dbo.FuzzyMatchString('Mr. Dr. James R. Madison, JR Junior','James Madison, Junior'); -- returns 22

Kimberly Blum SSC Enthusiast Points: 151 More actions · Answer 10

While this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.

Bill Talada SSChampion Points: 11954 More actions · Answer 11

Kimberly Blum - Thursday, August 3, 2017 9:04 AM
While this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.

Yes, it is a fuzzy search and not applicable to equi-join situations otherwise we would just do exact matching. I question any "automated name matching" routine as being 100% reliable. The magic is that the old saying "garbage in, garbage out" is no longer true. If a receptionist types your name in wrong while searching for your previous account, it will still find it, thus, "garbage in, good data out".

Matt Algate SSC Enthusiast Points: 170 More actions · Answer 12

Bill Talada - Thursday, August 3, 2017 10:09 AM
Kimberly Blum - Thursday, August 3, 2017 9:04 AM
While this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.
Yes, it is a fuzzy search and not applicable to equi-join situations otherwise we would just do exact matching. I question any "automated name matching" routine as being 100% reliable. The magic is that the old saying "garbage in, garbage out" is no longer true. If a receptionist types your name in wrong while searching for your previous account, it will still find it, thus, "garbage in, good data out".

I see this as a FABULOUS approach to give the user a quick-and-dirty listing of what MIGHT be what they are looking for on a lookup. Hey...I didn't find EXACTLY what you are looking for (else I return that first) but here are some items that MIGHT be what you are looking for.

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 13

any links to code/data sources to generate this 1 million row "Names" table please?

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Lowell SSC Guru Points: 323518 More actions · Answer 14

this was really neat Bill, I like it;
the thing i wanted was to also see a perfect score alongside the actual score, since longer strings would return larger scores
I slapped together a ITVF version, and my scrores seem to be +1 compared to your score, but i get a bit more detail for me to see
IF OBJECT_ID('[dbo].[FuzzyMatchString_ITVF]') IS NOT NULL DROP FUNCTION [dbo].[FuzzyMatchString_ITVF] GO CREATE FUNCTION dbo.FuzzyMatchString_ITVF (@s1 varchar(100), @s2 varchar(100)) RETURNS TABLE AS RETURN WITH MiniTally AS ( --yet another small tally generator: 256 values select ROW_NUMBER() over (order by num) as n from (values (1)) t (num) group by cube (num, num, num, num, num, num,num,num) --more num = another order power of 2 ) SELECT SUM(CASE WHEN CHARINDEX(SUBSTRING(@s2,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) AS SearchScore, SUM(CASE WHEN CHARINDEX(SUBSTRING(@s1,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) AS PerfectScore, CONVERT(DECIMAL(5,2),100 * ((SUM(CASE WHEN CHARINDEX(SUBSTRING(@s2,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) * 1.0) / (SUM(CASE WHEN CHARINDEX(SUBSTRING(@s1,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) * 1.0))) AS PercentScore, @s1 AS SearchTerm, @s2 AS ItemCompared FROM MiniTally WHERE MiniTally.n<=LEN(@s1) GO

and here's some sample data and the results

;WITH MovieStarNames([ID],[StarFirstName],[StarLastName]) AS ( SELECT '40','Harrison','Ford' UNION ALL SELECT '93','Jason','Robards' UNION ALL SELECT '88','Edward','Norton' UNION ALL SELECT '25','Orson','Welles' UNION ALL SELECT '3','Marlon','Brando' UNION ALL SELECT '61','Michael','Caine' UNION ALL SELECT '56','Denzel','Washington' ) SELECT * , dbo.FuzzyMatchString('Harrison Ford',msn.[StarFirstName] + ' ' + msn.[StarLastName]) AS ScalarVal FROM [MovieStarNames] AS [msn] CROSS APPLY dbo.FuzzyMatchString_ITVF('Harrison Ford',msn.[StarFirstName] + ' ' + msn.[StarLastName]) fn

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!