August 2, 2017 at 10:20 pm
Comments posted to this topic are about the item Fuzzy Search
August 2, 2017 at 11:09 pm
Love it. Brilliant work. So simple and effective. I just tried this on one of my databases, and it's a solid performer. I'm thinking of updating the algorithm to account for common transpositions. ie = ei and vice-versa..
August 3, 2017 at 2:10 am
I've tried this on MySQL with similar results.
As a homage to Jeff Modem I created a tally table with as least as many number as the longest string to be compared.
MySQL uses LOCATE rather than CHARINDEX but the principal is the same.
SELECT COUNT( DISTINCT LOCATE(SUBSTRING(@name1,tally_id,2),@name2))
FROM tally_table
WHERE tally_id<=LENGTH(@name1);
I then ran the following to make sure that I understood what was going onSELECT LOCATE(SUBSTRING(@name1,tally_id,2),@name2),SUBSTRING(@name2,tally_id,2),@name1, @name2
FROM tally_table
WHERE tally_id<=LENGTH(@name1);
This revealed a gotcha to consider. Non-alphabetic characters crop up in pairs and count towards the score. This can increase the score without improving the match thus suggesting that some string cleansing would be required in the source data or some considerations to be added to the function. The combinations that can inflate the score that I can see are as follows
As a quick bodge to get around this I tried changing the score to a fraction of the 2 character matches to 3 character matches.SELECT
CASE trigram_position_match
WHEN 0 THEN 0
ELSE trigram_match_count/bigram_match_count END AS Score
FROM (
SELECT
COUNT(DISTINCT LOCATE(SUBSTRING(@name1,tally_id,2),@name2)) AS bigram_match_count,
COUNT(DISTINCT LOCATE(SUBSTRING(@name1,tally_id,3),@name2)) AS trigram_match_count,
MAX(LOCATE(SUBSTRING(@name1,tally_id,3),@name2)) AS trigram_position_match
FROM tally_table
WHERE tally_id<=LENGTH(@name1)
) AS D;
By having the score as a 3 letter match / 2 letter match the score range is from 0 to 1 with 1 being a perfect match.
August 3, 2017 at 3:02 am
Tried this method years ago - it's not very accurate. Here's an alternative:
ChrisM@Work - Wednesday, November 19, 2008 5:43 AMToken matching works quite well...[font="Courier New"]DROP TABLE #Table CREATE TABLE #Table (TheValues VARCHAR (20)) INSERT INTO #Table (TheValues) SELECT '0010' UNION ALL SELECT '11242' UNION ALL SELECT '0011246' UNION ALL SELECT '0011264' UNION ALL SELECT '11284' UNION ALL SELECT '11345' UNION ALL SELECT '585' UNION ALL SELECT '5852' DECLARE @mcode VARCHAR(20) SET @mcode= '5857275' SET @mcode= '00112476' SELECT TOP 1 c.TheValues, (COUNT(*) * 100) / (LEN(@mcode)-2.00) AS MatchLevel FROM [Numbers] n INNER JOIN #Table c ON CHARINDEX(SUBSTRING(c.TheValues, n.number, 3), @mcode) > 0 WHERE n.number <= LEN(@mcode) AND LEN(SUBSTRING(c.TheValues, n.number, 3)) = 3 GROUP BY c.TheValues ORDER BY COUNT(*) DESC [/font]
Results:TheValues TokenMatches ---------- ------------ 585 2 0011246 4
Cheers ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 3, 2017 at 3:40 am
I like the "brute-force" approach to this - just whip through the data and count matches.
For organization matches I have a table of common abbreviations (Co., Company, LLC, Inc., Brothers, And Sons etc.) and industry-specific words (Construction, Contractors, Builders, etc.) that I toss when looking for matches.
ONE caveat to the function here - I would add a SUBTRACTION to account for extra characters.
Example:
[James Madison] --> [James Madison] *SHOULD SCORE HIGHER THAN: [James Madison, JR] --> [James Madison]
August 3, 2017 at 5:18 am
There has actually been some research on this approach, see for instance:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6750&rep=rep1&type=pdf
And Microsoft SSIS uses a similar approach in it's "fuzzy matching" component, see:
https://www.microsoft.com/en-us/research/project/data-cleaning/
August 3, 2017 at 5:31 am
lauri.pietarinen - Thursday, August 3, 2017 5:18 AMThere has actually been some research on this approach, see for instance:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6750&rep=rep1&type=pdfAnd Microsoft SSIS uses a similar approach in it's "fuzzy matching" component, see:
https://www.microsoft.com/en-us/research/project/data-cleaning/
From reading those links, I believe those approaches use old procedural logical methodologies to solve the problem. I've tried many "intelligent" approaches and they are inferior. The approach in this article uses what I would call a holistic approach. Notice there is virtually no process to this function. I'm forced to loop because of the language but with the right construct there would be no need for looping and you would just "get" a score.
August 3, 2017 at 5:56 am
Well, not really procedural, since all those "side tables" can be created using "vanilla" SQL. Your approach may run into problems if you are comparing thousands of values with each other. (1000x1000 = 1M, 1M x 1 sec = a very long time!). So basically the idea is exactly the same, but some "preprocessing" is done in order to use SQL join power to process lot's of rows efficiently.
see:
https://www.dropbox.com/s/v6rn2n4e81sp63a/big_fuzzy.sql?dl=0
and
https://www.dropbox.com/s/7i84qf8p8m13g80/Fuzzy%20Matching%20with%20SQL%201.pptx?dl=0
That being said your solution is certainly very simple and elegant for certain situations!
Lauri
August 3, 2017 at 6:09 am
Did you try it for pairs like
"UCLA" <--> "University of California Los Angeles" ?
What kind of score does it return?
And which pair is a better match:
[James Madison, JR] --> [James Madison, SR]
or
[James Madison, JR] --> [James Madison Junior]
?
_____________
Code for TallyGenerator
August 3, 2017 at 6:22 am
Sergiy - Thursday, August 3, 2017 6:09 AMDid you try it for pairs like
"UCLA" <--> "University of California Los Angeles" ?
What kind of score does it return?And which pair is a better match:
[James Madison, JR] --> [James Madison, SR]
or
[James Madison, JR] --> [James Madison Junior]
?
In the first parameter include all known variations so any of them can be found in the second parameter.
SELECT dbo.FuzzyMatchString('UCLA','University of California Los Angeles'); -- returns 0
SELECT dbo.FuzzyMatchString('UCLA University of California Los Angeles','University of California Los Angeles'); -- returns 36
SELECT dbo.FuzzyMatchString('James Madison, JR','James Madison, Junior'); -- returns 15
SELECT dbo.FuzzyMatchString('Mr. Dr. James R. Madison, JR Junior','James Madison, Junior'); -- returns 22
August 3, 2017 at 9:04 am
While this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.
August 3, 2017 at 10:09 am
Kimberly Blum - Thursday, August 3, 2017 9:04 AMWhile this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.
Yes, it is a fuzzy search and not applicable to equi-join situations otherwise we would just do exact matching. I question any "automated name matching" routine as being 100% reliable. The magic is that the old saying "garbage in, garbage out" is no longer true. If a receptionist types your name in wrong while searching for your previous account, it will still find it, thus, "garbage in, good data out".
August 3, 2017 at 10:31 am
Bill Talada - Thursday, August 3, 2017 10:09 AMKimberly Blum - Thursday, August 3, 2017 9:04 AMWhile this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.Yes, it is a fuzzy search and not applicable to equi-join situations otherwise we would just do exact matching. I question any "automated name matching" routine as being 100% reliable. The magic is that the old saying "garbage in, garbage out" is no longer true. If a receptionist types your name in wrong while searching for your previous account, it will still find it, thus, "garbage in, good data out".
I see this as a FABULOUS approach to give the user a quick-and-dirty listing of what MIGHT be what they are looking for on a lookup. Hey...I didn't find EXACTLY what you are looking for (else I return that first) but here are some items that MIGHT be what you are looking for.
August 3, 2017 at 11:01 am
any links to code/data sources to generate this 1 million row "Names" table please?
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
August 3, 2017 at 11:24 am
this was really neat Bill, I like it;
the thing i wanted was to also see a perfect score alongside the actual score, since longer strings would return larger scores
I slapped together a ITVF version, and my scrores seem to be +1 compared to your score, but i get a bit more detail for me to seeIF OBJECT_ID('[dbo].[FuzzyMatchString_ITVF]') IS NOT NULL
DROP FUNCTION [dbo].[FuzzyMatchString_ITVF]
GO
CREATE FUNCTION dbo.FuzzyMatchString_ITVF (@s1 varchar(100), @s2 varchar(100))
RETURNS TABLE
AS
RETURN
WITH MiniTally
AS
(
--yet another small tally generator: 256 values
select
ROW_NUMBER() over (order by num) as n
from (values (1)) t (num)
group by cube (num, num, num, num, num, num,num,num)
--more num = another order power of 2
)
SELECT
SUM(CASE WHEN CHARINDEX(SUBSTRING(@s2,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) AS SearchScore,
SUM(CASE WHEN CHARINDEX(SUBSTRING(@s1,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) AS PerfectScore,
CONVERT(DECIMAL(5,2),100 * ((SUM(CASE WHEN CHARINDEX(SUBSTRING(@s2,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) * 1.0) / (SUM(CASE WHEN CHARINDEX(SUBSTRING(@s1,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) * 1.0))) AS PercentScore,
@s1 AS SearchTerm,
@s2 AS ItemCompared
FROM MiniTally
WHERE MiniTally.n<=LEN(@s1)
GO
and here's some sample data and the results
;WITH MovieStarNames([ID],[StarFirstName],[StarLastName])
AS
(
SELECT '40','Harrison','Ford' UNION ALL
SELECT '93','Jason','Robards' UNION ALL
SELECT '88','Edward','Norton' UNION ALL
SELECT '25','Orson','Welles' UNION ALL
SELECT '3','Marlon','Brando' UNION ALL
SELECT '61','Michael','Caine' UNION ALL
SELECT '56','Denzel','Washington'
)
SELECT * , dbo.FuzzyMatchString('Harrison Ford',msn.[StarFirstName] + ' ' + msn.[StarLastName]) AS ScalarVal
FROM [MovieStarNames] AS [msn]
CROSS APPLY dbo.FuzzyMatchString_ITVF('Harrison Ford',msn.[StarFirstName] + ' ' + msn.[StarLastName]) fn
Lowell
Viewing 15 posts - 1 through 15 (of 33 total)
You must be logged in to reply to this topic. Login to reply