Fuzzy match based on T-SQL only

  • I have Table1.ColA that contains a Varchar(50)  sting that I need to fuzzy-lookup against Table2.ColA.
    And insert the fuzzily-matched value Into Table1.ColA_FuzzyMatched column that originally contains Nulls only. All 3 columns are Varchar(50).

    I have never done this before, honestly. I am sure there are some better practices to do this that other practices.. There is a task to do this in SSIS but a lot of our desirable fuzzy lookups are not to be initiated via SSIS.

    Is there a good T-SQL /SP  out there that someone is aware of of that does this? 

    And/or is there may be a function(ality) in SQL 2016 and 2017 that had not existed before and now is providing additional help for  achieving such fuzzy goals as fuzzy lookups?
    Thank you!

  • The "Levenshtein Distance Algorithm" seems to be one of the most accurate.  Here's one link and, of course, Google is loaded for bear on this subject.

    https://www.red-gate.com/simple-talk/sql/t-sql-programming/fuzzy-searches-sql-server/

    https://www.google.com/search?client=firefox-b-1-ab&q=levenshtein+distance+sql

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Here are two functions - the first is a very close approximation of the Levenshtein Distance Algorithm, the second is a stripped-down slicked-back twin carburetor version which in practice produces very similar results.

    CREATE FUNCTION [dbo].[FuzzyMatch_iTVF2k5]
       (@Reference VARCHAR(100) = NULL,
       @Target VARCHAR(100) = NULL)
    RETURNS table WITH SCHEMABINDING
    AS
    -- Chris Morris 2012
    -- Fuzzy-matching using shifted character positions
    -- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

    RETURN
    SELECT
       d.Method,
       MatchRatio = CAST(CASE
          WHEN d.Method = 1 THEN 100
          WHEN d.Method = 3 THEN LenTarget*100.00/LenReference
          WHEN d.Method = 4 THEN LenReference*100.00/LenTarget
          WHEN d.Method = 5 THEN
             (
             SELECT
                MatchPC = (100.00 * ISNULL(NULLIF(SUM(
                      CASE WHEN Tally.n < PosInTarget THEN Tally.n/PosInTarget ELSE PosInTarget/Tally.n END
                               ),0)+2.00,0) / LenReference)
                      * CASE WHEN LenTarget > LenReference THEN LenReference/LenTarget ELSE 1.00 END   
             FROM ( -- Tally
                SELECT TOP (CAST(LenReference AS INT)-2) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                FROM (SELECT n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)) a,
                (SELECT n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)) b
                ) Tally
             CROSS APPLY (SELECT PosInTarget = CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target)) x
             )
          WHEN d.Method = 6 THEN       
             (
             SELECT
                MatchPC = (100.00 * ISNULL(NULLIF(SUM(
                      CASE WHEN Tally.n < PosInTarget THEN Tally.n/PosInTarget ELSE PosInTarget/Tally.n END
                               ),0)+1.00,0) / LenReference) 
                      * CASE WHEN LenTarget > LenReference THEN LenReference/LenTarget ELSE 1.00 END
             FROM ( -- Tally
                SELECT TOP (CAST(LenReference AS INT)-1) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) 
                FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)
             ) Tally
             CROSS APPLY (SELECT PosInTarget = CAST(CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) AS DECIMAL(5,2))) x
             )
          ELSE NULL     
          END AS DECIMAL(5,2))
         
    FROM (
       SELECT Method = CASE
          WHEN @Reference = @Target THEN 1
          WHEN @Reference IS NULL OR @Target IS NULL THEN 2
          WHEN @Reference LIKE '%'+@Target+'%' THEN 3
          WHEN @Target LIKE '%'+@Reference+'%' THEN 4
          WHEN DATALENGTH(@Reference) >= 7 AND DATALENGTH(@Target) >= 7 THEN 5
          WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6
          ELSE 7     
          END,
       LenTarget = CAST(DATALENGTH(@Target) AS DECIMAL(5,2)),
       LenReference = CAST(DATALENGTH(@Reference) AS DECIMAL(5,2))
    ) d  

    -----------------------------------------------------------------
    CREATE FUNCTION [dbo].[FuzzyMatch_iTVFV2]
       (@Reference VARCHAR(100) = NULL,
       @Target VARCHAR(100) = NULL)
    RETURNS table WITH SCHEMABINDING
    AS
    -- Chris Morris 2012
    -- Fuzzy-matching
    -- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf
    RETURN

    WITH Tally AS (
       SELECT TOP (DATALENGTH(@Reference)) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
       FROM (SELECT n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)) a,
       (SELECT n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)) b
    )
    SELECT MatchRatio = SUM(Matches)/DATALENGTH(@Reference) 
    FROM Tally
    CROSS APPLY (SELECT TestLetter = SUBSTRING(@Reference, Tally.n, 1)) x
    CROSS APPLY (
       SELECT -- start search one character position *before* n to catch switched letters
          PosInReference = CAST(CHARINDEX(TestLetter, @Reference, n-1) AS DECIMAL(5,2)),
          PosInTarget = CAST(CHARINDEX(TestLetter, @Target, n-1) AS DECIMAL(5,2))
    ) z
    CROSS APPLY (
       SELECT Matches = CASE WHEN PosInReference > PosInTarget
          THEN PosInTarget/PosInReference ELSE PosInReference/PosInTarget END
    ) m

    --------------------------------------------------------------------------------------------------------------------
    Here's an example of usage:

    ;WITH
     SampleData1 AS (SELECT * FROM (VALUES ('Victor Proteus'), ('Victor Protus'), ('Victor ProteEus')) d (String1)),
     SampleData2 AS (SELECT * FROM (VALUES ('Victor Proteus'), ('Victor Protes'), ('Proteus Victor')) d (String2))
    SELECT *
    FROM SampleData1 d1
    CROSS JOIN SampleData2 d2
    CROSS APPLY dbo.FuzzyMatch_iTVFV2 (d1.String1, d2.String2) m
    WHERE d1.String1 > d2.String2
     AND m.MatchRatio > 0.4
    ORDER BY d1.String1, d2.String2

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply