Levenshtein Distance Algorithm

  • http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540

    The above URL giving clear picture about Levenshtein algorithm between two strings and it seems to be working fine as expected. How do I apply this into one table? My requirement here is, I want to show matched strings from Name Column in my DB table. Let's assume my DB table has 10 rows and should be displayed as follows..

    ABCD

    AB_B

    QWE_

    QWER

    PQRSWE

    P__S_W_

    AS23_R

    AS_3T_

    Is there any possiblity to apply this algorithm for DB table?

  • You could apply it to a table like so:

    USE tempdb

    GO

    CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))

    RETURNS int

    AS

    BEGIN

    DECLARE @s1_len int, @s2_len int, @i int, @j-2 int, @s1_char nchar, @C int, @c_temp int,

    @cv0 varbinary(8000), @cv1 varbinary(8000)

    SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j-2 = 1, @i = 1, @C = 0

    WHILE @j-2 <= @s2_len

    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j-2 = @j-2 + 1

    WHILE @i <= @s1_len

    BEGIN

    SELECT @s1_char = SUBSTRING(@s1, @i, 1), @C = @i, @cv0 = CAST(@i AS binary(2)), @j-2 = 1

    WHILE @j-2 <= @s2_len

    BEGIN

    SET @C = @C + 1

    SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j-1, 2) AS int) +

    CASE WHEN @s1_char = SUBSTRING(@s2, @j-2, 1) THEN 0 ELSE 1 END

    IF @C > @c_temp SET @C = @c_temp

    SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j+1, 2) AS int)+1

    IF @C > @c_temp SET @C = @c_temp

    SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j-2 = @j-2 + 1

    END

    SELECT @cv1 = @cv0, @i = @i + 1

    END

    RETURN @C

    END

    GO

    DECLARE @strings TABLE (stID int identity, s1 varchar(8000), s2 varchar(8000));

    INSERT @strings(s1, s2) VALUES

    ('ABCD', 'AB_B'),

    ('QWE','QWER'),

    ('PQRSWE','P__S_W_'),

    ('AS23_R','AS_3T_');

    SELECT s1, s2, ED = dbo.edit_distance(s1,s2)

    FROM @strings;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • I put together a real-life example of where the Levenshtein Distance is used.

    Say you had a SQL Job or ETL package that imported data into a data warehouse or data mart. You have a lookup table and only values that had a match in that table could be imported. Values that are not matched are inserted into a table for unmatched values for further processing either by some Data Quality Services (DQS) software or to be looked at by a real person....

    Now let's say we wanted to provide that software or person with the closest possible match (that's what Levenshtein is all about). You could do something like this:

    -- Sample Data

    DECLARE @CityLookup TABLE(city varchar(8000));

    DECLARE @UnmatchedValues TABLE (city varchar(8000));

    INSERT @CityLookup VALUES ('Chicago'),('Detroit'),('Mexico City');

    INSERT @UnmatchedValues VALUES ('Chicaggo'),('Detrot'),('MexicCity');

    -- Finding the "closest match"

    WITH BestGuess AS

    (

    SELECT

    UnmatchedValue = unmatched.city,

    ClosestMatch = c.city,

    ED = (dbo.edit_distance(unmatched.city, c.city)),

    LevRank = RANK() OVER

    (PARTITION BY unmatched.city

    ORDER BY (dbo.edit_distance(unmatched.city, c.city)))

    FROM

    (

    SELECT i.city

    FROM @UnmatchedValues i

    WHERE NOT EXISTS (SELECT 1 FROM @CityLookup c WHERE i.city=c.city)

    ) unmatched

    CROSS JOIN @CityLookup c

    )

    SELECT UnmatchedValue, ClosestMatch, ED

    FROM BestGuess

    WHERE LevRank = 1;

    Examining the above code you can see that for "Chicaggo" the closest possible match is Chicago, for "Detrot" - Detroit, for "MexicCity" - Mexico City.... I hope that makes sense, if not let us know.

    Note that the Levenshtein function you posted is actually very slow (note the comments on that page). There is a guy who developed a much faster version here: http://randomrumenations.blogspot.com/2009/06/improved-t-sql-levenshtein-distance.html. Lastly - the Damerau-Levenshtein Distance is a more accurate algorithm but it's going to be slower. You can learn more about that here: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/[/url].

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply