Levenshtein Distance Algorithm

Question

Post reply

Levenshtein Distance Algorithm

Whizkid

Right there with Babe

Points: 783
More actions
June 15, 2015 at 7:55 pm

#301532

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540
The above URL giving clear picture about Levenshtein algorithm between two strings and it seems to be working fine as expected. How do I apply this into one table? My requirement here is, I want to show matched strings from Name Column in my DB table. Let's assume my DB table has 10 rows and should be displayed as follows..
ABCD
AB_B
QWE_
QWER
PQRSWE
P__S_W_
AS23_R
AS_3T_
Is there any possiblity to apply this algorithm for DB table?

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply

Alan Burstein SSC Guru Points: 61141 More actions · Answer 1

You could apply it to a table like so:

USE tempdb

GO

CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))

RETURNS int

AS

BEGIN

DECLARE @s1_len int, @s2_len int, @i int, @j-2 int, @s1_char nchar, @C int, @c_temp int,

@cv0 varbinary(8000), @cv1 varbinary(8000)

SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j-2 = 1, @i = 1, @C = 0

WHILE @j-2 <= @s2_len

SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j-2 = @j-2 + 1

WHILE @i <= @s1_len

BEGIN

SELECT @s1_char = SUBSTRING(@s1, @i, 1), @C = @i, @cv0 = CAST(@i AS binary(2)), @j-2 = 1

WHILE @j-2 <= @s2_len

BEGIN

SET @C = @C + 1

SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j-1, 2) AS int) +

CASE WHEN @s1_char = SUBSTRING(@s2, @j-2, 1) THEN 0 ELSE 1 END

IF @C > @c_temp SET @C = @c_temp

SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j+1, 2) AS int)+1

IF @C > @c_temp SET @C = @c_temp

SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j-2 = @j-2 + 1

END

SELECT @cv1 = @cv0, @i = @i + 1

END

RETURN @C

END

GO

DECLARE @strings TABLE (stID int identity, s1 varchar(8000), s2 varchar(8000));

INSERT @strings(s1, s2) VALUES

('ABCD', 'AB_B'),

('QWE','QWER'),

('PQRSWE','P__S_W_'),

('AS23_R','AS_3T_');

SELECT s1, s2, ED = dbo.edit_distance(s1,s2)

FROM @strings;

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Alan Burstein SSC Guru Points: 61141 More actions · Answer 2

I put together a real-life example of where the Levenshtein Distance is used.

Say you had a SQL Job or ETL package that imported data into a data warehouse or data mart. You have a lookup table and only values that had a match in that table could be imported. Values that are not matched are inserted into a table for unmatched values for further processing either by some Data Quality Services (DQS) software or to be looked at by a real person....

Now let's say we wanted to provide that software or person with the closest possible match (that's what Levenshtein is all about). You could do something like this:

-- Sample Data

DECLARE @CityLookup TABLE(city varchar(8000));

DECLARE @UnmatchedValues TABLE (city varchar(8000));

INSERT @CityLookup VALUES ('Chicago'),('Detroit'),('Mexico City');

INSERT @UnmatchedValues VALUES ('Chicaggo'),('Detrot'),('MexicCity');

-- Finding the "closest match"

WITH BestGuess AS

(

SELECT

UnmatchedValue = unmatched.city,

ClosestMatch = c.city,

ED = (dbo.edit_distance(unmatched.city, c.city)),

LevRank = RANK() OVER

(PARTITION BY unmatched.city

ORDER BY (dbo.edit_distance(unmatched.city, c.city)))

FROM

(

SELECT i.city

FROM @UnmatchedValues i

WHERE NOT EXISTS (SELECT 1 FROM @CityLookup c WHERE i.city=c.city)

) unmatched

CROSS JOIN @CityLookup c

)

SELECT UnmatchedValue, ClosestMatch, ED

FROM BestGuess

WHERE LevRank = 1;

Examining the above code you can see that for "Chicaggo" the closest possible match is Chicago, for "Detrot" - Detroit, for "MexicCity" - Mexico City.... I hope that makes sense, if not let us know.

Note that the Levenshtein function you posted is actually very slow (note the comments on that page). There is a guy who developed a much faster version here: http://randomrumenations.blogspot.com/2009/06/improved-t-sql-levenshtein-distance.html. Lastly - the Damerau-Levenshtein Distance is a more accurate algorithm but it's going to be slower. You can learn more about that here: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/[/url].

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001