June 15, 2015 at 7:55 pm
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540
The above URL giving clear picture about Levenshtein algorithm between two strings and it seems to be working fine as expected. How do I apply this into one table? My requirement here is, I want to show matched strings from Name Column in my DB table. Let's assume my DB table has 10 rows and should be displayed as follows..
ABCD
AB_B
QWE_
QWER
PQRSWE
P__S_W_
AS23_R
AS_3T_
Is there any possiblity to apply this algorithm for DB table?
June 15, 2015 at 8:57 pm
You could apply it to a table like so:
USE tempdb
GO
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int, @i int, @j-2 int, @s1_char nchar, @C int, @c_temp int,
@cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j-2 = 1, @i = 1, @C = 0
WHILE @j-2 <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j-2 = @j-2 + 1
WHILE @i <= @s1_len
BEGIN
SELECT @s1_char = SUBSTRING(@s1, @i, 1), @C = @i, @cv0 = CAST(@i AS binary(2)), @j-2 = 1
WHILE @j-2 <= @s2_len
BEGIN
SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j-2, 1) THEN 0 ELSE 1 END
IF @C > @c_temp SET @C = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j+1, 2) AS int)+1
IF @C > @c_temp SET @C = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j-2 = @j-2 + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @C
END
GO
DECLARE @strings TABLE (stID int identity, s1 varchar(8000), s2 varchar(8000));
INSERT @strings(s1, s2) VALUES
('ABCD', 'AB_B'),
('QWE','QWER'),
('PQRSWE','P__S_W_'),
('AS23_R','AS_3T_');
SELECT s1, s2, ED = dbo.edit_distance(s1,s2)
FROM @strings;
-- Itzik Ben-Gan 2001
June 15, 2015 at 9:47 pm
I put together a real-life example of where the Levenshtein Distance is used.
Say you had a SQL Job or ETL package that imported data into a data warehouse or data mart. You have a lookup table and only values that had a match in that table could be imported. Values that are not matched are inserted into a table for unmatched values for further processing either by some Data Quality Services (DQS) software or to be looked at by a real person....
Now let's say we wanted to provide that software or person with the closest possible match (that's what Levenshtein is all about). You could do something like this:
-- Sample Data
DECLARE @CityLookup TABLE(city varchar(8000));
DECLARE @UnmatchedValues TABLE (city varchar(8000));
INSERT @CityLookup VALUES ('Chicago'),('Detroit'),('Mexico City');
INSERT @UnmatchedValues VALUES ('Chicaggo'),('Detrot'),('MexicCity');
-- Finding the "closest match"
WITH BestGuess AS
(
SELECT
UnmatchedValue = unmatched.city,
ClosestMatch = c.city,
ED = (dbo.edit_distance(unmatched.city, c.city)),
LevRank = RANK() OVER
(PARTITION BY unmatched.city
ORDER BY (dbo.edit_distance(unmatched.city, c.city)))
FROM
(
SELECT i.city
FROM @UnmatchedValues i
WHERE NOT EXISTS (SELECT 1 FROM @CityLookup c WHERE i.city=c.city)
) unmatched
CROSS JOIN @CityLookup c
)
SELECT UnmatchedValue, ClosestMatch, ED
FROM BestGuess
WHERE LevRank = 1;
Examining the above code you can see that for "Chicaggo" the closest possible match is Chicago, for "Detrot" - Detroit, for "MexicCity" - Mexico City.... I hope that makes sense, if not let us know.
Note that the Levenshtein function you posted is actually very slow (note the comments on that page). There is a guy who developed a much faster version here: http://randomrumenations.blogspot.com/2009/06/improved-t-sql-levenshtein-distance.html. Lastly - the Damerau-Levenshtein Distance is a more accurate algorithm but it's going to be slower. You can learn more about that here: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/[/url].
-- Itzik Ben-Gan 2001
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply