January 11, 2013 at 2:22 pm
dwain.c
I for one have never heard of this famous "Levenshtein Edit Distance" problem, but you can rest assured that now I'm going to have to take a look at it as well!
I guess you're really busy Dwain (is that the Jeopardy theme in the background?), since you've not come back with anything yet.
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
January 11, 2013 at 2:50 pm
Alan.B
I came up with this:
-- strings to compare
DECLARE@s1 varchar(8000)='diner',
@s2 varchar(8000)='dinerr';
DECLARE @Ld int=ABS(LEN(@s1)-LEN(@s2));
IF ((@s1=@s2) OR ((ISNULL(LEN(@s1)*LEN(@s2),0)=0))) BEGIN GOTO LD END;
DECLARE@minlen int=CASE WHEN LEN(@s1)>LEN(@s2) THEN LEN(@s2) ELSE LEN(@s1) END;
;WITH
nums(n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM [master].dbo.spt_values Tally),
matrix AS (SELECT SUBSTRING(@s1,n,1) s1, SUBSTRING(@s2,n,1) s2 FROM nums WHERE n<=@minlen)
SELECT @Ld+=COUNT(*) FROM matrix WHERE s1<>s2;
LD:
SELECT @Ld AS LD;
Alan, I had never heard of this either. If the Wiki description Dwain posted is to be trusted, I think you might want to re-visit this. For example, for the two strings you provided, 'diner' and 'dinerr', the output is 1. This makes sense, because all you have to do is delete the second 'r' in string 2 (or the first 'r' for that matter), and you end up with the same two strings. Next, let's say I change the 'd' in string 1 to 'a'. The code returns 2, which I think is correct, since all I have to do is substitute 'a' in string 1 to 'd', then delete one of the 'r' in string 2, and I have two of the same string. Now, consider this: let's say I insert the 'a' in string 1, but leave the rest intact, leaving me with 'adiner' for string 1, and I leave string 2 as 'dinerr'. The code returns a value of 5. This does not make sense, again, if Wiki is to be trusted, as it states the allowable actions are single character insertions, deletions, and substitutions. So, in order to make 'adiner' = 'dinerr', I only need two actions: delete the 'a' in 'adiner', and delete one of the 'r' in 'dinerr'. Does that make sense, or am I missing something? (the latter is entirely possible.) (maybe even likely).
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
January 11, 2013 at 7:02 pm
Greg Snidow (1/11/2013)
dwain.c
I for one have never heard of this famous "Levenshtein Edit Distance" problem, but you can rest assured that now I'm going to have to take a look at it as well!
I guess you're really busy Dwain (is that the Jeopardy theme in the background?), since you've not come back with anything yet.
You betcha! That crazy Vertex Covers puzzle has me losing sleep! :w00t:
Looks like you may have beaten me to it.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
January 12, 2013 at 4:22 am
dwain.c
Looks like you may have beaten me to it.
Oh, don't worry about that, Dwain. Admittedly, I did look at it thinking it was only a minor tweaking of Alan's code, but I think it may be more complicated than that when you really ponder on it.
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
January 14, 2013 at 9:11 am
Greg Snidow (1/11/2013)
Alan.B
I came up with this:
-- strings to compare
DECLARE@s1 varchar(8000)='diner',
@s2 varchar(8000)='dinerr';
DECLARE @Ld int=ABS(LEN(@s1)-LEN(@s2));
IF ((@s1=@s2) OR ((ISNULL(LEN(@s1)*LEN(@s2),0)=0))) BEGIN GOTO LD END;
DECLARE@minlen int=CASE WHEN LEN(@s1)>LEN(@s2) THEN LEN(@s2) ELSE LEN(@s1) END;
;WITH
nums(n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM [master].dbo.spt_values Tally),
matrix AS (SELECT SUBSTRING(@s1,n,1) s1, SUBSTRING(@s2,n,1) s2 FROM nums WHERE n<=@minlen)
SELECT @Ld+=COUNT(*) FROM matrix WHERE s1<>s2;
LD:
SELECT @Ld AS LD;
Alan, I had never heard of this either. If the Wiki description Dwain posted is to be trusted, I think you might want to re-visit this. For example, for the two strings you provided, 'diner' and 'dinerr', the output is 1. This makes sense, because all you have to do is delete the second 'r' in string 2 (or the first 'r' for that matter), and you end up with the same two strings. Next, let's say I change the 'd' in string 1 to 'a'. The code returns 2, which I think is correct, since all I have to do is substitute 'a' in string 1 to 'd', then delete one of the 'r' in string 2, and I have two of the same string. Now, consider this: let's say I insert the 'a' in string 1, but leave the rest intact, leaving me with 'adiner' for string 1, and I leave string 2 as 'dinerr'. The code returns a value of 5. This does not make sense, again, if Wiki is to be trusted, as it states the allowable actions are single character insertions, deletions, and substitutions. So, in order to make 'adiner' = 'dinerr', I only need two actions: delete the 'a' in 'adiner', and delete one of the 'r' in 'dinerr'. Does that make sense, or am I missing something? (the latter is entirely possible.) (maybe even likely).
Sorry for checking back in late...
Levenshtrein was (is) new to me too. I did a little research and, yes, you are correct - I did not get it correct. Not only that, I was not able to get it correct (without a loop) after a lot of effort. That Levenshtein business is a little more complicated than I thought. I am going to keep at it and I'm dying to see what Dwain puts together.
Nonetheless - I was successful at creating a purely set-based version of what I thought the would produce the Levenshtein Distance; and did so using a tally table :). The goal was to improve my tally table skills more than anything.
If you feed my function 2 strings of equally length it will provide you with the Hamming Distance (not as big of a deal).
-- Itzik Ben-Gan 2001
January 14, 2013 at 9:36 am
Jeff Moden (1/10/2013)
Alan.B (1/10/2013)
This morning before work for example, after a lot of effort, I finally figured out how to get the Levenshtein Edit Distance between 2 strings without a loop (just for fun because it's something I've never seen done).
Heh... if that's how you warm up for the day, then you've got me beat by a mile.
Perhaps, if I got it right ;). I did not get it correct but I certainly found a more interesting SQL excercize than Fizzbuzz. I'm going to keep trying to come up with a 100% set-based Levenshtein unless someone beats me to it.
-- Itzik Ben-Gan 2001
January 14, 2013 at 9:55 am
Alan.B
That Levenshtein business is a little more complicated than I thought. I am going to keep at it and I'm dying to see what Dwain puts together.
I kind of thought it would be complicated. Think about this. My last semester in grad school I took a Python class just for fun, and all the work was done on unix machines. The professor had this text editor that would open two files, and display the number of differences between the two files. You could then type some commands, and it would highlight the differences. Pretty nifty for comparing script files. I never really thought about how it worked until I saw your post. Maybe that is like my discovering that the sun comes up in the east, but it was new to me. I'm eager to see what you come up with.
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
January 14, 2013 at 6:20 pm
Alan.B (1/14/2013)
That Levenshtein business is a little more complicated than I thought. I am going to keep at it and I'm dying to see what Dwain puts together.
Good grief! Now I suppose I'm going to need to put up or shut up.
Unfortunately, for the last couple of days I've been plagued by problems on my laptop (unexpected shutdowns that occur without warning). So I haven't been able to get much of anything done.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
January 14, 2013 at 8:11 pm
dwain.c (1/14/2013)
Good grief! Now I suppose I'm going to need to put up or shut up.
Heh... glad it's not me this time. π
--Jeff Moden
Change is inevitable... Change for the better is not.
January 14, 2013 at 8:33 pm
Jeff Moden (1/14/2013)
dwain.c (1/14/2013)
Good grief! Now I suppose I'm going to need to put up or shut up.Heh... glad it's not me this time. π
Maybe I don't need to because it's already been done.
http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
Seems I'd forgotten about this article, even though I believe I posted something to its discussion thread.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
January 15, 2013 at 6:29 am
Alan.B (1/14/2013)
Jeff Moden (1/10/2013)
Alan.B (1/10/2013)
This morning before work for example, after a lot of effort, I finally figured out how to get the Levenshtein Edit Distance between 2 strings without a loop (just for fun because it's something I've never seen done).
Heh... if that's how you warm up for the day, then you've got me beat by a mile.
Perhaps, if I got it right ;). I did not get it correct but I certainly found a more interesting SQL excercize than Fizzbuzz. I'm going to keep trying to come up with a 100% set-based Levenshtein unless someone beats me to it.
Not yet - at least, not in half an hour over lunch. However, I don't think this is far off and is easily modified:
DECLARE @Reference VARCHAR(100) = 'maltesers',
@Target VARCHAR(100) = 'maltster',
@WordLength INT
--------------------------------------------------------------------------
-- demonstration version - see how it works
--------------------------------------------------------------------------
SELECT @WordLength = MAX(WordLength) FROM (SELECT WordLength = DATALENGTH(@Reference) UNION ALL SELECT DATALENGTH(@Target)) d
SET @Reference = LEFT(@Reference + REPLICATE('_',@WordLength),@WordLength)
SET @Target = LEFT(@Target + REPLICATE('_',@WordLength),@WordLength)
;WITH Tally AS (SELECT TOP(@WordLength) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a (n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b (n)),
Calculator1 AS (
SELECT
t.row, tLetter,
r.col, rLetter,
x.result,
best_result = MAX(x.result) OVER(PARTITION BY r.Col)
FROM ( -- matrix columns from reference word
SELECT [col] = n, rLetter = SUBSTRING(@Reference, Tally.n, 1)
FROM Tally
) r
CROSS JOIN ( -- matrix rows from target word
SELECT [row] = n, tLetter = SUBSTRING(@Target, Tally.n, 1)
FROM Tally
) t
CROSS APPLY (
SELECT result = (CASE
WHEN r.col = t.[row] AND rLetter = tLetter THEN 1 -- equal
WHEN rLetter = tLetter THEN (LEN(@Reference)-ABS(r.col-t.[row]))*1.00/
(LEN(@Reference+@Target)/2) -- movement
WHEN r.col = t.[row] AND '_' NOT IN (rLetter,tLetter) THEN 0.1 -- substitution
WHEN r.col = t.[row] THEN 0 -- missing (end of word)
ELSE 0 END)
) x
)
SELECT * FROM Calculator1 ORDER BY col,row
--------------------------------------------------------------
-- working version
--------------------------------------------------------------
;WITH Tally AS (SELECT TOP(@WordLength) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a (n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b (n)),
Calculator1 AS (
SELECT
r.col, rLetter,
best_result = MAX(x.result) OVER(PARTITION BY r.Col)
FROM (SELECT [col] = n, rLetter = SUBSTRING(@Reference, Tally.n, 1) FROM Tally) r
CROSS JOIN (SELECT [row] = n, tLetter = SUBSTRING(@Target, Tally.n, 1) FROM Tally) t
CROSS APPLY (
SELECT result = (CASE
WHEN r.col = t.[row] AND rLetter = tLetter THEN 1 -- equal
WHEN rLetter = tLetter THEN (LEN(@Reference)-ABS(r.col-t.[row]))*1.00/
(LEN(@Reference+@Target)/2) -- movement
WHEN r.col = t.[row] AND '_' NOT IN (rLetter,tLetter) THEN 0.1 -- substitution
WHEN r.col = t.[row] THEN 0 -- missing (end of word)
ELSE 0 END)
) x
WHERE x.result > 0
)
SELECT LikenessRatio = SUM(Score)/LEN(@Reference)
FROM (
SELECT Score = MAX(best_result)
FROM Calculator1
GROUP BY col, rLetter
) d
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
January 15, 2013 at 11:52 am
ChrisM@Work (1/15/2013)
Alan.B (1/14/2013)
Jeff Moden (1/10/2013)
Alan.B (1/10/2013)
This morning before work for example, after a lot of effort, I finally figured out how to get the Levenshtein Edit Distance between 2 strings without a loop (just for fun because it's something I've never seen done).
Heh... if that's how you warm up for the day, then you've got me beat by a mile.
Perhaps, if I got it right ;). I did not get it correct but I certainly found a more interesting SQL excercize than Fizzbuzz. I'm going to keep trying to come up with a 100% set-based Levenshtein unless someone beats me to it.
Not yet - at least, not in half an hour over lunch. However, I don't think this is far off and is easily modified:
DECLARE @Reference VARCHAR(100) = 'maltesers',
@Target VARCHAR(100) = 'maltster',
@WordLength INT
--------------------------------------------------------------------------
-- demonstration version - see how it works
--------------------------------------------------------------------------
SELECT @WordLength = MAX(WordLength) FROM (SELECT WordLength = DATALENGTH(@Reference) UNION ALL SELECT DATALENGTH(@Target)) d
SET @Reference = LEFT(@Reference + REPLICATE('_',@WordLength),@WordLength)
SET @Target = LEFT(@Target + REPLICATE('_',@WordLength),@WordLength)
;WITH Tally AS (SELECT TOP(@WordLength) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a (n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b (n)),
Calculator1 AS (
SELECT
t.row, tLetter,
r.col, rLetter,
x.result,
best_result = MAX(x.result) OVER(PARTITION BY r.Col)
FROM ( -- matrix columns from reference word
SELECT [col] = n, rLetter = SUBSTRING(@Reference, Tally.n, 1)
FROM Tally
) r
CROSS JOIN ( -- matrix rows from target word
SELECT [row] = n, tLetter = SUBSTRING(@Target, Tally.n, 1)
FROM Tally
) t
CROSS APPLY (
SELECT result = (CASE
WHEN r.col = t.[row] AND rLetter = tLetter THEN 1 -- equal
WHEN rLetter = tLetter THEN (LEN(@Reference)-ABS(r.col-t.[row]))*1.00/
(LEN(@Reference+@Target)/2) -- movement
WHEN r.col = t.[row] AND '_' NOT IN (rLetter,tLetter) THEN 0.1 -- substitution
WHEN r.col = t.[row] THEN 0 -- missing (end of word)
ELSE 0 END)
) x
)
SELECT * FROM Calculator1 ORDER BY col,row
--------------------------------------------------------------
-- working version
--------------------------------------------------------------
;WITH Tally AS (SELECT TOP(@WordLength) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a (n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b (n)),
Calculator1 AS (
SELECT
r.col, rLetter,
best_result = MAX(x.result) OVER(PARTITION BY r.Col)
FROM (SELECT [col] = n, rLetter = SUBSTRING(@Reference, Tally.n, 1) FROM Tally) r
CROSS JOIN (SELECT [row] = n, tLetter = SUBSTRING(@Target, Tally.n, 1) FROM Tally) t
CROSS APPLY (
SELECT result = (CASE
WHEN r.col = t.[row] AND rLetter = tLetter THEN 1 -- equal
WHEN rLetter = tLetter THEN (LEN(@Reference)-ABS(r.col-t.[row]))*1.00/
(LEN(@Reference+@Target)/2) -- movement
WHEN r.col = t.[row] AND '_' NOT IN (rLetter,tLetter) THEN 0.1 -- substitution
WHEN r.col = t.[row] THEN 0 -- missing (end of word)
ELSE 0 END)
) x
WHERE x.result > 0
)
SELECT LikenessRatio = SUM(Score)/LEN(@Reference)
FROM (
SELECT Score = MAX(best_result)
FROM Calculator1
GROUP BY col, rLetter
) d
Hey Chris and Alan,
I haven't fully dissected Chris's code yet, but how does the "Likeness Ratio" produced by this code compare to the Levenshtein Distance?
I've been working on a set-based T-SQL solution to determine the Damerau-Levenshtein Distance between two strings. (DLD distance is just like LD - the number of changes required to transform one string into the other - but DLD allows transpositions (counting as one change) as well as the additions, deletions, and substitutions allowed by LD). My code is not yet working reliably, but it looks very similar to Chris's. If I ever get it working, I'll post it, but it's been about six weeks since my "real" work has allowed me any time to tinker with it.
Jason
Jason Wolfkill
January 15, 2013 at 12:02 pm
wolfkillj
it's been about six weeks since my "real" work has allowed me any time to tinker with it.
Don't you just hate when that happens? I think I'm going for a brute force, no holds barred approach.
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
January 15, 2013 at 5:34 pm
wolfkillj (1/15/2013)
I've been working on a set-based T-SQL solution to determine the Damerau-Levenshtein Distance between two strings. (DLD distance is just like LD - the number of changes required to transform one string into the other - but DLD allows transpositions (counting as one change) as well as the additions, deletions, and substitutions allowed by LD). My code is not yet working reliably, but it looks very similar to Chris's. If I ever get it working, I'll post it, but it's been about six weeks since my "real" work has allowed me any time to tinker with it.Jason
Jason - DLD is actually the approach applied in the article I linked to in my prior post. Forgot to mention it. You may want to take a look because the author spent a lot of time trying to optimize his solution and it might save you a bit of effort.
Don't mean for it to take away the challenge though. π
dwain.c (1/14/2013)
Jeff Moden (1/14/2013)
dwain.c (1/14/2013)
Good grief! Now I suppose I'm going to need to put up or shut up.Heh... glad it's not me this time. π
Maybe I don't need to because it's already been done.
http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
Seems I'd forgotten about this article, even though I believe I posted something to its discussion thread.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
January 15, 2013 at 5:57 pm
dwain.c (1/15/2013)
wolfkillj (1/15/2013)
I've been working on a set-based T-SQL solution to determine the Damerau-Levenshtein Distance between two strings. (DLD distance is just like LD - the number of changes required to transform one string into the other - but DLD allows transpositions (counting as one change) as well as the additions, deletions, and substitutions allowed by LD). My code is not yet working reliably, but it looks very similar to Chris's. If I ever get it working, I'll post it, but it's been about six weeks since my "real" work has allowed me any time to tinker with it.Jason
Jason - DLD is actually the approach applied in the article I linked to in my prior post. Forgot to mention it. You may want to take a look because the author spent a lot of time trying to optimize his solution and it might save you a bit of effort.
Don't mean for it to take away the challenge though. π
dwain.c (1/14/2013)
Jeff Moden (1/14/2013)
dwain.c (1/14/2013)
Good grief! Now I suppose I'm going to need to put up or shut up.Heh... glad it's not me this time. π
Maybe I don't need to because it's already been done.
http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
Seems I'd forgotten about this article, even though I believe I posted something to its discussion thread.
Actually, that article was my starting point in challenging myself to come up with a true set-based DLD solution. The code posted there is functional but requires a lot of control-of-flow language. I hope to come up with a solution that can be used as an inline table-valued function. I'd say I've got it about 90% solved, but there are still some inconsistencies to resolve.
Jason Wolfkill
Viewing 15 posts - 16 through 30 (of 31 total)
You must be logged in to reply to this topic. Login to reply