Better Way to Perform this Query

  • dwain.c

    I for one have never heard of this famous "Levenshtein Edit Distance" problem, but you can rest assured that now I'm going to have to take a look at it as well!

    I guess you're really busy Dwain (is that the Jeopardy theme in the background?), since you've not come back with anything yet.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Alan.B

    I came up with this:

    -- strings to compare

    DECLARE@s1 varchar(8000)='diner',

    @s2 varchar(8000)='dinerr';

    DECLARE @Ld int=ABS(LEN(@s1)-LEN(@s2));

    IF ((@s1=@s2) OR ((ISNULL(LEN(@s1)*LEN(@s2),0)=0))) BEGIN GOTO LD END;

    DECLARE@minlen int=CASE WHEN LEN(@s1)>LEN(@s2) THEN LEN(@s2) ELSE LEN(@s1) END;

    ;WITH

    nums(n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM [master].dbo.spt_values Tally),

    matrix AS (SELECT SUBSTRING(@s1,n,1) s1, SUBSTRING(@s2,n,1) s2 FROM nums WHERE n<=@minlen)

    SELECT @Ld+=COUNT(*) FROM matrix WHERE s1<>s2;

    LD:

    SELECT @Ld AS LD;

    Alan, I had never heard of this either. If the Wiki description Dwain posted is to be trusted, I think you might want to re-visit this. For example, for the two strings you provided, 'diner' and 'dinerr', the output is 1. This makes sense, because all you have to do is delete the second 'r' in string 2 (or the first 'r' for that matter), and you end up with the same two strings. Next, let's say I change the 'd' in string 1 to 'a'. The code returns 2, which I think is correct, since all I have to do is substitute 'a' in string 1 to 'd', then delete one of the 'r' in string 2, and I have two of the same string. Now, consider this: let's say I insert the 'a' in string 1, but leave the rest intact, leaving me with 'adiner' for string 1, and I leave string 2 as 'dinerr'. The code returns a value of 5. This does not make sense, again, if Wiki is to be trusted, as it states the allowable actions are single character insertions, deletions, and substitutions. So, in order to make 'adiner' = 'dinerr', I only need two actions: delete the 'a' in 'adiner', and delete one of the 'r' in 'dinerr'. Does that make sense, or am I missing something? (the latter is entirely possible.) (maybe even likely).

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Greg Snidow (1/11/2013)


    dwain.c

    I for one have never heard of this famous "Levenshtein Edit Distance" problem, but you can rest assured that now I'm going to have to take a look at it as well!

    I guess you're really busy Dwain (is that the Jeopardy theme in the background?), since you've not come back with anything yet.

    You betcha! That crazy Vertex Covers puzzle has me losing sleep! :w00t:

    Looks like you may have beaten me to it.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c

    Looks like you may have beaten me to it.

    Oh, don't worry about that, Dwain. Admittedly, I did look at it thinking it was only a minor tweaking of Alan's code, but I think it may be more complicated than that when you really ponder on it.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Greg Snidow (1/11/2013)


    Alan.B

    I came up with this:

    -- strings to compare

    DECLARE@s1 varchar(8000)='diner',

    @s2 varchar(8000)='dinerr';

    DECLARE @Ld int=ABS(LEN(@s1)-LEN(@s2));

    IF ((@s1=@s2) OR ((ISNULL(LEN(@s1)*LEN(@s2),0)=0))) BEGIN GOTO LD END;

    DECLARE@minlen int=CASE WHEN LEN(@s1)>LEN(@s2) THEN LEN(@s2) ELSE LEN(@s1) END;

    ;WITH

    nums(n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM [master].dbo.spt_values Tally),

    matrix AS (SELECT SUBSTRING(@s1,n,1) s1, SUBSTRING(@s2,n,1) s2 FROM nums WHERE n<=@minlen)

    SELECT @Ld+=COUNT(*) FROM matrix WHERE s1<>s2;

    LD:

    SELECT @Ld AS LD;

    Alan, I had never heard of this either. If the Wiki description Dwain posted is to be trusted, I think you might want to re-visit this. For example, for the two strings you provided, 'diner' and 'dinerr', the output is 1. This makes sense, because all you have to do is delete the second 'r' in string 2 (or the first 'r' for that matter), and you end up with the same two strings. Next, let's say I change the 'd' in string 1 to 'a'. The code returns 2, which I think is correct, since all I have to do is substitute 'a' in string 1 to 'd', then delete one of the 'r' in string 2, and I have two of the same string. Now, consider this: let's say I insert the 'a' in string 1, but leave the rest intact, leaving me with 'adiner' for string 1, and I leave string 2 as 'dinerr'. The code returns a value of 5. This does not make sense, again, if Wiki is to be trusted, as it states the allowable actions are single character insertions, deletions, and substitutions. So, in order to make 'adiner' = 'dinerr', I only need two actions: delete the 'a' in 'adiner', and delete one of the 'r' in 'dinerr'. Does that make sense, or am I missing something? (the latter is entirely possible.) (maybe even likely).

    Sorry for checking back in late...

    Levenshtrein was (is) new to me too. I did a little research and, yes, you are correct - I did not get it correct. Not only that, I was not able to get it correct (without a loop) after a lot of effort. That Levenshtein business is a little more complicated than I thought. I am going to keep at it and I'm dying to see what Dwain puts together.

    Nonetheless - I was successful at creating a purely set-based version of what I thought the would produce the Levenshtein Distance; and did so using a tally table :). The goal was to improve my tally table skills more than anything.

    If you feed my function 2 strings of equally length it will provide you with the Hamming Distance (not as big of a deal).

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Jeff Moden (1/10/2013)


    Alan.B (1/10/2013)


    This morning before work for example, after a lot of effort, I finally figured out how to get the Levenshtein Edit Distance between 2 strings without a loop (just for fun because it's something I've never seen done).

    Heh... if that's how you warm up for the day, then you've got me beat by a mile.

    Perhaps, if I got it right ;). I did not get it correct but I certainly found a more interesting SQL excercize than Fizzbuzz. I'm going to keep trying to come up with a 100% set-based Levenshtein unless someone beats me to it.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B

    That Levenshtein business is a little more complicated than I thought. I am going to keep at it and I'm dying to see what Dwain puts together.

    I kind of thought it would be complicated. Think about this. My last semester in grad school I took a Python class just for fun, and all the work was done on unix machines. The professor had this text editor that would open two files, and display the number of differences between the two files. You could then type some commands, and it would highlight the differences. Pretty nifty for comparing script files. I never really thought about how it worked until I saw your post. Maybe that is like my discovering that the sun comes up in the east, but it was new to me. I'm eager to see what you come up with.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Alan.B (1/14/2013)


    That Levenshtein business is a little more complicated than I thought. I am going to keep at it and I'm dying to see what Dwain puts together.

    Good grief! Now I suppose I'm going to need to put up or shut up.

    Unfortunately, for the last couple of days I've been plagued by problems on my laptop (unexpected shutdowns that occur without warning). So I haven't been able to get much of anything done.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (1/14/2013)


    Good grief! Now I suppose I'm going to need to put up or shut up.

    Heh... glad it's not me this time. πŸ˜›

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/14/2013)


    dwain.c (1/14/2013)


    Good grief! Now I suppose I'm going to need to put up or shut up.

    Heh... glad it's not me this time. πŸ˜›

    Maybe I don't need to because it's already been done.

    http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/

    Seems I'd forgotten about this article, even though I believe I posted something to its discussion thread.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Alan.B (1/14/2013)


    Jeff Moden (1/10/2013)


    Alan.B (1/10/2013)


    This morning before work for example, after a lot of effort, I finally figured out how to get the Levenshtein Edit Distance between 2 strings without a loop (just for fun because it's something I've never seen done).

    Heh... if that's how you warm up for the day, then you've got me beat by a mile.

    Perhaps, if I got it right ;). I did not get it correct but I certainly found a more interesting SQL excercize than Fizzbuzz. I'm going to keep trying to come up with a 100% set-based Levenshtein unless someone beats me to it.

    Not yet - at least, not in half an hour over lunch. However, I don't think this is far off and is easily modified:

    DECLARE @Reference VARCHAR(100) = 'maltesers',

    @Target VARCHAR(100) = 'maltster',

    @WordLength INT

    --------------------------------------------------------------------------

    -- demonstration version - see how it works

    --------------------------------------------------------------------------

    SELECT @WordLength = MAX(WordLength) FROM (SELECT WordLength = DATALENGTH(@Reference) UNION ALL SELECT DATALENGTH(@Target)) d

    SET @Reference = LEFT(@Reference + REPLICATE('_',@WordLength),@WordLength)

    SET @Target = LEFT(@Target + REPLICATE('_',@WordLength),@WordLength)

    ;WITH Tally AS (SELECT TOP(@WordLength) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a (n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b (n)),

    Calculator1 AS (

    SELECT

    t.row, tLetter,

    r.col, rLetter,

    x.result,

    best_result = MAX(x.result) OVER(PARTITION BY r.Col)

    FROM ( -- matrix columns from reference word

    SELECT [col] = n, rLetter = SUBSTRING(@Reference, Tally.n, 1)

    FROM Tally

    ) r

    CROSS JOIN ( -- matrix rows from target word

    SELECT [row] = n, tLetter = SUBSTRING(@Target, Tally.n, 1)

    FROM Tally

    ) t

    CROSS APPLY (

    SELECT result = (CASE

    WHEN r.col = t.[row] AND rLetter = tLetter THEN 1 -- equal

    WHEN rLetter = tLetter THEN (LEN(@Reference)-ABS(r.col-t.[row]))*1.00/

    (LEN(@Reference+@Target)/2) -- movement

    WHEN r.col = t.[row] AND '_' NOT IN (rLetter,tLetter) THEN 0.1 -- substitution

    WHEN r.col = t.[row] THEN 0 -- missing (end of word)

    ELSE 0 END)

    ) x

    )

    SELECT * FROM Calculator1 ORDER BY col,row

    --------------------------------------------------------------

    -- working version

    --------------------------------------------------------------

    ;WITH Tally AS (SELECT TOP(@WordLength) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a (n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b (n)),

    Calculator1 AS (

    SELECT

    r.col, rLetter,

    best_result = MAX(x.result) OVER(PARTITION BY r.Col)

    FROM (SELECT [col] = n, rLetter = SUBSTRING(@Reference, Tally.n, 1) FROM Tally) r

    CROSS JOIN (SELECT [row] = n, tLetter = SUBSTRING(@Target, Tally.n, 1) FROM Tally) t

    CROSS APPLY (

    SELECT result = (CASE

    WHEN r.col = t.[row] AND rLetter = tLetter THEN 1 -- equal

    WHEN rLetter = tLetter THEN (LEN(@Reference)-ABS(r.col-t.[row]))*1.00/

    (LEN(@Reference+@Target)/2) -- movement

    WHEN r.col = t.[row] AND '_' NOT IN (rLetter,tLetter) THEN 0.1 -- substitution

    WHEN r.col = t.[row] THEN 0 -- missing (end of word)

    ELSE 0 END)

    ) x

    WHERE x.result > 0

    )

    SELECT LikenessRatio = SUM(Score)/LEN(@Reference)

    FROM (

    SELECT Score = MAX(best_result)

    FROM Calculator1

    GROUP BY col, rLetter

    ) d

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (1/15/2013)


    Alan.B (1/14/2013)


    Jeff Moden (1/10/2013)


    Alan.B (1/10/2013)


    This morning before work for example, after a lot of effort, I finally figured out how to get the Levenshtein Edit Distance between 2 strings without a loop (just for fun because it's something I've never seen done).

    Heh... if that's how you warm up for the day, then you've got me beat by a mile.

    Perhaps, if I got it right ;). I did not get it correct but I certainly found a more interesting SQL excercize than Fizzbuzz. I'm going to keep trying to come up with a 100% set-based Levenshtein unless someone beats me to it.

    Not yet - at least, not in half an hour over lunch. However, I don't think this is far off and is easily modified:

    DECLARE @Reference VARCHAR(100) = 'maltesers',

    @Target VARCHAR(100) = 'maltster',

    @WordLength INT

    --------------------------------------------------------------------------

    -- demonstration version - see how it works

    --------------------------------------------------------------------------

    SELECT @WordLength = MAX(WordLength) FROM (SELECT WordLength = DATALENGTH(@Reference) UNION ALL SELECT DATALENGTH(@Target)) d

    SET @Reference = LEFT(@Reference + REPLICATE('_',@WordLength),@WordLength)

    SET @Target = LEFT(@Target + REPLICATE('_',@WordLength),@WordLength)

    ;WITH Tally AS (SELECT TOP(@WordLength) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a (n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b (n)),

    Calculator1 AS (

    SELECT

    t.row, tLetter,

    r.col, rLetter,

    x.result,

    best_result = MAX(x.result) OVER(PARTITION BY r.Col)

    FROM ( -- matrix columns from reference word

    SELECT [col] = n, rLetter = SUBSTRING(@Reference, Tally.n, 1)

    FROM Tally

    ) r

    CROSS JOIN ( -- matrix rows from target word

    SELECT [row] = n, tLetter = SUBSTRING(@Target, Tally.n, 1)

    FROM Tally

    ) t

    CROSS APPLY (

    SELECT result = (CASE

    WHEN r.col = t.[row] AND rLetter = tLetter THEN 1 -- equal

    WHEN rLetter = tLetter THEN (LEN(@Reference)-ABS(r.col-t.[row]))*1.00/

    (LEN(@Reference+@Target)/2) -- movement

    WHEN r.col = t.[row] AND '_' NOT IN (rLetter,tLetter) THEN 0.1 -- substitution

    WHEN r.col = t.[row] THEN 0 -- missing (end of word)

    ELSE 0 END)

    ) x

    )

    SELECT * FROM Calculator1 ORDER BY col,row

    --------------------------------------------------------------

    -- working version

    --------------------------------------------------------------

    ;WITH Tally AS (SELECT TOP(@WordLength) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a (n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b (n)),

    Calculator1 AS (

    SELECT

    r.col, rLetter,

    best_result = MAX(x.result) OVER(PARTITION BY r.Col)

    FROM (SELECT [col] = n, rLetter = SUBSTRING(@Reference, Tally.n, 1) FROM Tally) r

    CROSS JOIN (SELECT [row] = n, tLetter = SUBSTRING(@Target, Tally.n, 1) FROM Tally) t

    CROSS APPLY (

    SELECT result = (CASE

    WHEN r.col = t.[row] AND rLetter = tLetter THEN 1 -- equal

    WHEN rLetter = tLetter THEN (LEN(@Reference)-ABS(r.col-t.[row]))*1.00/

    (LEN(@Reference+@Target)/2) -- movement

    WHEN r.col = t.[row] AND '_' NOT IN (rLetter,tLetter) THEN 0.1 -- substitution

    WHEN r.col = t.[row] THEN 0 -- missing (end of word)

    ELSE 0 END)

    ) x

    WHERE x.result > 0

    )

    SELECT LikenessRatio = SUM(Score)/LEN(@Reference)

    FROM (

    SELECT Score = MAX(best_result)

    FROM Calculator1

    GROUP BY col, rLetter

    ) d

    Hey Chris and Alan,

    I haven't fully dissected Chris's code yet, but how does the "Likeness Ratio" produced by this code compare to the Levenshtein Distance?

    I've been working on a set-based T-SQL solution to determine the Damerau-Levenshtein Distance between two strings. (DLD distance is just like LD - the number of changes required to transform one string into the other - but DLD allows transpositions (counting as one change) as well as the additions, deletions, and substitutions allowed by LD). My code is not yet working reliably, but it looks very similar to Chris's. If I ever get it working, I'll post it, but it's been about six weeks since my "real" work has allowed me any time to tinker with it.

    Jason

    Jason Wolfkill

  • wolfkillj

    it's been about six weeks since my "real" work has allowed me any time to tinker with it.

    Don't you just hate when that happens? I think I'm going for a brute force, no holds barred approach.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • wolfkillj (1/15/2013)


    I've been working on a set-based T-SQL solution to determine the Damerau-Levenshtein Distance between two strings. (DLD distance is just like LD - the number of changes required to transform one string into the other - but DLD allows transpositions (counting as one change) as well as the additions, deletions, and substitutions allowed by LD). My code is not yet working reliably, but it looks very similar to Chris's. If I ever get it working, I'll post it, but it's been about six weeks since my "real" work has allowed me any time to tinker with it.

    Jason

    Jason - DLD is actually the approach applied in the article I linked to in my prior post. Forgot to mention it. You may want to take a look because the author spent a lot of time trying to optimize his solution and it might save you a bit of effort.

    Don't mean for it to take away the challenge though. πŸ˜€

    dwain.c (1/14/2013)


    Jeff Moden (1/14/2013)


    dwain.c (1/14/2013)


    Good grief! Now I suppose I'm going to need to put up or shut up.

    Heh... glad it's not me this time. πŸ˜›

    Maybe I don't need to because it's already been done.

    http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/

    Seems I'd forgotten about this article, even though I believe I posted something to its discussion thread.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (1/15/2013)


    wolfkillj (1/15/2013)


    I've been working on a set-based T-SQL solution to determine the Damerau-Levenshtein Distance between two strings. (DLD distance is just like LD - the number of changes required to transform one string into the other - but DLD allows transpositions (counting as one change) as well as the additions, deletions, and substitutions allowed by LD). My code is not yet working reliably, but it looks very similar to Chris's. If I ever get it working, I'll post it, but it's been about six weeks since my "real" work has allowed me any time to tinker with it.

    Jason

    Jason - DLD is actually the approach applied in the article I linked to in my prior post. Forgot to mention it. You may want to take a look because the author spent a lot of time trying to optimize his solution and it might save you a bit of effort.

    Don't mean for it to take away the challenge though. πŸ˜€

    dwain.c (1/14/2013)


    Jeff Moden (1/14/2013)


    dwain.c (1/14/2013)


    Good grief! Now I suppose I'm going to need to put up or shut up.

    Heh... glad it's not me this time. πŸ˜›

    Maybe I don't need to because it's already been done.

    http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/

    Seems I'd forgotten about this article, even though I believe I posted something to its discussion thread.

    Actually, that article was my starting point in challenging myself to come up with a true set-based DLD solution. The code posted there is functional but requires a lot of control-of-flow language. I hope to come up with a solution that can be used as an inline table-valued function. I'd say I've got it about 90% solved, but there are still some inconsistencies to resolve.

    Jason Wolfkill

Viewing 15 posts - 16 through 30 (of 31 total)

You must be logged in to reply to this topic. Login to reply