Levenshtein UDF performance 'tuning' advise needed

  • mike--p (11/7/2014)


    ...

    Yes my join was correct, here the results from your query. As you can see it scans through all the columns (plus one too much highligthed in red). I don't see why I would use this over the query I posted? Thanks 😛

    ...

    Your join cannot have been correct to display the results shown in your post 😛

    The query you posted

    SELECT T1.*, T3.* FROM table_1 AS T1

    LEFT JOIN table_2 AS T2 ON (T1.[column_T1] = T2.[column_T2)

    LEFT JOIN table_2 AS T2_2 ON (Levenshtein(T1.[column_T1], T2_2.[column_T2]) > percentage

    reads table 1 once and table 2 twice as a crossjoin (6000x6000 = 36000000 rows for your full set), calculating each match twice. My query reads each table only once and performs the match only once, using a triangular join.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • To get all the results with your queries I should do both T1.rij <= T2.rij and T2.rij <= T1.rij. If I only do the first I won't get 'AQQ313' as result, due to its higher row number. If I do the T2.rij <= T1.rij than I indeed get 'AQQ313' as result :-).

    I see that my query is reading table2 twice, but how is it possible my query runs twice as fast as yours?

    Besides that I get too much results back from your query because it is matching with the levenshtein function even if the partnumbers are exactly the same between the two files.

    Again what I'm trying to do is only match(with the use of levenshtein) on the description whenever NO match was found on the part numbers. That is why I thought I had to use two JOINS. The first JOIN searches for exact matches on part numbers. If no result is found you get a NULL returned. The second JOIN is then executed whenever a null is found in the previous join.

    Again if I have these two tables (highlighted in red should be the match):

    To filter out the exact matches I'll first do a check on the partnumbers (note: the user should chose this column. The more 100% matches between the columns, the less levenshtein calculations will be done).

    This filtering I do with a single LEFT JOIN. The result set of this single LEFT join looks like this, as expected only the object with partnumber 105 isn't having an exact match, thus the right side is having a NULL as value:

    Then the second LEFT JOIN is again using the second table (2nd time scanning this table?), but now the condition depends on the similarity between the description strings. The second JOIN is only executed whenever the previous JOIN has a value of NULL, thus is only executed on partnumber 105. The result looks like the following:

    And the query once more :-D:

    SELECT T1.*, T3.*, master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) AS percentage FROM Vergelijkingstabel_1 AS T1

    LEFT join Vergelijkingstabel_2 AS T2 ON (T1.Partnumber = T2.Partnumber)

    LEFT join Vergelijkingstabel_2 AS T3 ON (master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) >= 80)

    WHERE T2.Partnumber IS NULL;

    If I use your query I get way too much results as I'm not filtering the partnumbers and directly search on the descriptions, the query: SELECT *, master.dbo.fn_Levenshtein(T1.Description, T2.Description, 80) AS [Percentage(%)]

    FROM (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Description) FROM Vergelijkingstabel_1) AS T1

    INNER JOIN (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Description) FROM Vergelijkingstabel_2) AS T2

    ON T1.rij <= T2.rij WHERE master.dbo.fn_Levenshtein(T1.Description, T2.Description, 80) >= 80;

    And the result set:

    Maybe I'm just not understanding you correctly, haha! :blush:

  • -- Yes, you are still missing the point I'm trying to make.

    -- Also, your strange query is equivalent to this

    SELECT

    T1.*,

    T3.*,

    master.dbo.fn_Levenshtein(T1.Omschrijving, T3.Omschrijving, 80) AS percentage

    FROM Vergelijkingstabel_1 AS T1

    LEFT JOIN Vergelijkingstabel_2 AS T3 ON (master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) >= 80)

    WHERE NOT EXISTS (SELECT 1 FROM Vergelijkingstabel_2 AS T2 WHERE T2.Bestelnummer = T1.Bestelnummer)

    -- I think you want to omit comparisons where Bestelnummer matches between the two tables. Instead, you're omitting rows

    -- from Vergelijkingstabel_2 if the Bestelnummer exists anywhere in Vergelijkingstabel_1.

    -- Are you sure this is what you intend?

    -- Try this (if you can provide sample tables I could try it for you)

    -- Measure the highest value of Bestelnummer in both tables.

    -- If the highest value is in T3, use this,

    -- where x is the name of the column returned by master.dbo.fn_Levenshtein

    SELECT

    T1.*,

    T3.*,

    ou.x AS percentage

    FROM #Vergelijkingstabel_1 AS T1

    OUTER APPLY (

    SELECT T3.*, x

    FROM #Vergelijkingstabel_2 AS T3

    CROSS APPLY master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80)

    WHERE T3.Bestelnummer > T1.Bestelnummer AND x >= 80

    ) ou

    -- If the highest value is in T1, use this:

    SELECT

    T1.*,

    T3.*,

    ou.x AS percentage

    FROM #Vergelijkingstabel_1 AS T1

    OUTER APPLY (

    SELECT T3.*, x

    FROM #Vergelijkingstabel_2 AS T3

    CROSS APPLY master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80)

    WHERE T3.Bestelnummer < T1.Bestelnummer AND x >= 80

    ) ou

    -- An example of sample tables

    DROP TABLE #Vergelijkingstabel_1

    CREATE TABLE #Vergelijkingstabel_1 (Bestelnummer INT, Omschrijving VARCHAR(20))

    INSERT INTO #Vergelijkingstabel_1 (Bestelnummer, Omschrijving)

    VALUES (101,'ABN101'),(102,'ZBN'),(103,'KDN154'),(104,'ADN235'),(105,'QND999'),(110,'QQQ313'),(112,'ABL513')

    -- 7 rows

    DROP TABLE #Vergelijkingstabel_2

    CREATE TABLE #Vergelijkingstabel_2 (Bestelnummer INT, Omschrijving VARCHAR(20))

    INSERT INTO #Vergelijkingstabel_2 (Bestelnummer, Omschrijving)

    VALUES (100,'QQQ312'),(101,'ABN101'),(102,'ZBN103'),(103,'KDN154'),(104,'ADN235'),(106,'QND999'),

    (365,'AAA555'),(34343,'QND999'),(4444,'QND999'),(111,'ZQQ319'),(11233,'AQQ313')

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    Sorry for the late reply!! Tommorow I can give the queries you provided a try at our database.

    What I want to achieve:

    - First the user has to choose two columns that have the most 100% matches (for instance the part numbers).

    - Then the user has to choose two columns (for instance the descriptions) that will be matched by the levenshtein algorithm.

    - So ONLY the records that has no 100% match to the second table (no part number was found in the second table) should be matched (on the description columns) with the use of the levenshtein algorithm.

    Have a nice weekend :-D;-)

  • ChrisM@Work (11/7/2014)


    -- Yes, you are still missing the point I'm trying to make.

    -- Try this (if you can provide sample tables I could try it for you)

    -- Measure the highest value of Bestelnummer in both tables.

    -- If the highest value is in T3, use this,

    -- where x is the name of the column returned by master.dbo.fn_Levenshtein

    SELECT

    T1.*,

    T3.*,

    ou.x AS percentage

    FROM #Vergelijkingstabel_1 AS T1

    OUTER APPLY (

    SELECT T3.*, x

    FROM #Vergelijkingstabel_2 AS T3

    CROSS APPLY master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80)

    WHERE T3.Bestelnummer > T1.Bestelnummer AND x >= 80

    ) ou

    -- If the highest value is in T1, use this:

    SELECT

    T1.*,

    T3.*,

    ou.x AS percentage

    FROM #Vergelijkingstabel_1 AS T1

    OUTER APPLY (

    SELECT T3.*, x

    FROM #Vergelijkingstabel_2 AS T3

    CROSS APPLY master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80)

    WHERE T3.Bestelnummer < T1.Bestelnummer AND x >= 80

    ) ou

    The first query you posted is indeed working the same as mine. However yours makes more sense than the two joins then I came up with, thanks.

    The two queries in the quote I can't get them to execute. It's saying that the function is an invalid object name.

    Shouldn't the cross apply be referenced 'AS x'? Also shouldn't there be a SELECT in the cross apply (the same you did in the outer apply?).

  • Hi Mike

    Working without sample data always comes with the risk of syntax errors - can you prepare some please?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • -- sample table 1

    DROP TABLE #Table_1

    CREATE TABLE #Table_1 (Partnumber INT, Description VARCHAR(20))

    INSERT INTO #Table_1 (Partnumber, Description)

    VALUES (101,'QQQ313'),(102,'DFE911'),(103,'QKL721'),(104,'QQL555'),(105,'DLL323'), (121, 'QQQ313')

    -- sample table 2

    DROP TABLE #Table_2

    CREATE TABLE #Table_2 (Partnumber INT, Description VARCHAR(20))

    INSERT INTO #Table_2 (Partnumber, Description)

    VALUES (101,'QQQ313'),(102,'DFE900'),(103,'QKL721'),(104,'QLL555'),(115,'DLL323'), (125, 'QQQ313'), (189, 'QQQ312')

    -- 1. Search for exact matches on the partnumbers

    -- 2. Search for approximate match with the use of Levenshtein in the descriptions. ONLY search the non mathced records.

    -- Expected result sets:

    -- Exact result set (Partnumbers):

    -- | Partnumber T1 | Description T1 | Partnumber T2 | Description T2 |

    -- | 101 | QQQ313 | 101 | QQQ313 |

    -- | 102 | DFE911 | 102 | DFE900 |

    -- | 103 | QKL721 | 103 | QKL721 |

    -- | 104 | QQL555 | 104 | QLL555 |

    -- Approximate result set (similarity has to be higher than 70% on the descriptions):

    -- | Partnumber T1 | Description T1 | Partnumber T2 | Description T2 | Percentage |

    -- | 105 | DLL323 | 115 | DLL323 | 100 |

    -- | 121 | QQQ313 | 125 | QQQ313 | 100 |

    -- | 121 | QQQ313 | 189 | QQQ312 | 83,3 |

    -- NOTE that the levenshtein should not be executed on part number 101, as this is already matched on the first result set!

    Here you go 🙂

  • Are you expecting two result sets? If so, here's the second:

    -- Using UDF

    SELECT

    t1.*,

    t3.*

    FROM #Vergelijkingstabel_1 t1

    OUTER APPLY (

    SELECT t2.*, x.percentage

    FROM #Vergelijkingstabel_2 t2

    CROSS APPLY (SELECT percentage = dbo.fn_Levenshtein (t1.Omschrijving, t2.Omschrijving, 80)) x

    WHERE NOT EXISTS (SELECT 1 FROM #Vergelijkingstabel_1 AS tx WHERE tx.Bestelnummer = t2.Bestelnummer)

    AND x.percentage>= 80

    ) t3

    -- Using iTVF

    SELECT

    t1.*,

    t3.*

    FROM #Vergelijkingstabel_1 t1

    OUTER APPLY (

    SELECT t2.*, percentage = x.[Score %]

    FROM #Vergelijkingstabel_2 t2

    CROSS APPLY dbo.IF_Levenshtein02 (t1.Omschrijving, t2.Omschrijving) x

    WHERE NOT EXISTS (SELECT 1 FROM #Vergelijkingstabel_1 AS tx WHERE tx.Bestelnummer = t2.Bestelnummer)

    AND x.[Score %]>= 80

    ) t3

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • If i do this I get all the rows returned. What I want is two seperate tables. To get the first result set (exact matches on the partnumbers) I do the following:

    SELECT T1.*, T2.* FROM Table_1 AS T1

    INNER JOIN Table_2 AS T2 ON(T1.Partnumber = T2.Partnumber);

    To get the approximate result set I use the query from a couple posts back:

    SELECT T1.*, T3.*, master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) AS percentage

    FROM Table_1 AS T1

    INNER JOIN Table_2 AS T3 ON (master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) >= 80)

    WHERE NOT EXISTS (SELECT T2.Partnumber FROM Table_2 AS T2 WHERE T2.Partnumber= T1.Partnumber);

    This is working, except that it is matching on records that are already matched exactly in the first result set from the query above. Like I said in my previous post. In the example below, you see a result set that is exactly matched on partnumbers. The second result set are approximate matches on the descriptions. However we don't want to match descriptions on records that are already EXACTLY matched in the first query. So that means we don't want to match partnumber 105 from the first table against partnumber 101 from the second table even though they have the exact same description. With the query from above I do get a match between partnumber 105 from Table 1 and 101 from Table 2, this is exactly what I want to prevent. Because partnumber 101 from the second table already has a match with partnumber 101 from the first table. I'll hope you understand it 😛

    -- sample table 1

    DROP TABLE #Table_1

    CREATE TABLE #Table_1 (Partnumber INT, Description VARCHAR(20))

    INSERT INTO #Table_1 (Partnumber, Description)

    VALUES (101,'QQQ313'),(102,'DFE911'),(103,'QKL721'),(104,'QQL555'),(105,'DLL323'), (121, 'QQQ313')

    -- sample table 2

    DROP TABLE #Table_2

    CREATE TABLE #Table_2 (Partnumber INT, Description VARCHAR(20))

    INSERT INTO #Table_2 (Partnumber, Description)

    VALUES (101,'QQQ313'),(102,'DFE900'),(103,'QKL721'),(104,'QLL555'),(115,'DLL323'), (125, 'QQQ313'), (189, 'QQQ312')

    -- 1. Search for exact matches on the partnumbers

    -- 2. Search for approximate match with the use of Levenshtein in the descriptions. ONLY search the non mathced records.

    -- Expected result sets:

    -- Exact result set (Partnumbers):

    -- | Partnumber T1 | Description T1 | Partnumber T2 | Description T2 |

    -- | 101 | QQQ313 | 101 | QQQ313 |

    -- | 102 | DFE911 | 102 | DFE900 |

    -- | 103 | QKL721 | 103 | QKL721 |

    -- | 104 | QQL555 | 104 | QLL555 |

    -- Approximate result set (similarity has to be higher than 70% on the descriptions):

    -- | Partnumber T1 | Description T1 | Partnumber T2 | Description T2 | Percentage |

    -- | 105 | DLL323 | 115 | DLL323 | 100 |

    -- | 121 | QQQ313 | 125 | QQQ313 | 100 |

    -- | 121 | QQQ313 | 189 | QQQ312 | 83,3 |

    -- NOTE that the levenshtein should not be executed on part number 101, as this is already matched on the first result set!

  • Okay, think I get it. Does this match your output requirement?

    SELECT

    t1.*,

    oa.*

    FROM #Vergelijkingstabel_1 t1

    OUTER APPLY (

    SELECT t3.*, x.[Score %]

    FROM #Vergelijkingstabel_2 t3

    CROSS APPLY dbo.IF_Levenshtein02 (t3.Omschrijving, t1.Omschrijving) x

    WHERE NOT EXISTS (SELECT 1 FROM #Vergelijkingstabel_1 m WHERE m.Bestelnummer = t3.Bestelnummer)

    AND x.[Score %]>= 80

    ) oa

    WHERE NOT EXISTS (SELECT 1 FROM #Vergelijkingstabel_2 m WHERE m.Bestelnummer = t1.Bestelnummer)

    UNION ALL

    SELECT t1.*, t2.*, NULL

    FROM #Vergelijkingstabel_1 t1

    INNER JOIN #Vergelijkingstabel_2 t2 ON t2.Bestelnummer = t1.Bestelnummer

    ORDER BY 1

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Yes this should be it, thanks :). I will be working from here:-D

Viewing 11 posts - 16 through 25 (of 25 total)

You must be logged in to reply to this topic. Login to reply