November 7, 2014 at 6:53 am
mike--p (11/7/2014)
...Yes my join was correct, here the results from your query. As you can see it scans through all the columns (plus one too much highligthed in red). I don't see why I would use this over the query I posted? Thanks 😛
...
Your join cannot have been correct to display the results shown in your post 😛
The query you posted
SELECT T1.*, T3.* FROM table_1 AS T1
LEFT JOIN table_2 AS T2 ON (T1.[column_T1] = T2.[column_T2)
LEFT JOIN table_2 AS T2_2 ON (Levenshtein(T1.[column_T1], T2_2.[column_T2]) > percentage
reads table 1 once and table 2 twice as a crossjoin (6000x6000 = 36000000 rows for your full set), calculating each match twice. My query reads each table only once and performs the match only once, using a triangular join.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 7, 2014 at 8:15 am
To get all the results with your queries I should do both T1.rij <= T2.rij and T2.rij <= T1.rij. If I only do the first I won't get 'AQQ313' as result, due to its higher row number. If I do the T2.rij <= T1.rij than I indeed get 'AQQ313' as result :-).
I see that my query is reading table2 twice, but how is it possible my query runs twice as fast as yours?
Besides that I get too much results back from your query because it is matching with the levenshtein function even if the partnumbers are exactly the same between the two files.
Again what I'm trying to do is only match(with the use of levenshtein) on the description whenever NO match was found on the part numbers. That is why I thought I had to use two JOINS. The first JOIN searches for exact matches on part numbers. If no result is found you get a NULL returned. The second JOIN is then executed whenever a null is found in the previous join.
Again if I have these two tables (highlighted in red should be the match):
To filter out the exact matches I'll first do a check on the partnumbers (note: the user should chose this column. The more 100% matches between the columns, the less levenshtein calculations will be done).
This filtering I do with a single LEFT JOIN. The result set of this single LEFT join looks like this, as expected only the object with partnumber 105 isn't having an exact match, thus the right side is having a NULL as value:
Then the second LEFT JOIN is again using the second table (2nd time scanning this table?), but now the condition depends on the similarity between the description strings. The second JOIN is only executed whenever the previous JOIN has a value of NULL, thus is only executed on partnumber 105. The result looks like the following:
And the query once more :-D:
SELECT T1.*, T3.*, master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) AS percentage FROM Vergelijkingstabel_1 AS T1
LEFT join Vergelijkingstabel_2 AS T2 ON (T1.Partnumber = T2.Partnumber)
LEFT join Vergelijkingstabel_2 AS T3 ON (master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) >= 80)
WHERE T2.Partnumber IS NULL;
If I use your query I get way too much results as I'm not filtering the partnumbers and directly search on the descriptions, the query: SELECT *, master.dbo.fn_Levenshtein(T1.Description, T2.Description, 80) AS [Percentage(%)]
FROM (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Description) FROM Vergelijkingstabel_1) AS T1
INNER JOIN (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Description) FROM Vergelijkingstabel_2) AS T2
ON T1.rij <= T2.rij WHERE master.dbo.fn_Levenshtein(T1.Description, T2.Description, 80) >= 80;
And the result set:
Maybe I'm just not understanding you correctly, haha! :blush:
November 7, 2014 at 9:33 am
-- Yes, you are still missing the point I'm trying to make.
-- Also, your strange query is equivalent to this
SELECT
T1.*,
T3.*,
master.dbo.fn_Levenshtein(T1.Omschrijving, T3.Omschrijving, 80) AS percentage
FROM Vergelijkingstabel_1 AS T1
LEFT JOIN Vergelijkingstabel_2 AS T3 ON (master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) >= 80)
WHERE NOT EXISTS (SELECT 1 FROM Vergelijkingstabel_2 AS T2 WHERE T2.Bestelnummer = T1.Bestelnummer)
-- I think you want to omit comparisons where Bestelnummer matches between the two tables. Instead, you're omitting rows
-- from Vergelijkingstabel_2 if the Bestelnummer exists anywhere in Vergelijkingstabel_1.
-- Are you sure this is what you intend?
-- Try this (if you can provide sample tables I could try it for you)
-- Measure the highest value of Bestelnummer in both tables.
-- If the highest value is in T3, use this,
-- where x is the name of the column returned by master.dbo.fn_Levenshtein
SELECT
T1.*,
T3.*,
ou.x AS percentage
FROM #Vergelijkingstabel_1 AS T1
OUTER APPLY (
SELECT T3.*, x
FROM #Vergelijkingstabel_2 AS T3
CROSS APPLY master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80)
WHERE T3.Bestelnummer > T1.Bestelnummer AND x >= 80
) ou
-- If the highest value is in T1, use this:
SELECT
T1.*,
T3.*,
ou.x AS percentage
FROM #Vergelijkingstabel_1 AS T1
OUTER APPLY (
SELECT T3.*, x
FROM #Vergelijkingstabel_2 AS T3
CROSS APPLY master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80)
WHERE T3.Bestelnummer < T1.Bestelnummer AND x >= 80
) ou
-- An example of sample tables
DROP TABLE #Vergelijkingstabel_1
CREATE TABLE #Vergelijkingstabel_1 (Bestelnummer INT, Omschrijving VARCHAR(20))
INSERT INTO #Vergelijkingstabel_1 (Bestelnummer, Omschrijving)
VALUES (101,'ABN101'),(102,'ZBN'),(103,'KDN154'),(104,'ADN235'),(105,'QND999'),(110,'QQQ313'),(112,'ABL513')
-- 7 rows
DROP TABLE #Vergelijkingstabel_2
CREATE TABLE #Vergelijkingstabel_2 (Bestelnummer INT, Omschrijving VARCHAR(20))
INSERT INTO #Vergelijkingstabel_2 (Bestelnummer, Omschrijving)
VALUES (100,'QQQ312'),(101,'ABN101'),(102,'ZBN103'),(103,'KDN154'),(104,'ADN235'),(106,'QND999'),
(365,'AAA555'),(34343,'QND999'),(4444,'QND999'),(111,'ZQQ319'),(11233,'AQQ313')
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 9, 2014 at 7:24 am
Hi Chris,
Sorry for the late reply!! Tommorow I can give the queries you provided a try at our database.
What I want to achieve:
- First the user has to choose two columns that have the most 100% matches (for instance the part numbers).
- Then the user has to choose two columns (for instance the descriptions) that will be matched by the levenshtein algorithm.
- So ONLY the records that has no 100% match to the second table (no part number was found in the second table) should be matched (on the description columns) with the use of the levenshtein algorithm.
Have a nice weekend :-D;-)
November 10, 2014 at 1:50 am
ChrisM@Work (11/7/2014)
-- Yes, you are still missing the point I'm trying to make.-- Try this (if you can provide sample tables I could try it for you)
-- Measure the highest value of Bestelnummer in both tables.
-- If the highest value is in T3, use this,
-- where x is the name of the column returned by master.dbo.fn_Levenshtein
SELECT
T1.*,
T3.*,
ou.x AS percentage
FROM #Vergelijkingstabel_1 AS T1
OUTER APPLY (
SELECT T3.*, x
FROM #Vergelijkingstabel_2 AS T3
CROSS APPLY master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80)
WHERE T3.Bestelnummer > T1.Bestelnummer AND x >= 80
) ou
-- If the highest value is in T1, use this:
SELECT
T1.*,
T3.*,
ou.x AS percentage
FROM #Vergelijkingstabel_1 AS T1
OUTER APPLY (
SELECT T3.*, x
FROM #Vergelijkingstabel_2 AS T3
CROSS APPLY master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80)
WHERE T3.Bestelnummer < T1.Bestelnummer AND x >= 80
) ou
The first query you posted is indeed working the same as mine. However yours makes more sense than the two joins then I came up with, thanks.
The two queries in the quote I can't get them to execute. It's saying that the function is an invalid object name.
Shouldn't the cross apply be referenced 'AS x'? Also shouldn't there be a SELECT in the cross apply (the same you did in the outer apply?).
November 10, 2014 at 2:07 am
Hi Mike
Working without sample data always comes with the risk of syntax errors - can you prepare some please?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 10, 2014 at 2:53 am
-- sample table 1
DROP TABLE #Table_1
CREATE TABLE #Table_1 (Partnumber INT, Description VARCHAR(20))
INSERT INTO #Table_1 (Partnumber, Description)
VALUES (101,'QQQ313'),(102,'DFE911'),(103,'QKL721'),(104,'QQL555'),(105,'DLL323'), (121, 'QQQ313')
-- sample table 2
DROP TABLE #Table_2
CREATE TABLE #Table_2 (Partnumber INT, Description VARCHAR(20))
INSERT INTO #Table_2 (Partnumber, Description)
VALUES (101,'QQQ313'),(102,'DFE900'),(103,'QKL721'),(104,'QLL555'),(115,'DLL323'), (125, 'QQQ313'), (189, 'QQQ312')
-- 1. Search for exact matches on the partnumbers
-- 2. Search for approximate match with the use of Levenshtein in the descriptions. ONLY search the non mathced records.
-- Expected result sets:
-- Exact result set (Partnumbers):
-- | Partnumber T1 | Description T1 | Partnumber T2 | Description T2 |
-- | 101 | QQQ313 | 101 | QQQ313 |
-- | 102 | DFE911 | 102 | DFE900 |
-- | 103 | QKL721 | 103 | QKL721 |
-- | 104 | QQL555 | 104 | QLL555 |
-- Approximate result set (similarity has to be higher than 70% on the descriptions):
-- | Partnumber T1 | Description T1 | Partnumber T2 | Description T2 | Percentage |
-- | 105 | DLL323 | 115 | DLL323 | 100 |
-- | 121 | QQQ313 | 125 | QQQ313 | 100 |
-- | 121 | QQQ313 | 189 | QQQ312 | 83,3 |
-- NOTE that the levenshtein should not be executed on part number 101, as this is already matched on the first result set!
Here you go 🙂
November 10, 2014 at 3:12 am
Are you expecting two result sets? If so, here's the second:
-- Using UDF
SELECT
t1.*,
t3.*
FROM #Vergelijkingstabel_1 t1
OUTER APPLY (
SELECT t2.*, x.percentage
FROM #Vergelijkingstabel_2 t2
CROSS APPLY (SELECT percentage = dbo.fn_Levenshtein (t1.Omschrijving, t2.Omschrijving, 80)) x
WHERE NOT EXISTS (SELECT 1 FROM #Vergelijkingstabel_1 AS tx WHERE tx.Bestelnummer = t2.Bestelnummer)
AND x.percentage>= 80
) t3
-- Using iTVF
SELECT
t1.*,
t3.*
FROM #Vergelijkingstabel_1 t1
OUTER APPLY (
SELECT t2.*, percentage = x.[Score %]
FROM #Vergelijkingstabel_2 t2
CROSS APPLY dbo.IF_Levenshtein02 (t1.Omschrijving, t2.Omschrijving) x
WHERE NOT EXISTS (SELECT 1 FROM #Vergelijkingstabel_1 AS tx WHERE tx.Bestelnummer = t2.Bestelnummer)
AND x.[Score %]>= 80
) t3
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 10, 2014 at 3:53 am
If i do this I get all the rows returned. What I want is two seperate tables. To get the first result set (exact matches on the partnumbers) I do the following:
SELECT T1.*, T2.* FROM Table_1 AS T1
INNER JOIN Table_2 AS T2 ON(T1.Partnumber = T2.Partnumber);
To get the approximate result set I use the query from a couple posts back:
SELECT T1.*, T3.*, master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) AS percentage
FROM Table_1 AS T1
INNER JOIN Table_2 AS T3 ON (master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) >= 80)
WHERE NOT EXISTS (SELECT T2.Partnumber FROM Table_2 AS T2 WHERE T2.Partnumber= T1.Partnumber);
This is working, except that it is matching on records that are already matched exactly in the first result set from the query above. Like I said in my previous post. In the example below, you see a result set that is exactly matched on partnumbers. The second result set are approximate matches on the descriptions. However we don't want to match descriptions on records that are already EXACTLY matched in the first query. So that means we don't want to match partnumber 105 from the first table against partnumber 101 from the second table even though they have the exact same description. With the query from above I do get a match between partnumber 105 from Table 1 and 101 from Table 2, this is exactly what I want to prevent. Because partnumber 101 from the second table already has a match with partnumber 101 from the first table. I'll hope you understand it 😛
-- sample table 1
DROP TABLE #Table_1
CREATE TABLE #Table_1 (Partnumber INT, Description VARCHAR(20))
INSERT INTO #Table_1 (Partnumber, Description)
VALUES (101,'QQQ313'),(102,'DFE911'),(103,'QKL721'),(104,'QQL555'),(105,'DLL323'), (121, 'QQQ313')
-- sample table 2
DROP TABLE #Table_2
CREATE TABLE #Table_2 (Partnumber INT, Description VARCHAR(20))
INSERT INTO #Table_2 (Partnumber, Description)
VALUES (101,'QQQ313'),(102,'DFE900'),(103,'QKL721'),(104,'QLL555'),(115,'DLL323'), (125, 'QQQ313'), (189, 'QQQ312')
-- 1. Search for exact matches on the partnumbers
-- 2. Search for approximate match with the use of Levenshtein in the descriptions. ONLY search the non mathced records.
-- Expected result sets:
-- Exact result set (Partnumbers):
-- | Partnumber T1 | Description T1 | Partnumber T2 | Description T2 |
-- | 101 | QQQ313 | 101 | QQQ313 |
-- | 102 | DFE911 | 102 | DFE900 |
-- | 103 | QKL721 | 103 | QKL721 |
-- | 104 | QQL555 | 104 | QLL555 |
-- Approximate result set (similarity has to be higher than 70% on the descriptions):
-- | Partnumber T1 | Description T1 | Partnumber T2 | Description T2 | Percentage |
-- | 105 | DLL323 | 115 | DLL323 | 100 |
-- | 121 | QQQ313 | 125 | QQQ313 | 100 |
-- | 121 | QQQ313 | 189 | QQQ312 | 83,3 |
-- NOTE that the levenshtein should not be executed on part number 101, as this is already matched on the first result set!
November 10, 2014 at 4:07 am
Okay, think I get it. Does this match your output requirement?
SELECT
t1.*,
oa.*
FROM #Vergelijkingstabel_1 t1
OUTER APPLY (
SELECT t3.*, x.[Score %]
FROM #Vergelijkingstabel_2 t3
CROSS APPLY dbo.IF_Levenshtein02 (t3.Omschrijving, t1.Omschrijving) x
WHERE NOT EXISTS (SELECT 1 FROM #Vergelijkingstabel_1 m WHERE m.Bestelnummer = t3.Bestelnummer)
AND x.[Score %]>= 80
) oa
WHERE NOT EXISTS (SELECT 1 FROM #Vergelijkingstabel_2 m WHERE m.Bestelnummer = t1.Bestelnummer)
UNION ALL
SELECT t1.*, t2.*, NULL
FROM #Vergelijkingstabel_1 t1
INNER JOIN #Vergelijkingstabel_2 t2 ON t2.Bestelnummer = t1.Bestelnummer
ORDER BY 1
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 12, 2014 at 12:17 am
Yes this should be it, thanks :). I will be working from here:-D
Viewing 11 posts - 16 through 25 (of 25 total)
You must be logged in to reply to this topic. Login to reply