January 19, 2014 at 8:22 am
Hey,
I have a scalar function that calculates the similarity of two strings. I use the following query to compare the entries of one table against the value 'Test' and return the entries that have a value > 50:
;WITH cte1 AS (
SELECT b.FirstName,
(SELECT fn_similarity('Test', b.FirstName)) AS [Value],
b.LastName
FROM [AdventureWorks2012].[Person].[Person] b
)
SELECT *
FROM cte1
WHERE [Value] > 50.00
ORDER BY [Value] DESC
Now I want to run this query against the first 50 entries of the [Person] table, so that the result set includes the values of the first 50 persons together with the entries that are similar to them.
At the moment I use a WHILE loop and write the 50 individual result sets into a temporary table. Is there another / better way, maybe via a join?
Best regards
January 19, 2014 at 10:57 am
-- you have to test each row against all the other rows in the table i.e. CROSS JOIN:
;WITH [50RandomRows] AS (SELECT TOP 50 * FROM [AdventureWorks2012].[Person].[Person] ORDER BY NEWID())
SELECT
a.BusinessEntityID,
a.FirstName,
b.BusinessEntityID,
b.FirstName,
x.[Value]
FROM [50RandomRows] a
CROSS JOIN [50RandomRows] b
CROSS APPLY (
SELECT fn_similarity(a.FirstName, b.FirstName) AS [Value] -- compare the two rows' names, not the literal 'Test'
) x
WHERE a.BusinessEntityID <> b.BusinessEntityID -- don't match to the same row
AND x.[Value] > 50.00
An inline table-valued function is likely to perform considerably better than this UDF.
For better assistance in answering your questions, please read this.
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop -- Jeff Moden
January 19, 2014 at 11:48 am
Perfect, thank you very much, that helped a lot!
Best regards!
January 19, 2014 at 1:16 pm
You're welcome. If you've run the code you will have noticed that you get duplicates - the same name pair appears twice in the result set, a.firstname + b.firstname and b.firstname + a.firstname.
If you change this
WHERE a.BusinessEntityID <> b.BusinessEntityID
to this
WHERE a.BusinessEntityID > b.BusinessEntityID
you will eliminate the dupes, and make the query a little faster too. It's always going to be expensive (read "slow"), and you would be well advised, as I mentioned earlier, to change the UDF to an iTVF. Post up the function code if you're not sure how to do this.
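As a quick aside, the halving effect of swapping `<>` for `>` is easy to see with a toy example outside of SQL. This little Python sketch (purely illustrative, with made-up IDs) enumerates the pairs each predicate would produce:

```python
ids = [1, 2, 3, 4]

# a.BusinessEntityID <> b.BusinessEntityID: every ordered pair,
# so each name pair shows up twice, as (a, b) and as (b, a)
with_dupes = [(a, b) for a in ids for b in ids if a != b]

# a.BusinessEntityID > b.BusinessEntityID: each unordered pair once,
# halving the number of similarity calls
deduped = [(a, b) for a in ids for b in ids if a > b]

print(len(with_dupes))  # 12
print(len(deduped))     # 6
```

Both predicates cover the same set of name pairs; the `>` version just visits each pair once.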
January 26, 2014 at 7:31 am
Thank you very much ChrisM@home, that helped me a lot!
Best regards!
January 26, 2014 at 11:40 am
vip.blade (1/19/2014)
I have a scalar function, which calculates the similarity of two strings.
Hi.
Would you post that function, please? I could use something like that and I might be able to turn it into an iTVF for you.
Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 27, 2014 at 8:43 am
There are some points that confuse me.
You say you need the first 50 entries from your table. How do you define the first 50? What order do you have? Why don't you have a TOP 50? Would those be the first 50 with a value greater than 50, or only the rows with a value greater than 50 within the first 50 rows?
And what confuses me more: why do you use a subquery to call the scalar function instead of something as simple as:
SELECT b.FirstName,
fn_similarity('Test', b.FirstName) AS [Value],
b.LastName
FROM [AdventureWorks2012].[Person].[Person] b
It won't help the performance but it's easier to read.
I support the idea of sharing the code of the function to help you to improve it.
January 27, 2014 at 12:31 pm
Hi,
thanks for your fast response! It's a function to calculate the Levenshtein-Distance from this article: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
As far as I know there is no way to transform it into an iTVF, and the only way to get a faster calculation is to use a CLR. ChrisM@home helped me a lot with his approach and it is far better than a WHILE loop.
@Luis Cazares: to get the first 50 rows I used / use a CTE (this was not visible in my first post, sorry)
Best regards!
January 27, 2014 at 1:23 pm
vip.blade (1/27/2014)
Hi,thanks for your fast response! It's a function to calculate the Levenshtein-Distance from this article: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
As far as I know there is no way to transform it into an iTVF and the only way to get a faster calculation is to use a clr. ChrisM@home helped me a lot with his approach and it is far better than a while loop.
@luis Cazares: to get the first 50 rows I used / use a cte (this was not visible in my first post, sorry)
Best regards!
That's an amazing, comprehensive, detailed work of art by Thomas Keller!
I spent a good few hours working through the article and attempting to write an iTVF equivalent, before realising that it was turning into a diminishing-returns exercise. This is what I came up with, more or less. You will need to tinker with [placeholder for any other shortcuts] using LIKE to compare reference and target. In practice, it was more than good enough for matching and acceptable for performance. Have fun and - if you can - post up any improvements.
USE [Matching]
GO
/****** Object: UserDefinedFunction [dbo].[IF_Levenshtein02] Script Date: 27/01/2014 20:12:53 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- this will score around 10,000 word pairs per second on 2010 laptop technology
CREATE FUNCTION [dbo].[IF_Levenshtein02] -- scripted as ALTER; use CREATE on a fresh database
(
@Reference VARCHAR(20), @Target VARCHAR(20)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
( -- output query
SELECT [Score %] = CASE
WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))
WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2))-- placeholder for any other shortcuts
ELSE
(SELECT
[Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))
FROM ( -- do
SELECT
seq = t1.n,
ref.Letter,
v.WordLength,
LetterScore = v.WordLength - ISNULL(MIN(tgt.n),v.WordLength)
FROM ( -- v
SELECT
Reference = LEFT(@Reference + REPLICATE('_',WordLength),WordLength),
Target = LEFT(@Target + REPLICATE('_',WordLength),WordLength),
WordLength = WordLength
FROM ( -- di
SELECT WordLength = MAX(WordLength)
FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)
) di
) v
CROSS APPLY ( -- t1
SELECT TOP(WordLength) n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
) t1
CROSS APPLY (SELECT Letter = SUBSTRING(Reference,t1.n,1)) ref
OUTER APPLY ( -- tgt
SELECT TOP(WordLength) n = ABS(t1.n - t2.n)
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter
) tgt
GROUP BY t1.n, ref.Letter, v.WordLength
) do
)
END
) -- output query
January 27, 2014 at 2:28 pm
Hi,
thanks for your fast response! It's a function to calculate the Levenshtein-Distance from this article: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
As far as I know there is no way to transform it into an iTVF and the only way to get a faster calculation is to use a clr. ChrisM@home helped me a lot with his approach and it is far better than a while loop.
@luis Cazares: to get the first 50 rows I used / use a cte (this was not visible in my first post, sorry)
First and foremost, the link you posted to Thomas Keller's article is how to calculate the Damerau-Levenshtein Distance. The Damerau-Levenshtein Distance (DLD) is a more complex version of the Levenshtein Distance. I mention that so that everyone is on the same page about what algorithm we're talking about.
The main difference between the two is that the Levenshtein Distance (LD) calculates how many inserts, deletes, or character substitutions it would take to transform one word into another. DLD measures the same things but considers a transposition of two adjacent characters a single edit operation. For example, the LD between their and thier is 2 (change e to i, and i to e [or vice versa]); the DLD between their and thier, however, is 1 (you can switch [or transpose] i & e). DLD is a more powerful string metric, but it's also much more complex and slower.
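To make that difference concrete, here is a minimal Python sketch (not part of the thread's T-SQL, just an illustration) of plain Levenshtein next to the restricted Damerau-Levenshtein (optimal string alignment) variant, which adds the adjacent-transposition case:

```python
def levenshtein(a: str, b: str) -> int:
    """Edits (insert, delete, substitute) needed to turn a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # substitute
    return dp[m][n]

def damerau_levenshtein(a: str, b: str) -> int:
    """Like levenshtein(), but an adjacent transposition counts as one edit
    (the restricted / optimal-string-alignment variant)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)  # transpose
    return dp[m][n]

print(levenshtein("their", "thier"))          # 2: substitute e->i and i->e
print(damerau_levenshtein("their", "thier"))  # 1: transpose i and e
```

Note that Keller's article implements the full DLD; the transposition shortcut above is the simpler restricted form, which agrees with it on the their/thier example.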
I mention this because, depending on how accurate your metric needs to be, the Levenshtein Distance may also be an option. There are a number of options for LD. My favorite is a CLR you can use that is super-easy to set up called mdq.Similarity. Using steps 1 & 2 in this article you can create the assembly.
CREATE FUNCTION mdq.[Similarity](@input1 [nvarchar](4000), @input2 [nvarchar](4000), @method [tinyint], @containmentBias [float], @minScoreHint [float])
RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
AS
EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]
GO
This CLR ships with SQL Server 2008 R2 & 2012 Master Data Services/Data Quality Services.
With mdq.Similarity you get: Levenshtein, the Jaccard similarity coefficient, a form of the Jaro-Winkler distance and the longest common subsequence algorithm. Here's a good example of mdq.Similarity in use.
Regarding a Levenshtein iTVF:
I tried for many months to do a loopless Levenshtein and failed. It was a worthwhile endeavor because I learned a ton about SQL, T-SQL, adjacency matrices and math. I can't even think about a loopless Damerau-Levenshtein.
-- Itzik Ben-Gan 2001
January 27, 2014 at 2:34 pm
ChrisM@home (1/27/2014)
I did not see this when I posted my response. This is really amazing work Chris!
January 28, 2014 at 6:17 am
Thanks Alan. I should point out that it's quite a long way from being a full Levenshtein implementation! I don't think it would be impossible - just too costly.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden