January 19, 2014 at 8:22 am
Hey,
I have a scalar function that calculates the similarity of two strings. I use the following query to compare the entries of one table against the value 'Test' and return the entries that have a value > 50:
;WITH cte1 AS (
SELECT b.FirstName,
(SELECT fn_similarity('Test', b.FirstName)) AS [Value],
b.LastName
FROM [AdventureWorks2012].[Person].[Person] b
)
SELECT *
FROM cte1
WHERE [Value] > 50.00
ORDER BY [Value] DESC
Now I want to run this query against the first 50 entries of the [Person] table, so that the result set includes the values of the first 50 persons together with the entries that are similar to them.
At the moment I use a WHILE loop and write the 50 individual result sets into a temporary table. Is there another / better way, maybe via a join?
Best regards
January 19, 2014 at 10:57 am
-- you have to test each row against all the other rows in the table i.e. CROSS JOIN:
;WITH [50RandomRows] AS (SELECT TOP 50 * FROM [AdventureWorks2012].[Person].[Person] ORDER BY NEWID())
SELECT
a.BusinessEntityID,
a.FirstName,
b.BusinessEntityID,
b.FirstName,
x.[Value]
FROM [50RandomRows] a
CROSS JOIN [50RandomRows] b
CROSS APPLY (
SELECT fn_similarity(a.FirstName, b.FirstName) AS [Value] -- compare the two rows' names, not the literal 'Test'
) x
WHERE a.BusinessEntityID <> b.BusinessEntityID -- don't match to the same row
AND x.[Value] > 50.00
An inline table-valued function is likely to perform considerably better than this UDF.
For better assistance in answering your questions, please read this.
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop -- Jeff Moden
January 19, 2014 at 11:48 am
Perfect, thank you very much, that helped a lot!
Best regards!
January 19, 2014 at 1:16 pm
You're welcome. If you've run the code you will have noticed that you get duplicates - the same name pair appears twice in the result set, a.firstname + b.firstname and b.firstname + a.firstname.
If you change this
WHERE a.BusinessEntityID <> b.BusinessEntityID
to this
WHERE a.BusinessEntityID > b.BusinessEntityID
you will eliminate the dupes, and make the query a little faster too. It's always going to be expensive (read "slow"), and you would be well advised, as I mentioned earlier, to change the UDF to an iTVF. Post up the function code if you're not sure how to do this.
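As a quick aside, the halving effect of swapping `<>` for `>` is easy to see with a toy example outside of SQL. This little Python sketch (purely illustrative, with made-up IDs) enumerates the pairs each predicate would produce:

```python
ids = [1, 2, 3, 4]

# a.BusinessEntityID <> b.BusinessEntityID: every ordered pair,
# so each name pair shows up twice, as (a, b) and as (b, a)
with_dupes = [(a, b) for a in ids for b in ids if a != b]

# a.BusinessEntityID > b.BusinessEntityID: each unordered pair once,
# halving the number of similarity calls
deduped = [(a, b) for a in ids for b in ids if a > b]

print(len(with_dupes))  # 12
print(len(deduped))     # 6
```

Both predicates cover the same set of name pairs; the `>` version just visits each pair once.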
January 26, 2014 at 7:31 am
Thank you very much ChrisM@home, that helped me a lot!
Best regards!
January 26, 2014 at 11:40 am
vip.blade (1/19/2014)
I have a scalar function, which calculates the similarity of two strings.
Hi.
Would you post that function, please? I could use something like that and I might be able to turn it into an iTVF for you.
Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 27, 2014 at 8:43 am
There are some points that confuse me.
You say you need the first 50 entries from your table. How do you define the first 50? What order do you have? Why don't you have a TOP 50? Would those be the first 50 with a value greater than 50, or only the rows with a value greater than 50 within the first 50 rows?
And what confuses me more: why do you use a subquery to call the scalar function instead of something as simple as:
SELECT b.FirstName,
fn_similarity('Test', b.FirstName) AS [Value],
b.LastName
FROM [AdventureWorks2012].[Person].[Person] b
It won't help the performance but it's easier to read.
I support the idea of sharing the code of the function to help you to improve it.
January 27, 2014 at 12:31 pm
Hi,
thanks for your fast response! It's a function to calculate the Levenshtein-Distance from this article: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
As far as I know there is no way to transform it into an iTVF, and the only way to get a faster calculation is to use a CLR. ChrisM@home helped me a lot with his approach and it is far better than a WHILE loop.
@Luis Cazares: to get the first 50 rows I used / use a CTE (this was not visible in my first post, sorry)
Best regards!
January 27, 2014 at 1:23 pm
vip.blade (1/27/2014)
Hi,thanks for your fast response! It's a function to calculate the Levenshtein-Distance from this article: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
As far as I know there is no way to transform it into an iTVF and the only way to get a faster calculation is to use a clr. ChrisM@home helped me a lot with his approach and it is far better than a while loop.
@luis Cazares: to get the first 50 rows I used / use a cte (this was not visible in my first post, sorry)
Best regards!
That's an amazing, comprehensive, detailed work of art by Thomas Keller!
I spent a good few hours working through the article and attempting to write an iTVF equivalent, before realising that it was turning into a diminishing-returns exercise. This is what I came up with, more or less. You will need to tinker with [placeholder for any other shortcuts] using LIKE to compare reference and target. In practice, it was more than good enough for matching and acceptable for performance. Have fun and - if you can - post up any improvements.
USE [Matching]
GO
/****** Object: UserDefinedFunction [dbo].[IF_Levenshtein02] Script Date: 27/01/2014 20:12:53 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- this will score around 10,000 word pairs per second on 2010 laptop technology
CREATE FUNCTION [dbo].[IF_Levenshtein02] -- scripted as ALTER; use CREATE on a fresh database
(
@Reference VARCHAR(20), @Target VARCHAR(20)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
( -- output query
SELECT [Score %] = CASE
WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))
WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2))-- placeholder for any other shortcuts
ELSE
(SELECT
[Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))
FROM ( -- do
SELECT
seq = t1.n,
ref.Letter,
v.WordLength,
LetterScore = v.WordLength - ISNULL(MIN(tgt.n),v.WordLength)
FROM ( -- v
SELECT
Reference = LEFT(@Reference + REPLICATE('_',WordLength),WordLength),
Target = LEFT(@Target + REPLICATE('_',WordLength),WordLength),
WordLength = WordLength
FROM ( -- di
SELECT WordLength = MAX(WordLength)
FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)
) di
) v
CROSS APPLY ( -- t1
SELECT TOP(WordLength) n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
) t1
CROSS APPLY (SELECT Letter = SUBSTRING(Reference,t1.n,1)) ref
OUTER APPLY ( -- tgt
SELECT TOP(WordLength) n = ABS(t1.n - t2.n)
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter
) tgt
GROUP BY t1.n, ref.Letter, v.WordLength
) do
)
END
) -- output query
January 27, 2014 at 2:28 pm
Hi,
thanks for your fast response! It's a function to calculate the Levenshtein-Distance from this article: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
As far as I know there is no way to transform it into an iTVF and the only way to get a faster calculation is to use a clr. ChrisM@home helped me a lot with his approach and it is far better than a while loop.
@luis Cazares: to get the first 50 rows I used / use a cte (this was not visible in my first post, sorry)
First and foremost, the link you posted to Thomas Keller's article is how to calculate the Damerau-Levenshtein Distance. The Damerau-Levenshtein Distance (DLD) is a more complex version of the Levenshtein Distance. I mention that so that everyone is on the same page about what algorithm we're talking about.
The main difference between the two is that the Levenshtein Distance (LD) calculates how many inserts, deletes, or character substitutions it would take to transform one word into another. DLD measures the same things but considers a transposition of two adjacent characters a single edit operation. For example, the LD between their and thier is 2 (change e to i, and i to e [or vice versa]); the DLD between their and thier, however, is 1 (you can switch [or transpose] i & e). DLD is a more powerful string metric, but it's also much more complex and slower.
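To make that difference concrete, here is a minimal Python sketch (not part of the thread's T-SQL, just an illustration) of plain Levenshtein next to the restricted Damerau-Levenshtein (optimal string alignment) variant, which adds the adjacent-transposition case:

```python
def levenshtein(a: str, b: str) -> int:
    """Edits (insert, delete, substitute) needed to turn a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # substitute
    return dp[m][n]

def damerau_levenshtein(a: str, b: str) -> int:
    """Like levenshtein(), but an adjacent transposition counts as one edit
    (the restricted / optimal-string-alignment variant)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)  # transpose
    return dp[m][n]

print(levenshtein("their", "thier"))          # 2: substitute e->i and i->e
print(damerau_levenshtein("their", "thier"))  # 1: transpose i and e
```

Note that Keller's article implements the full DLD; the transposition shortcut above is the simpler restricted form, which agrees with it on the their/thier example.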
I mention this because, depending on how accurate your metric needs to be, the Levenshtein Distance may also be an option. There are a number of options for LD. My favorite is a CLR you can use that is super-easy to set up called mdq.Similarity. Using steps 1 & 2 in this article you can create the assembly.
CREATE FUNCTION mdq.[Similarity](@input1 [nvarchar](4000), @input2 [nvarchar](4000), @method [tinyint], @containmentBias [float], @minScoreHint [float])
RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
AS
EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]
GO
This CLR ships with SQL Server 2008 R2 & 2012 Master Data Services/Data Quality Services.
With mdq.Similarity you get: Levenshtein, the Jaccard similarity coefficient, a form of the Jaro-Winkler distance and the longest common subsequence algorithm. Here's a good example of mdq.Similarity in use.
Regarding a Levenshtein iTVF:
I tried for many months to do a loopless Levenshtein and failed. It was a worthwhile endeavor because I learned a ton about SQL, T-SQL, adjacency matrices and math. I can't even think about a loopless Damerau-Levenshtein.
-- Itzik Ben-Gan 2001
January 27, 2014 at 2:34 pm
ChrisM@home (1/27/2014)
I did not see this when I posted my response. This is really amazing work Chris!
January 28, 2014 at 6:17 am
Thanks Alan. I should point out that it's quite a long way from being a full Levenshtein implementation! I don't think it would be impossible - just too costly.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden