January 4, 2016 at 11:16 am
Eirikur Eiriksson (1/4/2016)
Worse still are the multi-statement table-valued functions, in fact there is no excuse or justification for using those.
I agree with this 99.9999% of the time but I did find one exception here... For the Longest Common Substring; I've been playing around with this for a couple years and here's my iTVF solution:
CREATE FUNCTION dbo.NGrams8K (@string varchar ( 8000), @n int )
/********************************************************************
Created by: Alan Burstein
Created on: 3/10/2014
Last Updated on: 09/09/2015
n-gram defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items from a given
sequence of text or speech. The items can be phonemes, syllables,
letters, words or base pairs according to the application.
For more information see: http://en.wikipedia.org/wiki/N-gram
Use:
Outputs a stream of tokens based on an input string.
Similar to mdq.nGrams:
http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
Except it only returns characters as long as K.
nGrams8K also includes the position of the "Gram" in the string.
Revision History:
Rev 00 - 03/10/2014 Initial Development - Alan Burstein
Rev 01 - 05/22/2015 Removed DQS N-Grams functionality,
improved iTally - Alan Burstein
Rev 02 - 05/22/2015 Changed TOP logic to remove implicit conversion
- Alan Burstein
Rev 03 - 9/9/2015 Added logic to only return values if @n is greater
than 0 and less then length of @string - Alan Burstein
********************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1( N) AS
(
SELECT 1
FROM ( VALUES
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
) t( N)
),
iTally ( N) AS
(
SELECT TOP ( CONVERT(BIGINT,(DATALENGTH(@string)-(@n-1)),0))
ROW_ NUMBER( ) OVER ( ORDER BY ( SELECT NULL))
FROM L1 a CROSS JOIN L1 b -- add two more cross joins to support varchar(max)
)
SELECT
position = N,
token = SUBSTRING(@ string,N,@n )
FROM iTally
WHERE @n > 0 AND @n <= DATALENGTH ( @string);
GO
/********************************************************************
Created by: Alan Burstein
Created on: 3/10/2013
Last Updated on: 09/20/2015
Use:
Returns the longest common substring between two strings.
Revision History:
Rev 00 - 03/10/2013 Initial Development - Alan Burstein
Rev 03 - 09/20/2015 Performance tuned using NGrams8K - Alan Burstein
********************************************************************/
CREATE FUNCTION dbo.LCSS8K
(
@string1 varchar (8000),
@string2 varchar (8000)
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
Strings AS
(
SELECT
String1 = CASE WHEN LEN (@string1)>LEN(@string2) THEN @string1 ELSE @string2 END,
String2 = CASE WHEN LEN (@string1)>LEN(@string2) THEN @string2 ELSE @string1 END
),
I( N) AS (SELECT position FROM Strings CROSS APPLY dbo.NGrams8K(String2,1))
SELECT TOP (1) WITH TIES
TokenLength = I.N,
NG.Token
FROM I CROSS APPLY Strings s CROSS APPLY dbo.NGrams8K( String2,I.N) NG
WHERE CHARINDEX ( NG.token,String1) > 0
ORDER BY N DESC;
GO
Phil Factor has a mTVF solution here: https://www.simple-talk.com/blogs/2014/12/20/string-comparisons-in-sql-the-longest-common-substring/[/url]
My solution is much, much faster than Phil's for shorter strings but slows down as the strings get longer with the tipping point being somewhere around 100 characters. The problem with the iTVF version is that, without an updateable variable, I have to use a more brute-force approach. I can't beat the mTVF in this case for longer strings (yet). For me, that's been the 0.0001% exception.
-- Itzik Ben-Gan 2001
January 4, 2016 at 11:21 am
Alan.B (1/4/2016)
--For me, that's been the 0.0001% exception.
Wow. That suggests you have written/tested 10,000 (or more) TVFs :w00t:
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
January 4, 2016 at 11:30 am
TheSQLGuru (1/4/2016)
lucien.jacquet (1/4/2016)
My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.
I'm guessing that if you have to buy a book to learn why UDF's are bad, folks will skip buying the book and use UDF's. Its a shame the information isn't more readily available.
edit: at least the article author in this post has tried to make a case against them and this article is freely available so that's a plus!
edit 2: I found an old comment that I think might offer a possible explanation regarding why UDF's are inherently slow once the parallelism killing and table scans / UDF predicate smashing is taken out of consideration, but its more of a general comment about RDBMS implementation, the comment is at the bottom of the thread offered up by Serge Rielau.
https://bytes.com/topic/sql-server/answers/81173-why-scalar-functions-soooo-sloooow
edit 3: Here's another great article on UDF's by one of our locals 🙂 What I like about Hugo's article is that he makes a direct comparison using an expression both inside and outside of a UDF and its easier to zero in on the costs of changing an expression directly into a UDF.
January 4, 2016 at 12:08 pm
Phil Parkin (1/4/2016)
Alan.B (1/4/2016)
--For me, that's been the 0.0001% exception.
Wow. That suggests you have written/tested 10,000 (or more) TVFs :w00t:
I've been busy 😉
-- Itzik Ben-Gan 2001
January 4, 2016 at 12:17 pm
patrickmcginnis59 10839 (1/4/2016)
TheSQLGuru (1/4/2016)
lucien.jacquet (1/4/2016)
My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.
I'm guessing that if you have to buy a book to learn why UDF's are bad, folks will skip buying the book and use UDF's. Its a shame the information isn't more readily available.
edit: at least the article author in this post has tried to make a case against them and this article is freely available so that's a plus!
edit 2: I found an old comment that I think might offer a possible explanation regarding why UDF's are inherently slow once the parallelism killing and table scans / UDF predicate smashing is taken out of consideration, but its more of a general comment about RDBMS implementation, the comment is at the bottom of the thread offered up by Serge Rielau.
https://bytes.com/topic/sql-server/answers/81173-why-scalar-functions-soooo-sloooow
It is a shame that there's not more about the pitfalls of scalar UDFs but the content is out there.
Jeff Moden did this great article: http://www.sqlservercentral.com/articles/T-SQL/91724/[/url]
Do a google search for "Paul White Scalar Function" you will find a couple good articles. Some articles with links to other good articles in the comment section.
It's great that Micky did this article because I think there should be more content out there.
-- Itzik Ben-Gan 2001
January 4, 2016 at 1:34 pm
Thanks for the article, Mickey.
Have you tested your scalar UDF using schemabinding? We're testing that now to see if we can bring some of the cost down, but so far the improvement seems negligible. (a recent change to a sproc to use UDF to modularize a chunk of code that returns patient int age on date and used in WHERE criteria seems to have tanked efficiency). I'm just curious if you had any considerations for schemabinding.
I came across this last night:
We're most likely going to use your suggestion with tvf with cross apply...the case you've laid out is compelling.
Thanks again!
January 4, 2016 at 1:56 pm
winkl (1/4/2016)
Thanks for the article, Mickey.Have you tested your scalar UDF using schemabinding? We're testing that now to see if we can bring some of the cost down, but so far the improvement seems negligible. (a recent change to a sproc to use UDF to modularize a chunk of code that returns patient int age on date and used in WHERE criteria seems to have tanked efficiency). I'm just curious if you had any considerations for schemabinding.
I came across this last night:
We're most likely going to use your suggestion with tvf with cross apply...the case you've laid out is compelling.
Thanks again!
Adding Schemabinding is a good practice but it does not fix the other problems with scalar UDFs such as preventing parallel query plans. Using inline table valued functions is the way to go for sure. The key is Inline tvf's multi-line tvf's are generally terrible.
Here's a good article about the topic of changing scalar udfs to iTVFs: http://www.sqlservercentral.com/Forums/Topic1338291-203-1.aspx
-- Itzik Ben-Gan 2001
January 4, 2016 at 2:02 pm
Alan.B (1/4/2016)
Eirikur Eiriksson (1/4/2016)
Worse still are the multi-statement table-valued functions, in fact there is no excuse or justification for using those.I agree with this 99.9999% of the time but I did find one exception here... For the Longest Common Substring; I've been playing around with this for a couple years and here's my iTVF solution:
CREATE FUNCTION dbo.NGrams8K (@string varchar ( 8000), @n int )
/********************************************************************
Created by: Alan Burstein
Created on: 3/10/2014
Last Updated on: 09/09/2015
n-gram defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items from a given
sequence of text or speech. The items can be phonemes, syllables,
letters, words or base pairs according to the application.
For more information see: http://en.wikipedia.org/wiki/N-gram
Use:
Outputs a stream of tokens based on an input string.
Similar to mdq.nGrams:
http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
Except it only returns characters as long as K.
nGrams8K also includes the position of the "Gram" in the string.
Revision History:
Rev 00 - 03/10/2014 Initial Development - Alan Burstein
Rev 01 - 05/22/2015 Removed DQS N-Grams functionality,
improved iTally - Alan Burstein
Rev 02 - 05/22/2015 Changed TOP logic to remove implicit conversion
- Alan Burstein
Rev 03 - 9/9/2015 Added logic to only return values if @n is greater
than 0 and less then length of @string - Alan Burstein
********************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1( N) AS
(
SELECT 1
FROM ( VALUES
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
) t( N)
),
iTally ( N) AS
(
SELECT TOP ( CONVERT(BIGINT,(DATALENGTH(@string)-(@n-1)),0))
ROW_ NUMBER( ) OVER ( ORDER BY ( SELECT NULL))
FROM L1 a CROSS JOIN L1 b -- add two more cross joins to support varchar(max)
)
SELECT
position = N,
token = SUBSTRING(@ string,N,@n )
FROM iTally
WHERE @n > 0 AND @n <= DATALENGTH ( @string);
GO
/********************************************************************
Created by: Alan Burstein
Created on: 3/10/2013
Last Updated on: 09/20/2015
Use:
Returns the longest common substring between two strings.
Revision History:
Rev 00 - 03/10/2013 Initial Development - Alan Burstein
Rev 03 - 09/20/2015 Performance tuned using NGrams8K - Alan Burstein
********************************************************************/
CREATE FUNCTION dbo.LCSS8K
(
@string1 varchar (8000),
@string2 varchar (8000)
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
Strings AS
(
SELECT
String1 = CASE WHEN LEN (@string1)>LEN(@string2) THEN @string1 ELSE @string2 END,
String2 = CASE WHEN LEN (@string1)>LEN(@string2) THEN @string2 ELSE @string1 END
),
I( N) AS (SELECT position FROM Strings CROSS APPLY dbo.NGrams8K(String2,1))
SELECT TOP (1) WITH TIES
TokenLength = I.N,
NG.Token
FROM I CROSS APPLY Strings s CROSS APPLY dbo.NGrams8K( String2,I.N) NG
WHERE CHARINDEX ( NG.token,String1) > 0
ORDER BY N DESC;
GO
Phil Factor has a mTVF solution here: https://www.simple-talk.com/blogs/2014/12/20/string-comparisons-in-sql-the-longest-common-substring/[/url]
My solution is much, much faster than Phil's for shorter strings but slows down as the strings get longer with the tipping point being somewhere around 100 characters. The problem with the iTVF version is that, without an updateable variable, I have to use a more brute-force approach. I can't beat the mTVF in this case for longer strings (yet). For me, that's been the 0.0001% exception.
Pop over a test harness Alan and we can work on eliminating that 0.0001%, I'll dig out the improvements I did with the likes of implementing and improving Boyer–Moore etc. in T-SQL.
😎
January 4, 2016 at 2:13 pm
Eirikur Eiriksson (1/4/2016)
Alan.B (1/4/2016)
Eirikur Eiriksson (1/4/2016)
Worse still are the multi-statement table-valued functions, in fact there is no excuse or justification for using those.I agree with this 99.9999% of the time but I did find one exception here... For the Longest Common Substring; I've been playing around with this for a couple years and here's my iTVF solution:
CREATE FUNCTION dbo.NGrams8K (@string varchar ( 8000), @n int )
/********************************************************************
Created by: Alan Burstein
Created on: 3/10/2014
Last Updated on: 09/09/2015
n-gram defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items from a given
sequence of text or speech. The items can be phonemes, syllables,
letters, words or base pairs according to the application.
For more information see: http://en.wikipedia.org/wiki/N-gram
Use:
Outputs a stream of tokens based on an input string.
Similar to mdq.nGrams:
http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
Except it only returns characters as long as K.
nGrams8K also includes the position of the "Gram" in the string.
Revision History:
Rev 00 - 03/10/2014 Initial Development - Alan Burstein
Rev 01 - 05/22/2015 Removed DQS N-Grams functionality,
improved iTally - Alan Burstein
Rev 02 - 05/22/2015 Changed TOP logic to remove implicit conversion
- Alan Burstein
Rev 03 - 9/9/2015 Added logic to only return values if @n is greater
than 0 and less then length of @string - Alan Burstein
********************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1( N) AS
(
SELECT 1
FROM ( VALUES
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
) t( N)
),
iTally ( N) AS
(
SELECT TOP ( CONVERT(BIGINT,(DATALENGTH(@string)-(@n-1)),0))
ROW_ NUMBER( ) OVER ( ORDER BY ( SELECT NULL))
FROM L1 a CROSS JOIN L1 b -- add two more cross joins to support varchar(max)
)
SELECT
position = N,
token = SUBSTRING(@ string,N,@n )
FROM iTally
WHERE @n > 0 AND @n <= DATALENGTH ( @string);
GO
/********************************************************************
Created by: Alan Burstein
Created on: 3/10/2013
Last Updated on: 09/20/2015
Use:
Returns the longest common substring between two strings.
Revision History:
Rev 00 - 03/10/2013 Initial Development - Alan Burstein
Rev 03 - 09/20/2015 Performance tuned using NGrams8K - Alan Burstein
********************************************************************/
CREATE FUNCTION dbo.LCSS8K
(
@string1 varchar (8000),
@string2 varchar (8000)
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
Strings AS
(
SELECT
String1 = CASE WHEN LEN (@string1)>LEN(@string2) THEN @string1 ELSE @string2 END,
String2 = CASE WHEN LEN (@string1)>LEN(@string2) THEN @string2 ELSE @string1 END
),
I( N) AS (SELECT position FROM Strings CROSS APPLY dbo.NGrams8K(String2,1))
SELECT TOP (1) WITH TIES
TokenLength = I.N,
NG.Token
FROM I CROSS APPLY Strings s CROSS APPLY dbo.NGrams8K( String2,I.N) NG
WHERE CHARINDEX ( NG.token,String1) > 0
ORDER BY N DESC;
GO
Phil Factor has a mTVF solution here: https://www.simple-talk.com/blogs/2014/12/20/string-comparisons-in-sql-the-longest-common-substring/[/url]
My solution is much, much faster than Phil's for shorter strings but slows down as the strings get longer with the tipping point being somewhere around 100 characters. The problem with the iTVF version is that, without an updateable variable, I have to use a more brute-force approach. I can't beat the mTVF in this case for longer strings (yet). For me, that's been the 0.0001% exception.
Pop over a test harness Alan and we can work on eliminating that 0.0001%, I'll dig out the improvements I did with the likes of implementing and improving Boyer–Moore etc. in T-SQL.
😎
Excellent. Sorry that I did not have the test harness handy - it's on another system I don't have access to. I'll post that up here later tonight when I get home. I been banging my head trying to beat that mTVF for several months :hehe:
-- Itzik Ben-Gan 2001
January 4, 2016 at 2:18 pm
You are absolutely correct here. A test I just completed using an iTVF instead of the scalar function (source table(s) ~110M records) has turned a 8:42 duration to :03. I'll dump proccache at maintenance window just to make sure, but this looks extremely promising.
Learning is fun. I heart this forum.
January 4, 2016 at 2:27 pm
Alan.B (1/4/2016)
I been banging my head trying to beat that mTVF for several months :hehe:
Anything to ease the pain cause that got to hurt:-D
😎
January 4, 2016 at 4:41 pm
Hi Everyone,
Thank you for all the comments and questions. I'm swamped at work today, so I will reply this evening when I get home (California time).
Cheers!
Mickey
Mickey Stuewe
Sr Database Developer
My blog
Follow me on twitter: @SQLMickey
Connect with me on LinkedIn
--------------------------------------------------------------------------
I laugh loudly and I laugh often. Just ask anyone who knows me.
January 4, 2016 at 7:02 pm
Not even for Check Constraints since you can get bad data. Search for sql server udf check constraint bug and you can go here: http://sqlblog.com/blogs/tibor_karaszi/archive/2009/12/17/be-careful-with-constraints-calling-udfs.aspx
Addressing someone else's comment about WITH SCHEMABINDING: that is a MUST, especially if you use UDFs in UPDATES. You can avoid a nasty table spool put in place to prevent the Halloween Problem in some cases. I recommend using SCHEMABINDING all the time it is an option though - build a box around the user, and in this case the user is YOU, the developer. 🙂
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
January 4, 2016 at 7:39 pm
i believe database are set-oriented, not record-oriented. i found many app developers fundamentally are record-oriented. for just a few records, their record-oriented operations run fine until the day records reach their performance limit. most cases include 1. testing in a staging system (QA/DEV) where only a few records exists while millions exist in the production or 2. starting in a new feature where new data is just MBs until months pass by it become GBs. therefore, to avoid surprises, we analyze and reserve set-intensive challenges to only set-oriented developers. the remains are for any body. unnessessarily using scarlar functions is one of the popular issues from record-oriented thinking :-P, don't you think
January 4, 2016 at 8:25 pm
Hi Eirikur,
Thank you for the SQL Example. I agree whole heartedly with you. This article is one of three articles. The first, focusing on User Defined Scalar Functions and how you can write them a different way. I didn't want to show all the various ways to write them though.
Cheers!
Mickey
Mickey Stuewe
Sr Database Developer
My blog
Follow me on twitter: @SQLMickey
Connect with me on LinkedIn
--------------------------------------------------------------------------
I laugh loudly and I laugh often. Just ask anyone who knows me.
Viewing 15 posts - 16 through 30 (of 53 total)
You must be logged in to reply to this topic. Login to reply