January 3, 2016 at 11:24 pm
Comments posted to this topic are about the item Why Scalar Functions Can Be Costly
Mickey Stuewe
Sr Database Developer
My blog
Follow me on twitter: @SQLMickey
Connect with me on LinkedIn
--------------------------------------------------------------------------
I laugh loudly and I laugh often. Just ask anyone who knows me.
January 4, 2016 at 2:01 am
Thank you Mickey for this article, good Monday morning reading.
The part I find hard to agree with is "When to Use Scalar Functions", particularly when the example given can easily be converted into a much more efficient iTVFN with hardly any changes in the syntax. In my opinion, the only time scalar UDF should be used is when that is the only choice, such as on calculated columns etc..
😎
The following example shows that the iTVFN version of the function is roughly 15 times faster than the scalar version, the former adding roughly 0.0006 milliseconds per row while the latter adds 0.009 milliseconds per row on my old i5 laptop. In other words the iTVFN takes 0.6 seconds on a 1,000,000 row sample set and the scalar 9 seconds on the same set.
Test data
USE tempdb;
GO
SET NOCOUNT ON;
IF OBJECT_ID(N'dbo.USP_CREATE_PERSON_ADDRESS') IS NOT NULL DROP PROCEDURE dbo.USP_CREATE_PERSON_ADDRESS;
GO
CREATE PROCEDURE dbo.USP_CREATE_PERSON_ADDRESS
(
@SAMPLE_SIZE INT
,@MSGLANGID SMALLINT = 1033 -- Default to English
)
AS
/******************************************************************************
2015 Eirikur Eiriksson
---------------------------------------
Create "realistic" sample data set (SDP_ID, FirstName, LastName, Address)
in any of the languages found in sys.syslanguages.
---------------------------------------
Parameters
1. @SAMPLE_SIZE INT, number of rows produced, max 10^9
2. @MSGLANGID SMALLINT, msglangid of the language to use from the
sys.syslanguages table.
---------------------------------------
Note: to view the applicable languages and corresponding msglangid run the
following query.
---------------------------------------
SELECT
SL.alias
,SL.msglangid
FROM sys.syslanguages SL;
---------------------------------------
SELECT DISTINCT
SL.msglangid
FROM sys.syslanguages SL;
---------------------------------------
Output structure:
---------------------------------------
SDP_ID BIGINT NOT NULL
FirstName NVARCHAR(50) NOT NULL
LastName NVARCHAR(50) NOT NULL
[Address] NVARCHAR(100) NOT NULL
---------------------------------------
*******************************************************************************/
WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
, NUMS(N) AS (SELECT TOP (@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)
,NAME_STRING(SXNAME) AS
(
SELECT
(SELECT
REPLACE( -- 32
REPLACE( -- 34 "
REPLACE( -- 37 %
REPLACE( -- 39 '
REPLACE( -- 40 (
REPLACE( -- 41 )
REPLACE( -- 42 *
REPLACE( -- 44 ,
REPLACE( -- 45 -
REPLACE( -- 46 .
REPLACE( -- 47 /
REPLACE( -- 58 :
REPLACE( -- 59 ;
REPLACE( -- 61 =
REPLACE( -- 64 @
REPLACE( -- 95 _
REPLACE( -- 124 |
SM.text
,NCHAR(32),N'') -- 32
,NCHAR(34),N'') -- 34 "
,NCHAR(37),N'') -- 37 %
,NCHAR(39),N'') -- 39 '
,NCHAR(40),N'') -- 40 (
,NCHAR(41),N'') -- 41 )
,NCHAR(42),N'') -- 42 *
,NCHAR(44),N'') -- 44 ,
,NCHAR(45),N'') -- 45 -
,NCHAR(46),N'') -- 46 .
,NCHAR(47),N'') -- 47 /
,NCHAR(58),N'') -- 58 :
,NCHAR(59),N'') -- 59 ;
,NCHAR(61),N'') -- 61 =
,NCHAR(64),N'') -- 64 @
,NCHAR(95),N'') -- 95 _
,NCHAR(124),N'') -- 124 |
FROM sys.messages SM
WHERE SM.language_id = @MSGLANGID
FOR XML PATH(''), TYPE).value('.[1]','NVARCHAR(4000)') AS SXNAME
)
,BASE_DATA AS
(
SELECT
NCHAR(65 + ABS(CHECKSUM(NEWID())) % 25)
+ LOWER(SUBSTRING(NS.SXNAME,ABS(CHECKSUM(NEWID())) % 3800,5 + ABS(CHECKSUM(NEWID())) % 10)) AS FirstName
,NCHAR(65 + ABS(CHECKSUM(NEWID())) % 25)
+ LOWER(SUBSTRING(NS.SXNAME,ABS(CHECKSUM(NEWID())) % 3800,5 + ABS(CHECKSUM(NEWID())) % 10)) AS LastName
,NCHAR(65 + ABS(CHECKSUM(NEWID())) % 25)
+ LOWER(SUBSTRING(NS.SXNAME,ABS(CHECKSUM(NEWID())) % 3800,7 + ABS(CHECKSUM(NEWID())) % 10))
+ NCHAR(32) + CONVERT(NVARCHAR(4),1 + (ABS(CHECKSUM(NEWID())) % 1000),0) AS [Address]
FROM NUMS NM
CROSS APPLY NAME_STRING NS
)
SELECT
ROW_NUMBER() OVER
(
ORDER BY (SELECT NULL)
) AS SDP_ID
,BD.FirstName
,BD.LastName
,BD.Address
FROM BASE_DATA BD
;
GO
IF OBJECT_ID(N'dbo.SAMPLE_Person') IS NOT NULL DROP TABLE dbo.SAMPLE_Person;
CREATE TABLE dbo.SAMPLE_Person
(
SDP_ID BIGINT NOT NULL CONSTRAINT PK_DBO_PERSON_PERSON_ID PRIMARY KEY CLUSTERED
,FirstName NVARCHAR(50) NOT NULL
,LastName NVARCHAR(50) NOT NULL
,[Address] NVARCHAR(100) NOT NULL
);
GO
INSERT INTO dbo.SAMPLE_Person (SDP_ID,FirstName,LastName,[Address])
EXEC dbo.USP_CREATE_PERSON_ADDRESS 1000000,1033;
The two functions
USE tempdb;
GO
SET NOCOUNT ON;
IF OBJECT_ID(N'dbo.ITVFN_PROPER_CASE') IS NOT NULL DROP FUNCTION dbo.ITVFN_PROPER_CASE;
GO
CREATE FUNCTION dbo.ITVFN_PROPER_CASE
(
@INPUT NVARCHAR(50)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT
STUFF(LOWER(@INPUT),1,1,UPPER(LEFT(@INPUT,1))) AS PROPER_CASED
;
GO
IF OBJECT_ID(N'dbo.SFN_PROPER_CASE') IS NOT NULL DROP FUNCTION dbo.SFN_PROPER_CASE;
GO
CREATE FUNCTION dbo.SFN_PROPER_CASE
(
@INPUT NVARCHAR(50)
)
RETURNS NVARCHAR(50)
WITH SCHEMABINDING
AS
BEGIN
RETURN
(SELECT
STUFF(LOWER(@INPUT),1,1,UPPER(LEFT(@INPUT,1))))
;
END
GO
Test harness
DECLARE @NCHAR_BUCKET_01 NVARCHAR(50) = N'';
DECLARE @NCHAR_BUCKET_02 NVARCHAR(50) = N'';
DECLARE @timer TABLE (T_TEXT VARCHAR(100) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));
INSERT INTO @timer(T_TEXT) VALUES ('DRY RUN');
SELECT
@NCHAR_BUCKET_01 = SP.FirstName
,@NCHAR_BUCKET_02 = SP.LastName
FROM dbo.SAMPLE_Person SP;
INSERT INTO @timer(T_TEXT) VALUES ('DRY RUN');
INSERT INTO @timer(T_TEXT) VALUES ('SCALAR');
SELECT
@NCHAR_BUCKET_01 = dbo.SFN_PROPER_CASE(SP.FirstName)
,@NCHAR_BUCKET_02 = dbo.SFN_PROPER_CASE(SP.LastName)
FROM dbo.SAMPLE_Person SP;
INSERT INTO @timer(T_TEXT) VALUES ('SCALAR');
INSERT INTO @timer(T_TEXT) VALUES ('ITVFN');
SELECT
@NCHAR_BUCKET_01 = X.PROPER_CASED
,@NCHAR_BUCKET_02 = Y.PROPER_CASED
FROM dbo.SAMPLE_Person SP
CROSS APPLY dbo.ITVFN_PROPER_CASE(SP.FirstName) AS X
CROSS APPLY dbo.ITVFN_PROPER_CASE(SP.FirstName) AS Y;
INSERT INTO @timer(T_TEXT) VALUES ('ITVFN');
-- Test results
SELECT
T.T_TEXT
,DATEDIFF(millisecond,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION
FROM @timer T
GROUP BY T.T_TEXT
ORDER BY DURATION ASC;
Results
T_TEXT DURATION
---------- -----------
DRY RUN 230
ITVFN 547
SCALAR 8994
Edit: typo
January 4, 2016 at 2:33 am
Small correction required - UDFs came in as part of 2000 didn't they?
January 4, 2016 at 6:13 am
bjh1977 (1/4/2016)
Small correction required - UDFs came in as part of 2000 didn't they?
I think they started earlier with stored procs ('90): sql 6.0
January 4, 2016 at 6:20 am
1) UDFs existed at least as early as SQL 2000: https://technet.microsoft.com/en-us/library/aa175085(v=sql.80).aspx
2) This article fails to mention an EXCEPTIONALLY important flaw (among many others that aren't covered such as preventing the acquisition of good statistics) and that is single-threading. Concatenate a string like the article says using a UDF and your big-fact-table-hitting-query is going to be running without parallelism. http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
January 4, 2016 at 6:28 am
Thanks for the good article.
January 4, 2016 at 6:35 am
I'm not understanding here, doesn't the summing operation still have to occur for each row in the "cross apply" version?
edit: Another thing that bothers me about the article is that the author implies that the "execution" count is the ultimate measure of T-SQL performance goodness, whereas, as shown by Eirikur Eiriksson, even with the "better" execution count with that second scalar UDF example, theres obviously more to the story here.
January 4, 2016 at 6:59 am
Eirikur's response rendered what I was going to write mostly redundant. But I do have some minor points:
1) If you 'obfuscate' something, you make it unclear or difficult to interpret. The execution plan does not do that: instead it completely hides the multi-execution issue.
2) It would be interesting to see how you created/configured that extended events session. Your 'how to identify the cost' comment made me think that you were going to find a way to identify the 'real' cost of using scalar functions within an execution plan.
3) (typo) I think it's 'formatted' rather than 'formated'.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 4, 2016 at 7:04 am
TheSQLGuru (1/4/2016)
1) UDFs existed at least as early as SQL 2000: https://technet.microsoft.com/en-us/library/aa175085(v=sql.80).aspx2) This article fails to mention an EXCEPTIONALLY important flaw (among many others that aren't covered such as preventing the acquisition of good statistics) and that is single-threading. Concatenate a string like the article says using a UDF and your big-fact-table-hitting-query is going to be running without parallelism. http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx
Both Kevin and Eirikur beat me to it. As Eirikur demonstrated, an inline table valued function is almost always the way to go. Killing the possibility of a parallel query plan is the #1 problem work scalr UDF's.
That said, great work Micky. I particularly like how you used Extended Events to show what was going on under the hood.
-- Itzik Ben-Gan 2001
January 4, 2016 at 7:53 am
The example given selects rows based on the result of a mathematical operation performed on two columns within each row. This derived result does not exist on its own; it must be calculated for each row, so it actually seems to make sense that each row must be read. It appears that the engine then creates a tally table with the same number of rows, includes the result of the calculation as a key column, then scans this table for qualifying results.
But if the selection criteria was based on an existing column value - and if this column was properly indexed - would that not reduce the number of reads to just the number of qualifying rows? Even if the function then performed a similar computation using existing columns to derive a new result, it seems to me that the rows read should still be held to just "qualifying" rows, avoiding a full-table scan.
My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?
January 4, 2016 at 10:14 am
lucien.jacquet (1/4/2016)
My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?
NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
January 4, 2016 at 10:27 am
TheSQLGuru (1/4/2016)
lucien.jacquet (1/4/2016)
My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.
Not even for CHECK constraints?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 4, 2016 at 10:39 am
Phil Parkin (1/4/2016)
TheSQLGuru (1/4/2016)
lucien.jacquet (1/4/2016)
My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.
Not even for CHECK constraints?
That's about the only exception
😎
Worse still are the multi-statement table-valued functions, in fact there is no excuse or justification for using those.
January 4, 2016 at 10:41 am
bjh1977 (1/4/2016)
Small correction required - UDFs came in as part of 2000 didn't they?
That's correct.
😎
January 4, 2016 at 10:43 am
I remember that Simon Sabin did an article called "Scalar functions are evil", but whenever I needed to send someone a link to it I'd Google for "Simon Sabin is evil" 😛
Viewing 15 posts - 1 through 15 (of 53 total)
You must be logged in to reply to this topic. Login to reply