Why Scalar Functions Can Be Costly

Question

Why Scalar Functions Can Be Costly

Mickey Stuewe

SSC Eights!

Points: 879
More actions
January 3, 2016 at 11:24 pm

#310582

Comments posted to this topic are about the item Why Scalar Functions Can Be Costly
Mickey Stuewe
Sr Database Developer
My blog
Follow me on twitter: @SQLMickey
Connect with me on LinkedIn
--------------------------------------------------------------------------
I laugh loudly and I laugh often. Just ask anyone who knows me.

Viewing 15 posts - 1 through 15 (of 54 total)

You must be logged in to reply to this topic. Login to reply

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 1

Thank you Mickey for this article, good Monday morning reading.

The part I find hard to agree with is "When to Use Scalar Functions", particularly when the example given can easily be converted into a much more efficient iTVFN with hardly any changes in the syntax. In my opinion, the only time scalar UDF should be used is when that is the only choice, such as on calculated columns etc..

😎

The following example shows that the iTVFN version of the function is roughly 15 times faster than the scalar version, the former adding roughly 0.0006 milliseconds per row while the latter adds 0.009 milliseconds per row on my old i5 laptop. In other words the iTVFN takes 0.6 seconds on a 1,000,000 row sample set and the scalar 9 seconds on the same set.

Test data

USE tempdb;

GO

SET NOCOUNT ON;

IF OBJECT_ID(N'dbo.USP_CREATE_PERSON_ADDRESS') IS NOT NULL DROP PROCEDURE dbo.USP_CREATE_PERSON_ADDRESS;

GO

CREATE PROCEDURE dbo.USP_CREATE_PERSON_ADDRESS

(

@SAMPLE_SIZE INT

,@MSGLANGID SMALLINT = 1033 -- Default to English

)

AS

/******************************************************************************

2015 Eirikur Eiriksson

---------------------------------------

Create "realistic" sample data set (SDP_ID, FirstName, LastName, Address)

in any of the languages found in sys.syslanguages.

---------------------------------------

Parameters

1. @SAMPLE_SIZE INT, number of rows produced, max 10^9

2. @MSGLANGID SMALLINT, msglangid of the language to use from the

sys.syslanguages table.

---------------------------------------

Note: to view the applicable languages and corresponding msglangid run the

following query.

---------------------------------------

SELECT

SL.alias

,SL.msglangid

FROM sys.syslanguages SL;

---------------------------------------

SELECT DISTINCT

SL.msglangid

FROM sys.syslanguages SL;

---------------------------------------

Output structure:

---------------------------------------

SDP_ID BIGINT NOT NULL

FirstName NVARCHAR(50) NOT NULL

LastName NVARCHAR(50) NOT NULL

[Address] NVARCHAR(100) NOT NULL

---------------------------------------

*******************************************************************************/

WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

, NUMS(N) AS (SELECT TOP (@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

,NAME_STRING(SXNAME) AS

(

SELECT

(SELECT

REPLACE( -- 32

REPLACE( -- 34 "

REPLACE( -- 37 %

REPLACE( -- 39 '

REPLACE( -- 40 (

REPLACE( -- 41 )

REPLACE( -- 42 *

REPLACE( -- 44 ,

REPLACE( -- 45 -

REPLACE( -- 46 .

REPLACE( -- 47 /

REPLACE( -- 58 :

REPLACE( -- 59 ;

REPLACE( -- 61 =

REPLACE( -- 64 @

REPLACE( -- 95 _

REPLACE( -- 124 |

SM.text

,NCHAR(32),N'') -- 32

,NCHAR(34),N'') -- 34 "

,NCHAR(37),N'') -- 37 %

,NCHAR(39),N'') -- 39 '

,NCHAR(40),N'') -- 40 (

,NCHAR(41),N'') -- 41 )

,NCHAR(42),N'') -- 42 *

,NCHAR(44),N'') -- 44 ,

,NCHAR(45),N'') -- 45 -

,NCHAR(46),N'') -- 46 .

,NCHAR(47),N'') -- 47 /

,NCHAR(58),N'') -- 58 :

,NCHAR(59),N'') -- 59 ;

,NCHAR(61),N'') -- 61 =

,NCHAR(64),N'') -- 64 @

,NCHAR(95),N'') -- 95 _

,NCHAR(124),N'') -- 124 |

FROM sys.messages SM

WHERE SM.language_id = @MSGLANGID

FOR XML PATH(''), TYPE).value('.[1]','NVARCHAR(4000)') AS SXNAME

)

,BASE_DATA AS

(

SELECT

NCHAR(65 + ABS(CHECKSUM(NEWID())) % 25)

+ LOWER(SUBSTRING(NS.SXNAME,ABS(CHECKSUM(NEWID())) % 3800,5 + ABS(CHECKSUM(NEWID())) % 10)) AS FirstName

,NCHAR(65 + ABS(CHECKSUM(NEWID())) % 25)

+ LOWER(SUBSTRING(NS.SXNAME,ABS(CHECKSUM(NEWID())) % 3800,5 + ABS(CHECKSUM(NEWID())) % 10)) AS LastName

,NCHAR(65 + ABS(CHECKSUM(NEWID())) % 25)

+ LOWER(SUBSTRING(NS.SXNAME,ABS(CHECKSUM(NEWID())) % 3800,7 + ABS(CHECKSUM(NEWID())) % 10))

+ NCHAR(32) + CONVERT(NVARCHAR(4),1 + (ABS(CHECKSUM(NEWID())) % 1000),0) AS [Address]

FROM NUMS NM

CROSS APPLY NAME_STRING NS

)

SELECT

ROW_NUMBER() OVER

(

ORDER BY (SELECT NULL)

) AS SDP_ID

,BD.FirstName

,BD.LastName

,BD.Address

FROM BASE_DATA BD

;

GO

IF OBJECT_ID(N'dbo.SAMPLE_Person') IS NOT NULL DROP TABLE dbo.SAMPLE_Person;

CREATE TABLE dbo.SAMPLE_Person

(

SDP_ID BIGINT NOT NULL CONSTRAINT PK_DBO_PERSON_PERSON_ID PRIMARY KEY CLUSTERED

,FirstName NVARCHAR(50) NOT NULL

,LastName NVARCHAR(50) NOT NULL

,[Address] NVARCHAR(100) NOT NULL

);

GO

INSERT INTO dbo.SAMPLE_Person (SDP_ID,FirstName,LastName,[Address])

EXEC dbo.USP_CREATE_PERSON_ADDRESS 1000000,1033;

The two functions

USE tempdb;

GO

SET NOCOUNT ON;

IF OBJECT_ID(N'dbo.ITVFN_PROPER_CASE') IS NOT NULL DROP FUNCTION dbo.ITVFN_PROPER_CASE;

GO

CREATE FUNCTION dbo.ITVFN_PROPER_CASE

(

@INPUT NVARCHAR(50)

)

RETURNS TABLE

WITH SCHEMABINDING

AS

RETURN

SELECT

STUFF(LOWER(@INPUT),1,1,UPPER(LEFT(@INPUT,1))) AS PROPER_CASED

;

GO

IF OBJECT_ID(N'dbo.SFN_PROPER_CASE') IS NOT NULL DROP FUNCTION dbo.SFN_PROPER_CASE;

GO

CREATE FUNCTION dbo.SFN_PROPER_CASE

(

@INPUT NVARCHAR(50)

)

RETURNS NVARCHAR(50)

WITH SCHEMABINDING

AS

BEGIN

RETURN

(SELECT

STUFF(LOWER(@INPUT),1,1,UPPER(LEFT(@INPUT,1))))

;

END

GO

Test harness

DECLARE @NCHAR_BUCKET_01 NVARCHAR(50) = N'';

DECLARE @NCHAR_BUCKET_02 NVARCHAR(50) = N'';

DECLARE @TIMER TABLE (T_TEXT VARCHAR(100) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));

INSERT INTO @TIMER(T_TEXT) VALUES ('DRY RUN');

SELECT

@NCHAR_BUCKET_01 = SP.FirstName

,@NCHAR_BUCKET_02 = SP.LastName

FROM dbo.SAMPLE_Person SP;

INSERT INTO @TIMER(T_TEXT) VALUES ('DRY RUN');

INSERT INTO @TIMER(T_TEXT) VALUES ('SCALAR');

SELECT

@NCHAR_BUCKET_01 = dbo.SFN_PROPER_CASE(SP.FirstName)

,@NCHAR_BUCKET_02 = dbo.SFN_PROPER_CASE(SP.LastName)

FROM dbo.SAMPLE_Person SP;

INSERT INTO @TIMER(T_TEXT) VALUES ('SCALAR');

INSERT INTO @TIMER(T_TEXT) VALUES ('ITVFN');

SELECT

@NCHAR_BUCKET_01 = X.PROPER_CASED

,@NCHAR_BUCKET_02 = Y.PROPER_CASED

FROM dbo.SAMPLE_Person SP

CROSS APPLY dbo.ITVFN_PROPER_CASE(SP.FirstName) AS X

CROSS APPLY dbo.ITVFN_PROPER_CASE(SP.FirstName) AS Y;

INSERT INTO @TIMER(T_TEXT) VALUES ('ITVFN');

-- Test results

SELECT

T.T_TEXT

,DATEDIFF(millisecond,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION

FROM @TIMER T

GROUP BY T.T_TEXT

ORDER BY DURATION ASC;

Results

T_TEXT DURATION

---------- -----------

DRY RUN 230

ITVFN 547

SCALAR 8994

Edit: typo

bjh1977 Right there with Babe Points: 740 More actions · Answer 2

Small correction required - UDFs came in as part of 2000 didn't they?

Carlo Romagnano SSC-Insane Points: 22858 More actions · Answer 3

bjh1977 (1/4/2016)
Small correction required - UDFs came in as part of 2000 didn't they?

I think they started earlier with stored procs ('90): sql 6.0

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 4

1) UDFs existed at least as early as SQL 2000: https://technet.microsoft.com/en-us/library/aa175085(v=sql.80).aspx

2) This article fails to mention an EXCEPTIONALLY important flaw (among many others that aren't covered such as preventing the acquisition of good statistics) and that is single-threading. Concatenate a string like the article says using a UDF and your big-fact-table-hitting-query is going to be running without parallelism. http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

akljfhnlaflkj SSC Guru Points: 76202 More actions · Answer 5

akljfhnlaflkj

SSC Guru

Points: 76202

January 4, 2016 at 6:28 am

#1849196

Thanks for the good article.

x SSC-Insane Points: 23660 More actions · Answer 6

I'm not understanding here, doesn't the summing operation still have to occur for each row in the "cross apply" version?

edit: Another thing that bothers me about the article is that the author implies that the "execution" count is the ultimate measure of T-SQL performance goodness, whereas, as shown by Eirikur Eiriksson, even with the "better" execution count with that second scalar UDF example, theres obviously more to the story here.

Phil Parkin SSC Guru Points: 247227 More actions · Answer 7

Eirikur's response rendered what I was going to write mostly redundant. But I do have some minor points:

1) If you 'obfuscate' something, you make it unclear or difficult to interpret. The execution plan does not do that: instead it completely hides the multi-execution issue.

2) It would be interesting to see how you created/configured that extended events session. Your 'how to identify the cost' comment made me think that you were going to find a way to identify the 'real' cost of using scalar functions within an execution plan.

3) (typo) I think it's 'formatted' rather than 'formated'.

Alan Burstein SSC Guru Points: 61152 More actions · Answer 8

TheSQLGuru (1/4/2016)
1) UDFs existed at least as early as SQL 2000: https://technet.microsoft.com/en-us/library/aa175085(v=sql.80).aspx
2) This article fails to mention an EXCEPTIONALLY important flaw (among many others that aren't covered such as preventing the acquisition of good statistics) and that is single-threading. Concatenate a string like the article says using a UDF and your big-fact-table-hitting-query is going to be running without parallelism. http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx

Both Kevin and Eirikur beat me to it. As Eirikur demonstrated, an inline table valued function is almost always the way to go. Killing the possibility of a parallel query plan is the #1 problem work scalr UDF's.

That said, great work Micky. I particularly like how you used Extended Events to show what was going on under the hood.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

lucien.jacquet SSC Enthusiast Points: 143 More actions · Answer 9

The example given selects rows based on the result of a mathematical operation performed on two columns within each row. This derived result does not exist on its own; it must be calculated for each row, so it actually seems to make sense that each row must be read. It appears that the engine then creates a tally table with the same number of rows, includes the result of the calculation as a key column, then scans this table for qualifying results.

But if the selection criteria was based on an existing column value - and if this column was properly indexed - would that not reduce the number of reads to just the number of qualifying rows? Even if the function then performed a similar computation using existing columns to derive a new result, it seems to me that the rows read should still be held to just "qualifying" rows, avoiding a full-table scan.

My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 10

lucien.jacquet (1/4/2016)
My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?

NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Phil Parkin SSC Guru Points: 247227 More actions · Answer 11

TheSQLGuru (1/4/2016)
lucien.jacquet (1/4/2016)
My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?
NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.

Not even for CHECK constraints?

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 12

Phil Parkin (1/4/2016)
TheSQLGuru (1/4/2016)
lucien.jacquet (1/4/2016)
My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?
NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.
Not even for CHECK constraints?

That's about the only exception

😎

Worse still are the multi-statement table-valued functions, in fact there is no excuse or justification for using those.

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 13

bjh1977 (1/4/2016)
Small correction required - UDFs came in as part of 2000 didn't they?

That's correct.

😎

Recce70 Say Hey Kid Points: 693 More actions · Answer 14

I remember that Simon Sabin did an article called "Scalar functions are evil", but whenever I needed to send someone a link to it I'd Google for "Simon Sabin is evil" 😛