Scalar-valued, user-defined functions (sUDFs) in T-SQL are often maligned, with people normally pointing to their lack of performance. This is a reputation that is unfortunately well-deserved. Anyone that knows me also knows of my preference for in-line, table-valued functions (iTVFs), because they’ll generally perform better. In fact, I’ve written and published articles on several utility iTVFs that are an essential part of my own, personal T-SQL tool kit.
Before anyone goes and makes generalized assertions on the performance of sUDFs, one must be familiar with one quirk of measuring their performance. This was well documented by SQL MVP Jeff Moden in his SQL Spackle article whimsically entitled How to Make Scalar UDFs Run Faster, where he doesn’t really “make them run faster,” but rather shows how to more accurately measure their true performance.
Regardless of their performance, or lack thereof, there remains some use cases where the ability to create a sUDF is essential, because no other method will truly suffice as a replacement. So today I’d like to look at a few of these cases, because perhaps you haven’t seen some of them before. For each of our use cases, we’ll construct a somewhat contrived, simple example, but where we can we’ll also try to show you something that is perhaps a little more useful in real-world practice.
An Example Table and Some Test Data
In order to expound upon our sUDF use cases, we’ll need to create a test table and populate it with some sample data.
IF OBJECT_ID(N'dbo.SampleTableforSUDFUseCases', N'U') IS NOT NULL DROP TABLE dbo.SampleTableforSUDFUseCases; CREATE TABLE dbo.SampleTableforSUDFUseCases ( ID INT PRIMARY KEY , C1 INT NOT NULL ); GO DECLARE @NoRows INT = 10000; WITH Tally ( n ) AS ( SELECT TOP ( @NoRows ) ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) ) FROM sys.all_columns a CROSS JOIN sys.all_columns b ) INSERT INTO dbo.SampleTableforSUDFUseCases ( ID , C1 ) SELECT n , 1 + ABS(CHECKSUM(NEWID())) % 50 FROM Tally; GO
This creates 10,000 rows of sample data where column C1 contains a random number between 1 and 50.
Using a sUDF in a DEFAULT CONSTRAINT
For an example of using a sUDF in a DEFAULT CONSTRAINT, we’ll first need to construct a function for this purpose. This function will calculate the statistical mode. To put it simply, the mode in a set of sample data is the most frequently occurring value.
We can construct a sUDF that calculates mode:
CREATE FUNCTION dbo.CalculateMode ( ) RETURNS INT AS BEGIN DECLARE @ModeOfC1 INT = ( SELECT TOP 1 C1 FROM ( SELECT C1 , C = COUNT(C1) FROM dbo.SampleTableforSUDFUseCases GROUP BY C1 ) a ORDER BY C DESC ); RETURN @ModeOfC1; END;
Note that it is possible that in our 10,000 sample points there might be several actual values that share the same highest frequency count, so we’ll assume that any of them can be used to represent the mode.
Let’s run a quick query to show us the current mode in our sample data (note that if you’re following along you’ll get a different value for C1 and C).
SELECT TOP 1 C1, C FROM ( SELECT C1, C=COUNT(C1) FROM dbo.SampleTableforSUDFUseCases GROUP BY C1 ) a ORDER BY C DESC; -- Results (our mode is 20 and there are 226 rows where C1 is 20) C1 C 20 226
So now let’s use our sUDF to create a named, DEFAULT CONSTRAINT for column C1.
ALTER TABLE dbo.SampleTableforSUDFUseCases ADD CONSTRAINT Tst_df_c1 DEFAULT (dbo.CalculateMode()) FOR C1; GO
We can then insert a new row and see that the result that was inserted (since it was not specified on the INSERT statement) defaulted to the current mode.
INSERT INTO dbo.SampleTableforSUDFUseCases (ID) SELECT 10001; SELECT * FROM dbo.SampleTableforSUDFUseCases WHERE ID = 10001; -- Results ID C1 10001 20
We have just demonstrated a way to set a non-deterministic DEFAULT CONSTRAINT for a column in a table.
Note that an UPDATE statement that uses DEFAULT for the value being set will also call our sUDF to assign the mode to the updated row(s). For example:
UPDATE dbo.SampleTableforSUDFUseCases SET C1 = DEFAULT WHERE ID = 10000;
Using a sUDF in a Computed Column
For our second example of using a sUDF, let’s create a FUNCTION that calculates a running total. May the gods of SQL performance forgive me this transgression, but I’ll get back to that in a minute.
IF OBJECT_ID(N'dbo.RunningTotal', N'FN') IS NOT NULL DROP FUNCTION dbo.RunningTotal; GO CREATE FUNCTION dbo.RunningTotal ( @ID INT ) RETURNS INT AS BEGIN DECLARE @RT INT = ( SELECT SUM(C1) FROM dbo.SampleTableforSUDFUseCases WHERE ID <= @ID ); RETURN @RT; END;
We can now add a computed column to our table (RT) and call this value to show the running total (ordered by the ID column).
ALTER TABLE dbo.SampleTableforSUDFUseCases ADD RT AS (dbo.RunningTotal(ID)); GO SELECT TOP 5 ID, C1, RT FROM dbo.SampleTableforSUDFUseCases ORDER BY ID; -- Results ID C1 RT 1 8 8 2 19 27 3 14 41 4 26 67 5 43 110
Here is where the pundits are gonna thrash me for using such a poor-performing method of calculating a running total! Let me state categorically and for the record, this is not how I would calculate a running total in actual practice. This is merely a demonstration of a technique (using a sUDF in a computed column). If I were to actually calculate a running total, my preference would be:
- In SQL 2012, I’d probably use the SUM operator OVER a window frame (ROWS UNBOUNDED PRECEDING).
- In earlier versions of SQL, I’d probably use a Quirky Update (despite my inset warning about it in “Filling In Missing Values Using the T-SQL Window Frame”), and all of the caveats and rules thoroughly elucidated by Jeff Moden (along with other techniques for calculating a running total) here: “Solving the Running Total and Ordinal Rank Problems.”
Just for fun though, we’re going to create another FUNCTION, this time a schema bound iTVF that will also calculate a running total.
IF OBJECT_ID (N'dbo.RunningTotal2', N'TVF') IS NOT NULL DROP FUNCTION dbo.RunningTotal2; GO CREATE FUNCTION dbo.RunningTotal2(@ID INT) RETURNS TABLE WITH SCHEMABINDING RETURN SELECT RT2=SUM(C1) FROM dbo.SampleTableforSUDFUseCases WHERE ID <= @ID;
The reason we’re going to do this (besides for fun) is to demonstrate the supposition that sUDFs perform more slowly than an equivalent iTVF. Try the following test harness.
DECLARE @RunningTotal BIGINT = 0 ,@StartDT DATETIME = GETDATE() ,@DumpVar BIGINT; SELECT @DumpVar = RT2 FROM dbo.SampleTableforSUDFUseCases CROSS APPLY dbo.RunningTotal2(ID); SELECT ElapsedMSiTVF = DATEDIFF(millisecond, @StartDT, GETDATE()); SELECT @StartDT = GETDATE(); SELECT @DumpVar = RT FROM dbo.SampleTableforSUDFUseCases; SELECT ElapsedMSsUDF = DATEDIFF(millisecond, @StartDT, GETDATE());
When I ran this I got the following results on my 10,000 (or so) rows of data.
ElapsedMSiTVF 6676
ElapsedMSsUDF 7436
This demonstrates not only that the iTVF version of our running totals function is faster, but also the correct way to calculate timing results for a sUDF (as noted again by Jeff Moden in the article I mentioned in the second paragraph).
I recently ran across a really interesting case for using a sUDF in a computed column. SQL MVP Steve Jones, who’s probably better known as the editor of the SQL Server Central web site, recently posted to his Voice of the DBA blog site this article: “T-SQL Tuesday #65 – Learning Computed Columns for XML.”
In that short blog, Steve shows how you can create a sUDF to use in a computed column, which shreds a bit of information out of an XML column. Rather than reproduce that here, I’ll let you read his blog. Suffice it to say, I thought that was a pretty darned cool use case!
Using a sUDF in a CHECK CONSTRAINT
For our third and final use case for sUDFs, let’s create one more function.
IF OBJECT_ID (N'dbo.PctVarianceFromMedian', N'FN') IS NOT NULL DROP FUNCTION dbo.PctVarianceFromMedian; GO CREATE FUNCTION dbo.PctVarianceFromMedian(@C INT) RETURNS FLOAT AS BEGIN DECLARE @Median FLOAT = ( SELECT AVG(0.+b.C1) FROM ( SELECT c1=(c+1)/2 ,c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE (c+1)/2 END FROM ( SELECT c=COUNT(*) FROM dbo.SampleTableforSUDFUseCases ) a ) a JOIN ( SELECT C1, rn=ROW_NUMBER() OVER (ORDER BY C1) FROM dbo.SampleTableforSUDFUseCases ) b ON b.rn BETWEEN a.c1 AND a.c2 ); RETURN 100.*ABS(@C-@Median)/@Median; END
This function first calculates the median for our sample data using a technique I describe in An Even Faster Method of Calculating the Median on a Partitioned Heap. Note that this might not be the fastest way to calculate median for this INDEXing case (see Best approaches for grouped median by SQL MVP Aaron Bertrand for other ways), but it will suffice for our purposes.
After calculating the median, it compares this against the argument to the function doing a percentage over/under type calculation. We’ll now set up a CHECK CONSTRAINT on C1 that will ensure that any new value inserted for C1 is within ±10% of the calculated median.
ALTER TABLE dbo.SampleTableforSUDFUseCases ADD CONSTRAINT Tst_ck_InsertedValueExceedsVarianceFromMedian CHECK (10>dbo.PctVarianceFromMedian(C1)); GO
We can run a script to calculate the median four our sample data, using the same technique:
SELECT Median=AVG(0.+b.C1) FROM ( SELECT c1=(c+1)/2 ,c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE (c+1)/2 END FROM ( SELECT c=COUNT(*) FROM dbo.SampleTableforSUDFUseCases ) a ) a JOIN ( SELECT C1, rn=ROW_NUMBER() OVER (ORDER BY C1) FROM dbo.SampleTableforSUDFUseCases ) b ON b.rn BETWEEN a.c1 AND a.c2; -- Results Median 26.000000
Now we’ll try to insert a new sample data item (12) that is outside of the allowable 10% range (calculated as: 100*ABS(12-26)/26 = 53.8%), and we’ll get an error indicating that the insertion violates our newly imposed CHECK CONSTRAINT.
INSERT dbo.SampleTableforSUDFUseCases (ID, C1) VALUES (10002, 12); -- Resulting error The ALTER TABLE statement conflicted with the CHECK constraint "Tst_ck_InsertedValueExceedsVarianceFromMedian". The conflict occurred in database "YOUR-DATABASE", table "dbo.SampleTableforSUDFUseCases", column 'C1'.
So using sUDFs we can impose more elaborate checking on the data we insert into table columns.
I once saw a very interesting use case for this approach having to do with hierarchies that are stored in adjacency lists. I’ll apologize in advance for not having a link to the original source for this.
To demonstrate this case, we’ll refer you to yet another article by Jeff Moden called Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets. If you run the stored procedure he provides there named BuildSmallEmployeeTable, it builds a small adjacency list hierarchy, which once you select results from it you’ll see this.
SELECT EmployeeID, ManagerID, EmployeeName FROM dbo.Employee; -- Results EmployeeID ManagerID EmployeeName 2 26 Lynne 3 26 Bob 6 17 Eric 7 3 Vivian 8 3 Bill 12 8 Megan 14 8 Kim 17 2 Butch 18 39 Lisa 20 3 Natalie 21 39 Homer 26 NULL Jim 39 26 Ken 40 26 Marge
I’ll also refer you now to the very nice graphic Jeff provides in his article showing the four levels that this hierarchy/adjacency list represents.
In any adjacency list that you intend for any useful purpose, when inserting or updating nodes you must ensure that you do not introduce a cyclic reporting relationship. In other words (for example) if manager A has employee B, employee B cannot also be the manager of A. That’s a pretty simple cycle but they can become much more convoluted the deeper your hierarchy becomes.
So let’s create a sUDF that will (hopefully) check for cyclic reporting relationships. Since it utilizes a recursive Common Table Expression (rCTE), you may want to use the MAXRECURSION OPTION if you’ve got a hierarchy that is more than 100 levels deep (the default).
CREATE FUNCTION dbo.CheckManagerCycle (@managerID INT) RETURNS INT AS BEGIN DECLARE @cycleExists INT = 0; WITH RecurseHierarchy as ( SELECT EmployeeID, ManagerID, EmployeeName FROM dbo.Employee WHERE EmployeeID = @managerID UNION ALL SELECT a.EmployeeID, a.ManagerID, a.EmployeeName FROM dbo.Employee a JOIN RecurseHierarchy b ON b.ManagerID = a.EmployeeID and a.EmployeeID <> @managerID ) SELECT @cycleExists = count(*) FROM RecurseHierarchy WHERE ManagerID = @managerID; RETURN @cycleExists; END
We can then call this sUDF in the following CHECK CONSTRAINT, where WITH CHECK causes it to run for all data currently saved in the table.
ALTER TABLE dbo.Employee WITH CHECK ADD CONSTRAINT ManagerCycleCheck CHECK ((dbo.CheckManagerCycle(ManagerID)=0)); GO
Since executing the above did not fail, we’ll assume the hierarchy has no cyclic reporting relationships at the outset.
Looking again at Jeff’s nice hierarchy graphic, we see that if we change Bob’s (whose EmployeeID = 3) manager from Jim (EmployeeID = 26) to Megan (EmployeeID = 12), we’ll introduce a cycle that should not be allowed to occur because Megan already reports to Bob through Bill (EmployeeID = 8).
In SQL terms, that change (UPDATE) is shown below.
UPDATE dbo.Employee SET ManagerID = 12 WHERE EmployeeID = 3; -- Resulting error The UPDATE statement conflicted with the CHECK constraint "ManagerCycleCheck". The conflict occurred in database " YOUR-DATABASE", table "dbo.Employee", column 'ManagerID'.
The failure of this seems to verify that this function works as advertised.
At this point I need to let you know that I have not thoroughly tested this sUDF/CHECK CONSTRAINT against all possible operations that could cause a cyclic reporting relationship, as I would if I was going to introduce it into a production system. However I will say that it does seem to work correctly for the limited cases that I have tested. Certainly it worked for our example case. So before you use it, be sure to run it through its paces using your adjacency list hierarchy.
Still, it does make for quite an interesting use case for a sUDF in a CHECK CONSTRAINT.
Learning Recap and Conclusions
Today we have learned that sUDFs can be used in DEFAULT CONSTRAINTs, CHECK CONSTRAINTs and to calculate the value for a computed column. In fact, using a sUDF for these purposes is about the only way you’ll be able to introduce rather complex calculations for any of those scenarios.
While performance of sUDFs may not be as swift as for other methods, if they are your only recourse to get the job done, then you probably won’t want to hesitate to use them.
I’d also like to acknowledge and thank SQL MVPs Jeff Moden, Aaron Bertrand and Steve Jones for their prior work, which greatly helped in the development of this article. I pride myself on learning best of breed solutions from the best of the best.
You may need to clean up your sandbox if you’ve been following along, so here is a SQL script you can use to do that (except for the Employee table, where Jeff’s article will take care of you).
DROP FUNCTION dbo.RunningTotal2; ALTER TABLE dbo.SampleTableforSUDFUseCases DROP CONSTRAINT Tst_df_c1; ALTER TABLE dbo.SampleTableforSUDFUseCases DROP COLUMN RT; DROP FUNCTION dbo.RunningTotal; DROP FUNCTION dbo.PctVarianceFromMedian; DROP FUNCTION dbo.CalculateMode; DROP TABLE dbo.SampleTableforSUDFUseCases;
Follow me on Twitter: @DwainCSQL and check out my blog: SQL Tips for the Up and Comers
© Copyright Dwain Camps 21 Apr 2015. All rights reserved.