September 28, 2015 at 10:51 pm
Kim Crosser (9/28/2015)
Jeff Moden (9/28/2015)
Kim Crosser (9/28/2015)
On the other hand, UDFs used to do some fancy formatting of an output result value aren't likely to have any significant impact on performance. Let's not guess... Post a couple such functions and let's find out.
Sometimes - "good enough" is...
...until someone uses it somewhere else and that IS not only the nature of functions, but their primary purpose.
Example where I would use a UDF for formatting - Phone numbers.
Given a table where the phone numbers are stored as:
...
CC int, -- Country Code
NPA int, -- Numbering plan area/region code
NXX varchar(9), -- Exchange code
SNUM varchar(9) not null, -- Subscriber number
EXT varchar(9), -- Extension
...
(Yes - they could be INT values, except that a lot of people and companies like to use the keyboard alpha strings instead.)
Customer wants phone numbers to display in results as:
if CC is not null and CC <> 1 ("US"), display as "+CC"
if NPA is not null, display as "(NPA)"
if NXX is not null, display as "NXX-"
display SNUM
if EXT is not null, display as " xEXT"
Yes - you can write an inline Isnull/Coalesce/Case expression to format this, and repeat it in every column in every query that accesses the data, or you can write one simple UDF.
So - which is easier to maintain?
select
...
case isnull(CC,1) when 1 then '' else '+' + convert(varchar(5),CC) + ' ' end
+ coalesce('(' + convert(varchar(5),NPA) + ') ','')
+ coalesce(NXX + '-','')
+ coalesce(SNUM,'') -- yeah, I know it is a not null column, but I am paranoid...
+ coalesce(' x' + EXT,'') as nicenum
...
or
select
...
dbo.myUDFNicePhone(CC,NPA,NXX,SNUM,EXT) as nicenum
I loaded a table with 800,000 records (with different field values) and queried it both ways multiple times. The function version took 3.4-4.5 seconds to process the 800,000 records, while the inline expression took 1.1-1.8 seconds.
Thus, the function averaged 1.6-3.4 seconds slower over 800,000 records - roughly 2 to 4.25 microseconds per record.
In real life, I have had to implement functions like this to handle multiple foreign telephone formats, where some country formats have dashes and some just spaces, and some have spaces at fixed intervals, while others can vary.
A similar issue arises with Postal (Zip) codes and formatting of address lines, which vary in interesting ways in different countries. Yes, you could write a big complex Case/Coalesce expression and then copy/paste it wherever you wanted to output these values, or you can write (and fully debug) one UDF and use it where needed. I know which of those customer systems I would rather support when some country decides to change its postal code format (like when the US went to Zip+4).
Code should be efficient and clean, but IMO replicating complex expressions in multiple locations should only be done when necessary - not just because you can squeeze a few more microseconds out of a query that is already performing well.
Ok, let's do an 800,000 row test. Keep in mind - I don't know exactly how what you posted is supposed to work so I just created some random data. Feel free to modify my test so that it's more accurate.
First the scalar and inline version of the function:
USE tempdb -- a db we all have
GO
-- Create a Scalar UDF
IF OBJECT_ID('dbo.SVF_FormatNbr') IS NOT NULL DROP FUNCTION dbo.SVF_FormatNbr
GO
CREATE FUNCTION dbo.SVF_FormatNbr
(
@CC varchar(5),
@NPA varchar(5),
@EXCH varchar(10),
@SNUM varchar(10),
@EXT varchar(10)
)
RETURNS varchar(50) AS
BEGIN
RETURN
(
SELECT
CASE isnull(@CC,1) WHEN 1 THEN '' ELSE '+' + convert(varchar(5),@CC) + ' ' END
+ coalesce('(' + convert(varchar(5),@NPA) + ') ','')
+ coalesce(@EXCH + '-','')
+ coalesce(@SNUM,'')
+ coalesce(' x' + @EXT,'') AS nicenum
);
END
GO
-- Create an iTVF
IF OBJECT_ID('dbo.iTVF_FormatNbr') IS NOT NULL DROP FUNCTION dbo.iTVF_FormatNbr
GO
CREATE FUNCTION dbo.iTVF_FormatNbr
(
@CC varchar(5),
@NPA varchar(5),
@EXCH varchar(10),
@SNUM varchar(10),
@EXT varchar(10)
)
RETURNS TABLE AS RETURN
(
SELECT
CASE isnull(@CC,1) WHEN 1 THEN '' ELSE '+' + convert(varchar(5),@CC) + ' ' END
+ coalesce('(' + convert(varchar(5),@NPA) + ') ','')
+ coalesce(@EXCH + '-','')
+ coalesce(@SNUM,'')
+ coalesce(' x' + @EXT,'') AS nicenum
);
GO
Now for some sample data:
IF OBJECT_ID('dbo.sometable') IS NOT NULL DROP TABLE dbo.sometable
GO
CREATE TABLE dbo.sometable
(
someid int identity primary key,
CC INT,
NPA INT,
EXCH varchar(10),
SNUM varchar(10),
EXT varchar(10)
)
-- Insert sample data
-- will take 5-10 seconds depending on your system (more if it stinks)
INSERT dbo.sometable(CC,NPA,EXCH,SNUM,EXT)
SELECT TOP (800000)
1000+ABS(CHECKSUM(newid())%9995),
1000+ABS(CHECKSUM(newid())%9995),
REPLACE(LEFT(CONVERT(varchar(36),newid()),6),'-',''),
REPLACE(LEFT(CONVERT(varchar(36),newid()),6),'-',''),
REPLACE(LEFT(CONVERT(varchar(36),newid()),6),'-','')
FROM sys.all_columns a, sys.all_columns b;
GO
And now the tests. For the iTVF version I tested it with a serial query plan and then with a parallel plan using Adam Machanic's make_parallel function. I obviously did not test the scalar function with a parallel plan because, as we all know, scalar functions don't do parallel.
We're using a throw-away variable because I don't care how long it takes to return the query in the GUI. (I learned this here). We ran each test 5 times....
PRINT '=== Scalar test ==='
DECLARE @st datetime = getdate(), @x varchar(50);
SELECT @x = dbo.SVF_FormatNbr(CC,NPA,EXCH,SNUM,EXT)
FROM sometable
PRINT DATEDIFF(MS,@st,getdate());
GO 5
PRINT '=== iTVF test with serial plan ==='
DECLARE @st datetime = getdate(), @x varchar(50);
SELECT @x = nicenum
FROM sometable
CROSS APPLY dbo.iTVF_FormatNbr(CC,NPA,EXCH,SNUM,EXT)
OPTION (MAXDOP 1) -- ensure that it gets a serial plan
PRINT DATEDIFF(MS,@st,getdate());
GO 5
PRINT '=== iTVF test with parallel plan ==='
DECLARE @st datetime = getdate(), @x varchar(50);
SELECT @x = nicenum
FROM sometable
CROSS APPLY dbo.iTVF_FormatNbr(CC,NPA,EXCH,SNUM,EXT)
CROSS APPLY dbo.make_parallel() -- force a parallel plan
PRINT DATEDIFF(MS,@st,getdate());
GO 5
I knew what was going to happen before I ran the test....
Beginning execution loop
=== Scalar test ===
1860
=== Scalar test ===
1880
=== Scalar test ===
1853
=== Scalar test ===
1883
=== Scalar test ===
1873
Batch execution completed 5 times.
Beginning execution loop
=== iTVF test with serial plan ===
666
=== iTVF test with serial plan ===
653
=== iTVF test with serial plan ===
666
=== iTVF test with serial plan ===
696
=== iTVF test with serial plan ===
660
Batch execution completed 5 times.
Beginning execution loop
=== iTVF test with parallel plan ===
216
=== iTVF test with parallel plan ===
210
=== iTVF test with parallel plan ===
203
=== iTVF test with parallel plan ===
206
=== iTVF test with parallel plan ===
220
Batch execution completed 5 times.
The iTVF was roughly three times faster when it got a serial plan and nine times faster with a parallel plan. :w00t:
Edit: Copy/pasted wrong test results (same results, different test titles)
-- Itzik Ben-Gan 2001
September 29, 2015 at 5:57 am
Kim Crosser (9/28/2015)
The real question is whether the UDFs are causing any performance problems. 10x sounds horrible, but in your system does that result in a query that runs 10 minutes vs. 1 minute or is it (more likely) a run-time difference measured in micro- or milliseconds?
At the very least it's a habit that I need to break. In doing so I need to decide the best alternative coding method to adopt - I'm reluctant to repeat code snippets throughout my code where I previously used a UDF to keep the code centralised.
Are the functions used in the Where clause and thus may be executed against a high percentage of the source data, or are they in the Select clause and only applied to the final result set? Or - as you noted on one of them - are they only used once per query to prep a parameter value?
I've done some analysis and the ones that are in a SELECT (e.g. to modify the data for some desirable presentation effect) are not many in number. Some of them have awful performance though; although in the context of the overall query it is "tolerable", my inclination is to program them out. What I'd really like is to modify the UDF so that I can log the fact that it ran, and thus figure out which ones are actually used. I can't think of a way to do that though ... perhaps something using SQL Profiler might catch them (at the Statement rather than the Batch level).
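(Thinking out loud: perhaps an Extended Events session could do the logging without touching the UDFs at all. This is only a sketch - the session and file names are invented, and I'd want to confirm that module_end actually fires for scalar UDF calls before trusting it.)
-- Capture each execution of a named scalar UDF.
-- NB: for a UDF called once per row this fires once per row, so keep the predicate
-- tight and don't leave the session running longer than necessary.
CREATE EVENT SESSION TrackUdfUsage ON SERVER
ADD EVENT sqlserver.module_end
    (WHERE ([object_name] = N'fnDATEROUND'))
ADD TARGET package0.event_file (SET filename = N'C:\Temp\TrackUdfUsage.xel');
GO
ALTER EVENT SESSION TrackUdfUsage ON SERVER STATE = START;
GO
-- ... run the workload, then:
ALTER EVENT SESSION TrackUdfUsage ON SERVER STATE = STOP;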
I have things like
CREATE PROCEDURE MyProc
@EndDateTime datetime
AS
...
SELECT @EndDateTime = dbo.fnDATEROUND(@EndDateTime, 1) -- dbo.fnDATEROUND(@Datetime, @AddDays)
...
SELECT Col1, Col2, ...
FROM MyTable
WHERE MyDateTime < @EndDateTime
I am happy with the performance of that and I don't think I have any/many alternative styles of:
...
WHERE MyDateTime < dbo.fnDATEROUND(@EndDateTime, 1)
and definitely none with
...
WHERE dbo.fnDATEROUND(MyDateTime, 1) <= @EndDateTime
😎
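If fnDATEROUND amounts to stripping the time portion and adding @AddDays days (an assumption - the function body isn't shown in this thread), the inline equivalent is short:
-- Inline equivalent of dbo.fnDATEROUND(@EndDateTime, 1), assuming it rounds the
-- datetime down to midnight and then adds the requested number of days:
DECLARE @EndDateTime datetime = '20150929 14:35';
SELECT DATEADD(dd, DATEDIFF(dd, 0, @EndDateTime) + 1, 0) AS RoundedEnd;  -- 2015-09-30 00:00:00.000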
As an aside I'd like to prove that by searching for
WHERE ... AnyFunction(ColumnName)
but it feels like I would need a parser, rather than just the RegEx find in my Programmer's Editor to narrow that down.
Maybe searching for rubbish query plans would help find them (but that would require that they were executed ... which feels like a sledgehammer-and-nut approach).
Hmmm ... always helpful having to type and describe stuff as it triggers thought processes ... all my column names are of the style "xxx_yyy_ColumnName" so I might actually be able to RegEx find "(xxx_yyy_". I'll report back 😉
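A rough server-side version of that search is also possible (a sketch only - it will match any "(xxx_yyy_" pattern, ISNULL and CONVERT included, not just UDF calls, but it narrows down which modules are worth opening in the editor):
-- Modules whose definition contains "(" immediately followed by the xxx_yyy_ column
-- prefix; the underscores are escaped because _ is a wildcard in LIKE.
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS SchemaName,
       OBJECT_NAME(m.object_id)        AS ModuleName
FROM   sys.sql_modules AS m
WHERE  m.definition LIKE '%(xxx\_yyy\_%' ESCAPE '\'
ORDER  BY SchemaName, ModuleName;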
Eric M Russell (9/28/2015)
computed table columns can be persisted ... referenced anywhere else without ... having the expense of re-computation.
I have tended to avoid computed columns (I have VIEWs with manipulated data) but it's a very good point so I will revisit using computed columns. One of my slow-running presentation functions is
dbo.fnFullName(@Title, @FirstName, @LastName, @intStyle)
where the @intStyle indicates if the output should be Lastname-Firstname or the other way round, and which deals with concatenating them if any part is NULL. It might even be sufficient to have computed columns for FirstNameBlankIfNull, LastNameBlankIfNull ... so that I can use a simple concatenation rather than a herd of ISNULL()s ...
That said, a preprocess could also take care of fnFullName(@Title, @FirstName, @LastName, @intStyle) provided that the @intStyle was a CONST (which I expect it is 99% of the time)
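The sort of thing I have in mind (a sketch - the table and column names are just my usual placeholders):
-- Computed columns so that a simple concatenation never goes NULL:
ALTER TABLE dbo.MyTable
    ADD FirstNameBlankIfNull AS ISNULL(MyFirstname, ''),
        LastNameBlankIfNull  AS ISNULL(MyLastname, '');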
Ed Wagner (9/28/2015)
the task will be a whole lot better now than having to do it when people are screaming about performance problems in production.
Yeah, I do take that view. When we see something that can be improved we take the time to make the change; the longer we leave it downstream, the more costly the fix becomes. We have slack built into schedules to allow for that, and we can always ditch that strategy when time is tight near the release date.
One approach we have is to add a comment to the start of a script file (e.g. for a Procedure / Function) and start the line with "zzzz" to deliberately trigger a syntax error.
To create a release script we concatenate all (relevant) scripts together into a single file and we then change all "zzzz" to "--zzzz" in order to allow it to run (we do that one-by-one in case any of them say "Don't whatever you do release this procedure in this state"!!)
So my First Step would be to put a "zzzz inline code required for fnSlowFunction" in every script that uses that function. Easy to do that globally with my programmers editor 🙂 and then we can tackle them, when time permits either sooner or later, by just searching for anything that has "zzzz inline code required". If we revisit a procedure with a ZZZZ comment sooner, for code maintenance, we will fix that as part of the other changes.
Jeff Moden (9/28/2015)
Ed Wagner (9/28/2015)
I hope it doesn't use FORMAT inside the scalar function. 😛 Now THAT would be funny!
That would be REALLY funny as most of my functions are 10 years old 😛
... of course I have loads of scalar functions calling dbo.fnKristensFormat :hehe:
patrickmcginnis59 10839 (9/28/2015)
A preprocessor would be awesome!
Donkey's years ago, when we had to do a fair bit of ETL using sequential editing of CSVs and the like, I used to use AWK (the text-file processing tool co-authored by Brian Kernighan of C fame) and SED (the Stream EDitor). I'll revisit SED as it would easily do this job, and it would mean that I could build a list of Find & Replace RegExs that could be applied to a series of files.
Way-back-then SED was available for free as part of the GNU tools; I expect it still is, and probably something more spivy exists on GitHub.
Something that SQL itself applied as part of just executing a script would be much better of course, because I'd be using the final-form code during DEV rather than only after a deliberate step in the DEV/Release process
Kim Crosser (9/28/2015)
I loaded a table with 800,000 records (with different field values) and queried it both ways multiple times. The function version took 3.4-4.5 seconds to process the 800,000 records, while the inline expression took 1.1-1.8 seconds. Thus, the function averaged 1.6-3.4 seconds slower over 800,000 records - roughly 2 to 4.25 microseconds per record.
I look at that statistic through different eyes I'm afraid:
"Using a scalar function added 2 seconds to a 3 second task", rather than microseconds-per-row.
However, I do need to test it in a complex query. e.g. where I have
SELECT ... lots of columns ...
, dbo.fnFullName(MyTitle, MyFirstname, MyLastname, 1)
FROM MyTable
... lots of JOINS ...
to see what the actual increase in runtime is.
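I'll time it the same way Alan did above, assigning to a throw-away variable so the SSMS rendering time doesn't muddy the numbers - something like this (placeholder names again, and minus the JOINs for brevity):
DECLARE @st datetime = GETDATE(), @x varchar(200);
SELECT @x = dbo.fnFullName(MyTitle, MyFirstname, MyLastname, 1)
FROM   dbo.MyTable;
PRINT  DATEDIFF(ms, @st, GETDATE());
GO 5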
But I still think that if I can easily figure out how to change such a function call to inline using a pre-processor batch process then that will be my solution. No code changes 😎
Alan.B (9/28/2015)
Ok, let's do an 800,000 row test
Very interesting, thanks.
September 29, 2015 at 6:19 am
Alan.B (9/28/2015)
Ok, let's do an 800,000 row test.
Nice test, Alan. I kinda knew how it would turn out, but it was nice to see it done.
Alan.B (9/28/2015)
And now the tests. For the iTVF version I tested it with a serial query plan and then with a parallel plan using Adam Machanic's make_parallel function.
That's something new to me, so thank you for it. I'm going to have to investigate and play around with this.
September 29, 2015 at 6:22 am
Kristen-173977 (9/29/2015)
Ed Wagner (9/28/2015)
the task will be a whole lot better now than having to do it when people are screaming about performance problems in production. Yeah, I do take that view. When we see something that can be improved we take the time to make the change; the longer we leave it downstream, the more costly the fix becomes. We have slack built into schedules to allow for that, and we can always ditch that strategy when time is tight near the release date.
Kristen, I can't tell you how relieved I am to hear that. I've seen people bitten by that before and am glad you've learned the lesson without having to endure the pain. The attitude of "hey, it works and it has to be released today; so what if it's slow" comes up. Sadly, the author isn't the one who has to deal with the performance problem - I am. :crazy:
September 29, 2015 at 11:00 am
Alan.B (9/28/2015)
Alan - minor issue, but the first two fields (CC, NPA) are actually Int values. Making them Varchar(5) introduces some additional overhead in the function, where it then needs to do some implicit conversions.
I think everyone missed the point of the 800,000 records. It isn't that you can save a couple of seconds on an 800,000 record test. It is that it took 800,000 iterations to make enough of a difference to be noticeable.
In real life, functions like this one, which apply formatting to values in the output of a SELECT statement, are applied to much smaller result sets. If that query returned 1000 records, the difference in the output times wouldn't be 2-3 seconds, it would be 3.75 milliseconds or less!
Also, that function was a very simplified version of a real one I did years ago, where it not only handled U.S. NANP formats, but also several selected foreign formats.
BTW - I do try to write all my SQL without using UDFs - except where a UDF is appropriate and makes the system more maintainable without unduly slowing the system. And I try to minimize/avoid functions of any kind in a Where (or Join) clause.
September 29, 2015 at 2:19 pm
Ed Wagner (9/29/2015)
Alan.B (9/28/2015)
Ok, let's do an 800,000 row test.
Nice test, Alan. I kinda knew how it would turn out, but it was nice to see it done.
Alan.B (9/28/2015)
And now the tests. For the iTVF version I tested it with a serial query plan and then with a parallel plan using Adam Machanic's make_parallel function.
That's something new to me, so thank you for it. I'm going to have to investigate and play around with this.
I love make_parallel, it's absolutely genius. I have found it super helpful for iTVF's that use a tally table for some kind of string manipulation. Really good stuff.
-- Itzik Ben-Gan 2001
September 29, 2015 at 2:39 pm
Kim Crosser (9/29/2015)
Alan.B (9/28/2015)
Alan - minor issue, but the first two fields (CC, NPA) are actually Int values. Making them Varchar(5) introduces some additional overhead in the function, where it then needs to do some implicit conversions.
I think everyone missed the point of the 800,000 records. It isn't that you can save a couple of seconds on an 800,000 record test. It is that it took 800,000 iterations to make enough of a difference to be noticeable.
In real life, functions like this one, which apply formatting to values in the output of a SELECT statement, are applied to much smaller result sets. If that query returned 1000 records, the difference in the output times wouldn't be 2-3 seconds, it would be 3.75 milliseconds or less!
It was a couple of seconds in that test, but hopefully you see the lesson. I just showed you how to make a query run 3-9 times faster. I aver that the performance gains will be similar regardless of how complex the scalar function is. 2 seconds is not a big deal unless it's a function/query that runs hundreds or thousands of times a day. 3-9 times faster is not a big deal until we're talking about queries that take minutes or hours. Not a week has gone by where I haven't said to my boss, PM, co-worker or client, "... that query that was running for 2 hours you wanted me to look at - I got it down to 4 seconds." What I demonstrated is one technique for accomplishing this.
Also, that function was a very simplified version of a real one I did years ago, where it not only handled U.S. NANP formats, but also several selected foreign formats.
BTW - I do try to write all my SQL without using UDFs - except where a UDF is appropriate and makes the system more maintainable without unduly slowing the system. And I try to minimize/avoid functions of any kind in a Where (or Join) clause.
What we both posted is surely a simplified version of what we all deal with. The performance gains from scalar to inline, in my experience, are more profound as the query gets more complex. I create functions often to enforce best practices and provide clients/colleagues with high-performing re-usable code so that work is not duplicated.
-- Itzik Ben-Gan 2001
September 29, 2015 at 4:22 pm
Kim Crosser (9/29/2015)
I think everyone missed the point of the 800,000 records. It isn't that you can save a couple of seconds on an 800,000 record test. It is that it took 800,000 iterations to make enough of a difference to be noticeable.
Correct. Now just imagine everything in your database running more than twice as fast. You just about can't buy a machine with such a performance improvement. As for 800,000 rows, it's been a long time since I've had to work with so few.
It takes so little time and effort to get a 100 to 60,000% (not a misprint) improvement out of code. Don't do it slow when you know how to do it fast. "Mind the pennies and the dollars will take care of themselves."
--Jeff Moden
Change is inevitable... Change for the better is not.
September 29, 2015 at 8:24 pm
Ah. My apologies, Kim... I missed what you said here...
Kim Crosser (9/29/2015)
Also, that function was a very simplified version of a real one I did years ago, where it not only handled U.S. NANP formats, but also several selected foreign formats.
BTW - I do try to write all my SQL without using UDFs - except where a UDF is appropriate and makes the system more maintainable without unduly slowing the system. And I try to minimize/avoid functions of any kind in a Where (or Join) clause.
Same here, and it sounds like we're actually cut from the same cloth. There have been places where I've been forced into making a UDF, like when people insist that they need a function for a persisted computed column. Of course, I'd rather have something like that than people using SFs or mTVFs in a JOIN or WHERE clause.
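For anyone who hasn't been down that road, the shape of it is something like this (names invented, and I've borrowed the NPA column from Alan's test table above just to have something concrete) - the function has to be deterministic and schema-bound before the computed column can be persisted:
CREATE FUNCTION dbo.fnAreaCode (@NPA int)
RETURNS varchar(12) WITH SCHEMABINDING
AS
BEGIN
    RETURN '(' + CONVERT(varchar(5), @NPA) + ')';
END
GO
ALTER TABLE dbo.sometable
    ADD NiceNPA AS dbo.fnAreaCode(NPA) PERSISTED;
-- Worth remembering: a scalar UDF in a computed column also tends to force serial
-- plans for queries against the table, which is another reason I'd rather avoid it.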
Still, whether it's a function or not, I do try to make things ultimately scalable especially if it's some form of function because such encapsulations of logic are bound to be used in places where a thousand rows would look like the head of a pin in a timber-spike factory. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
September 29, 2015 at 8:57 pm
Jeff Moden (9/29/2015)
Kim Crosser (9/29/2015)
I think everyone missed the point of the 800,000 records. It isn't that you can save a couple of seconds on an 800,000 record test. It is that it took 800,000 iterations to make enough of a difference to be noticeable. Correct. Now just imagine everything in your database running more than twice as fast. You just about can't buy a machine with such a performance improvement. As for 800,000 rows, it's been a long time since I've had to work with so few.
It takes so little time and effort to get a 100 to 60,000% (not a misprint) improvement out of code. Don't do it slow when you know how to do it fast. "Mind the pennies and the dollars will take care of themselves."
Amen. I know everyone's heard about queries that run too slowly and need to be tuned. However, I've never heard of anyone being told that their queries run too fast and they need to be slowed down.
Once you know how to make the database walk, figure out how to make it run. Next, learn how to make it fly. Then you'll be working on different speeds of flight, but you'll never want to go back to walking again.
September 30, 2015 at 12:41 am
Jeff Moden (9/29/2015)
Don't do it slow when you know how to do it fast.
+1000 (to quote your good self 😉 )
I don't want to have to revisit the code, once deployed, if I can avoid it. Put differently: if we ever have to visit code that is in production to either fix a bug or fix performance then there is blood on the carpet and a post mortem about how it happened, and what changes are needed to procedures and / or methods to prevent it happening again.
Clearly Scalar UDFs have not brought my APPs to their knees or I would have known, already, that this was a problem that needed avoiding. I hang out in forums plenty, so I'm a bit surprised that I haven't clocked this before - probably due to my earlier comment that as a C/C++ programmer turned SQL-er I just assumed that Scalar Functions (when not applied to a column in WHERE / JOIN) were trivial compared to inline code. Perhaps MS can make them trivial in the future ...
... but now I know I want to fix that because it is a smoking gun; perhaps there is only a 5% gain to be had, perhaps on certain queries it is nX - I'll take both of those, thank you very much 😉 - but more importantly, now that I know, I don't want to be wasting the 5% or the nX on any new code.
One of my #1 priorities is that we don't have a DEV schedule wrecked by having to fire-fight Production Code, particularly when data-size ramps up.
Fixing Production is hugely expensive for us (I imagine it is for everyone ...) - the time to fix is not in my schedule, the urgency of business managers saying "Company is losing money" leads to a rushed job, the fix can cause side effects elsewhere in the system, the Business Client is reluctant to do any proper testing (compared to when we are rolling out new features for them), and of course there is the actual cost of our (DEV/QA department's) testing cycle - which is only in my budget for new releases, not an emergency fix.
We have close to zero failures in Production and I'd like to keep it that way 😎
September 30, 2015 at 10:25 am
OK OK - I give!!!
I hereby take the No-UDF pledge.
From now on, I will never use a UDF. Instead, I will only write Select statements that include in-line complex expressions that may run for 20-30 lines per field until they resemble paragraphs from James Joyce's Ulysses. Maintainers may feel like they are navigating through the Colossal Cave when traversing my syntax, but I will get that last microsecond or else!
:hehe:
September 30, 2015 at 10:42 am
Kim Crosser (9/30/2015)
OK OK - I give!!! I hereby take the No-UDF pledge.
From now on, I will never use a UDF. Instead, I will only write Select statements that include in-line complex expressions that may run for 20-30 lines per field until they resemble paragraphs from James Joyce's Ulysses. Maintainers may feel like they are navigating through the Colossal Cave when traversing my syntax, but I will get that last microsecond or else!
:hehe:
You could also use iTVFs, that way you keep performance and simplicity in the code. 😀
September 30, 2015 at 10:58 am
Sometimes folks will use a UDF to abstract or reuse the implementation details of some transformation, when adding a computed column on a view or table works better for performance. A good example is a legacy table with date+time contained in a varchar column, and the application developers "don't have a problem with it". You can add a persisted computed column to the table containing the value properly converted to a DateTime type. Not only does it allow you to circumvent repeated conversions, but it also serves as a check constraint to block insertion of invalid values that can't be converted.
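A sketch of what that looks like (the table, column, and CONVERT style are made up - use the style that matches the legacy string format):
-- Persisted computed column: the varchar value is converted once and stored;
-- an INSERT/UPDATE with text that won't convert fails, giving the check-constraint effect.
-- (The ALTER itself will fail if existing rows contain unconvertible text.)
ALTER TABLE dbo.LegacyOrders
    ADD OrderDateTime AS CONVERT(datetime, OrderDateTimeText, 120) PERSISTED;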
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
September 30, 2015 at 2:06 pm
BTW - that demonstration of the iTVF approach (much earlier above) was very helpful and interesting. 🙂
Honestly, I never considered using an iTVF in that type of situation, but I will now.
It does sort of raise the question of why SQL Server is so much slower to return one scalar value than to do a table join, when the logic (and SQL expression) is identical in both.
It also means that one *could* actually write a procedure that would automatically refactor queries with scalar UDFs to iTVFs (automatically convert the UDF to an iTVF, then replace the UDF references in the queries with table references).
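Mechanically the rewrite is just this transformation (using the function names from Alan's test earlier in the thread):
-- Before: scalar UDF, evaluated row by row
SELECT dbo.SVF_FormatNbr(CC, NPA, EXCH, SNUM, EXT) AS nicenum
FROM dbo.sometable;

-- After: the same expression as an iTVF, inlined into the plan via CROSS APPLY
SELECT f.nicenum
FROM dbo.sometable
CROSS APPLY dbo.iTVF_FormatNbr(CC, NPA, EXCH, SNUM, EXT) AS f;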
Hmm... maybe a fun project for a slow day? 😉