January 23, 2008 at 2:50 am
Hi All,
I read in this particular post (http://sqlblog.com/blogs/adam_machanic/archive/2006/08/04/scalar-functions-inlining-and-performance-an-entertaining-title-for-a-boring-post.aspx) that if a scalar-valued function is rewritten as a table-valued function it can perform better ... but I do not really understand why.
Can someone guide me in the right direction ?
Thanks
January 23, 2008 at 6:01 am
I think this is a case of buyer beware, your mileage may vary, some assembly required... You get the drift.
Scalar functions, as Adam very clearly outlines, unless they're very simple & direct, generally lead to RBAR (row-by-agonizing-row) processing (all due deference to Jeff Moden) because the optimizer can't simply unpack them and add them to the execution plan in a meaningful way. So instead, Adam showed how using a table valued function, in that case, resulted in a query that the optimizer could do something with, hence a radical increase in speed (through a reduction in I/O).
It's not the case that ALL table-valued functions will perform better than ALL scalar functions, but rather that there are situations where you achieve significant wins, depending on what the different functions are doing. Further, using Adam's example, you could probably write a query, no function needed, that would perform even faster than the final UDF.
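Just to make the difference concrete, here's the general shape of the two approaches (a made-up sketch with hypothetical table and function names, not Adam's actual code):

-- Scalar UDF: called once per row, so the optimizer can't fold the lookup into the outer plan.
CREATE FUNCTION dbo.MaxQtyForProduct (@ProductID int)
RETURNS int
AS
BEGIN
    RETURN (SELECT MAX(Quantity)
            FROM dbo.OrderDetails
            WHERE ProductID = @ProductID);
END
GO

-- Used like this, it runs row by agonizing row:
SELECT OD.ProductID, OD.Quantity,
       dbo.MaxQtyForProduct(OD.ProductID) AS MaxQty
FROM dbo.OrderDetails OD;
GO

-- Inline table-valued function: a single RETURN SELECT, so the optimizer
-- can expand it into the outer query much like a view.
CREATE FUNCTION dbo.MaxQtyPerProduct ()
RETURNS TABLE
AS
RETURN (SELECT ProductID, MAX(Quantity) AS MaxQty
        FROM dbo.OrderDetails
        GROUP BY ProductID);
GO

SELECT OD.ProductID, OD.Quantity, X.MaxQty
FROM dbo.OrderDetails OD
INNER JOIN dbo.MaxQtyPerProduct() X ON OD.ProductID = X.ProductID;

The second version gives the optimizer something it can join, aggregate, and seek against, which is where the I/O savings come from.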
Personally, at my company, we've found functions just too problematic to use much. Note the keyword, much. We do use them, but we try to very carefully identify places where we can achieve either maintenance & coding improvements at no performance cost, or performance improvements. Other than that, we try to stay away from them because they can be very problematic, as Adam outlined.
Does that help at all?
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
January 23, 2008 at 6:37 am
What I derived is:
Functions can cause row-by-row processing, and that can also result in a table scan because the function has to be executed for each row. The analogy that can be drawn here is the use of an aggregate function in the WHERE clause.
Questions:
In the case of table-valued functions, how does the processing take place? I read that joins can be executed against FunctionReturningTable(parameter). In this case the execution plan generated for the SP will include that of the function ... but when the function is in the WHERE clause, how does the database engine create an execution plan?
How and why, in the above example, does returning a table reduce the logical reads?
January 23, 2008 at 7:53 am
If you look at the execution plans for the two queries, the scalar function looks like the simpler plan. It simply has a clustered index scan and a compute scalar operation, and the scan is apparently the full cost of the execution. The other query shows two scans of the table, a hash match aggregate, and a merge join. Basically, the optimizer is able to see into the table-valued function and incorporate its query into the plan for the outer query. Because it's able to generate a real plan, it can make good choices in terms of performance. I set up the tests on my copy of AdventureWorks and saw the execution times go from 937ms to 78ms.
The reads, according to SQL Trace, go from 374944 to 1253. Again, this has to come from getting an actual plan that seeks out appropriate data & joins it together as opposed to opening the table and checking each and every row, one by one.
The only funny thing was that, based on estimated plan cost when the two scripts were run side by side, the first plan was 1% and the second plan was 99%, even though the actual execution statistics went in exactly the opposite direction.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
January 23, 2008 at 8:02 am
One of the things to watch for with table-valued functions (especially the multi-statement table-valued functions) is that the resultant table (just like a table variable) has no column statistics and no indexes. The optimiser doesn't know how many rows are in the function's result table. That's not an issue for small numbers of rows, but it becomes a very big issue for larger numbers of rows.
If the optimiser misjudges the number of rows, it can result in very poor query plans.
Note, this is a maybe, possibly, could be, not definitely and always. Test carefully.
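As a rough illustration (made-up names, not production code), compare the two forms:

-- Multi-statement TVF: the result comes back through a table variable,
-- which has no column statistics, so the optimiser guesses at the row count.
CREATE FUNCTION dbo.OrdersForCustomer_Multi (@CustomerID int)
RETURNS @Result TABLE (OrderID int, OrderDate datetime)
AS
BEGIN
    INSERT INTO @Result (OrderID, OrderDate)
    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID;
    RETURN;
END
GO

-- Inline TVF doing the same thing: no intermediate table variable,
-- the SELECT is expanded into the calling query and real statistics get used.
CREATE FUNCTION dbo.OrdersForCustomer_Inline (@CustomerID int)
RETURNS TABLE
AS
RETURN (SELECT OrderID, OrderDate
        FROM dbo.Orders
        WHERE CustomerID = @CustomerID);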
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 23, 2008 at 8:11 am
The other place not yet mentioned where this gets even more important to consider is when you use a table-valued function within a CROSS APPLY clause. It's in those cases where it sets up something like a "correlated sub-function".
Again - very powerful potential, but will tend to quickly turn into something not at all efficient on the execution plan side.
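Something like this (made-up names) is the pattern I mean - the function is effectively evaluated once for each outer row:

-- Made-up names, just to show the shape. For each row in Customers, the
-- function is called with that row's CustomerID. Very handy, but if the
-- function can't be inlined, it's row-by-row work in disguise.
SELECT C.CustomerID, C.CustomerName, T.OrderID, T.OrderTotal
FROM dbo.Customers C
CROSS APPLY dbo.TopOrdersForCustomer(C.CustomerID, 3) AS T;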
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 24, 2008 at 12:31 pm
I've benchmarked the type of scalar function in Adam's article against correlated subqueries, since they're doing the same type of thing, and they seem to work at about the same efficiency. When the result set is small you don't see much of a performance problem. When the result set gets into the hundreds or thousands of rows, performance takes a big dive.
I've pretty much always used derived tables when I need to compare a single row to an aggregate value and I usually get good performance.
Something like this:
SELECT
    O.CustomerID, OD.ProductID, OD.Quantity
    , X.MaxQty
FROM
    Orders O
    INNER JOIN OrderDetails OD
        ON O.OrderID = OD.OrderID
    INNER JOIN
        -- Derived table: the max quantity per product for the date range
        (SELECT OD2.ProductID, MAX(OD2.Quantity) AS MaxQty
         FROM Orders O2
         INNER JOIN OrderDetails OD2
             ON O2.OrderID = OD2.OrderID
         WHERE O2.OrderDate BETWEEN '1/1/07' AND '1/31/07'
         GROUP BY OD2.ProductID
        ) AS X
        ON OD.ProductID = X.ProductID
WHERE
    O.OrderDate BETWEEN '1/1/07' AND '1/31/07'
The hard dates are usually just variables in the SProc.
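In other words, the shape in the proc is roughly this (hypothetical names, just a sketch):

-- Hypothetical sketch of the proc: same derived-table query, with the
-- hard-coded dates replaced by parameters.
CREATE PROCEDURE dbo.GetMaxQtyByProduct
    @StartDate datetime,
    @EndDate   datetime
AS
BEGIN
    SET NOCOUNT ON;

    SELECT O.CustomerID, OD.ProductID, OD.Quantity, X.MaxQty
    FROM Orders O
    INNER JOIN OrderDetails OD ON O.OrderID = OD.OrderID
    INNER JOIN
        (SELECT OD2.ProductID, MAX(OD2.Quantity) AS MaxQty
         FROM Orders O2
         INNER JOIN OrderDetails OD2 ON O2.OrderID = OD2.OrderID
         WHERE O2.OrderDate BETWEEN @StartDate AND @EndDate
         GROUP BY OD2.ProductID) AS X
        ON OD.ProductID = X.ProductID
    WHERE O.OrderDate BETWEEN @StartDate AND @EndDate;
END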
I'd have to benchmark a UDF that returns a table against this type of logic to see what happens.
Todd Fifield
June 25, 2008 at 1:53 pm
Matt Miller (1/23/2008)
The other place not yet mentioned where this gets even more important to consider is when you use a table-valued function within a CROSS APPLY clause. It's in those cases where it sets up something like a "correlated sub-function". Again - very powerful potential, but will tend to quickly turn into something not at all efficient on the execution plan side.
GilaMonster (1/23/2008)
One of the things to watch for with table-valued functions (especially the multi-statement table-valued functions) is that the resultant table (just like a table variable) has no column statistics and no indexes. The optimiser doesn't know how many rows are in the function's result table. That's not an issue for small numbers of rows, but it becomes a very big issue for larger numbers of rows. If the optimiser misjudges the number of rows, it can result in very poor query plans.
Note, this is a maybe, possibly, could be, not definitely and always. Test carefully.
I'm assuming that those and the other statements talk about table functions in general, and I believe they are true for non-Inline table functions.
But I'm very interested to hear what you guys think about Inline Table Functions.
If we use inline table functions, would we see a performance decrease?
As far as I understand, when an SQL statement references an Inline Table Function, the parser and query optimizer analyze the source of both the SQL statement and the Inline Table Function and then resolve them into a single execution plan. There is not one plan for the SQL statement and a separate plan for the Inline Table Function (please correct me if I'm wrong). If this is true, would we still have any reason to believe that an Inline Table Function may hurt performance?
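Just so we're talking about the same thing, this is what I mean by an Inline Table Function (a made-up example): a single RETURN SELECT, no BEGIN/END block and no table variable:

-- Made-up example of an Inline Table Function.
CREATE FUNCTION dbo.ActiveOrders (@CustomerID int)
RETURNS TABLE
AS
RETURN (SELECT OrderID, OrderDate, Status
        FROM dbo.Orders
        WHERE CustomerID = @CustomerID
          AND Status = 'Active');
GO

-- My understanding is that a query like this gets one combined plan,
-- as if the function's SELECT had been written inline:
SELECT C.CustomerName, AO.OrderID, AO.OrderDate
FROM dbo.Customers C
CROSS APPLY dbo.ActiveOrders(C.CustomerID) AS AO;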
Thanks,
Luiz.
June 25, 2008 at 3:13 pm
It's ultimately going to depend on what the function is and how it does it. Yes, inlining a function can yield some interesting perf gains, but again, they tend to fall into specific circumstances. So - there's not really a decent way to generalize. In both of the cases you mentioned, even if the function execution itself is faster due to inlining, there are lots of opportunities for derived tables to flat out beat this kind of setup (unless the inline function itself is a derived table setup).
This is just one of those where you need to test, test and test. And did I mention to test?
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
June 25, 2008 at 3:29 pm
The best thing to do, in most cases, is to avoid any kind of function and any kind of correlated sub-query, including CROSS APPLY. Yeah, I know... I'm going to get the same argument that people make for cursors... sometimes, it's not possible to do otherwise. The problem is, a lot of folks just don't know what is and what is not possible... 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
June 26, 2008 at 12:48 pm
I currently use 2 table-valued functions. With those, it's high-performance, many-reuse code, and neither of them uses any table variables. (Inline select functions.) That means they can actually access table/index statistics meaningfully, and can be used for complex, set-based solutions. I've tested, and I get a .1% performance increase when I incorporate their code directly into procs instead of calling the function, but that loss in speed is worth it for the ease of code re-use in these two cases.
(One of them resolves complex hierarchies, the other splits delimited strings.)
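As a rough idea of the shape (this is a made-up sketch, not the actual code, and it assumes a dbo.Tally table of sequential integers starting at 1), the splitter is an inline function along these lines:

-- Made-up sketch, not the actual code. Assumes a dbo.Tally table of
-- sequential integers starting at 1. Inline (single RETURN SELECT), so
-- the optimizer folds it into the calling query.
CREATE FUNCTION dbo.SplitDelimited (@List varchar(8000), @Delim char(1))
RETURNS TABLE
AS
RETURN
(
    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY N.Number),
           Item = SUBSTRING(@Delim + @List + @Delim, N.Number + 1,
                  CHARINDEX(@Delim, @Delim + @List + @Delim, N.Number + 1) - N.Number - 1)
    FROM dbo.Tally N
    WHERE N.Number <= LEN(@Delim + @List)
      AND SUBSTRING(@Delim + @List + @Delim, N.Number, 1) = @Delim
);
GO

-- Example use:
SELECT ItemNumber, Item
FROM dbo.SplitDelimited('10,20,30', ',');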
In almost all other cases, incorporating the code into a proc directly has been the better option.
I don't currently use any multi-statement table-valued functions. I used to use them a lot, but that was because my Access front-end couldn't deal with procs that did the same thing, but could deal with functions that did. (I never did get a good answer from MS on why it was built that way.) Now, I avoid them as much as possible. They're generally performance killers, and can almost always be replaced with code in the proc and a good temp table.
I have a few scalar functions that get used on very small recordsets where the complex calculations can't readily be done in a set-based fashion. All three of these have been tested against multi-step set-based solutions, and found to be better performers, on the size of recordset they get used on.
Otherwise, I've found that tearing apart inline scalars into set-based multi-step updates is generally MUCH faster and better than inline scalars.
For example, I have to regularly calculate the distances between all records in two data sets. One set might be anywhere from 5-thousand to 3-million rows, all with latitude and longitude, and the other set might be anywhere from 100 rows to 5000 rows. (Size between the two is generally proportionate, with larger first sets being because of larger second sets.)
One option, which was tested, was having an inline scalar that takes the lat and long from list one, and compares it to the lat and long from list two, and gives the distance on a row-by-row basis. It was very convenient and easy to build. And horribly slow.
The other option was taking each line of the calculation, adding a column to a join table between the two sets, and calculating one column at a time on the whole set of the join table. This was ugly code, and nowhere near so simple to maintain. But it was amazingly fast.
The difference between the two solutions: on the largest lists, runs that would take 4-6 hours came down to 30-60 seconds, and the smaller ones went from 1-2 hours to under 10 seconds.
Since as many as five or six of these comparisons need to be done in a day sometimes, the faster solution was necessary. Like I say, ugly, not as easy to read, not as easy to maintain, required adding a whole bunch of columns to an otherwise unnecessary table, and so on. But the speed is worth it.
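The shape of it, very roughly (made-up names and a simplified distance formula, not the real code), was something like this:

-- Very rough sketch, made-up names and a simplified formula, just to show
-- the "one column at a time over the whole join table" approach.

-- 1. Materialize the pairs once, with working columns for the intermediate math.
SELECT a.ID AS ID1, a.Lat AS Lat1, a.Lon AS Lon1,
       b.ID AS ID2, b.Lat AS Lat2, b.Lon AS Lon2,
       CAST(NULL AS float) AS DeltaLat,
       CAST(NULL AS float) AS DeltaLon,
       CAST(NULL AS float) AS DistanceMiles
INTO dbo.DistanceWork
FROM dbo.ListOne a
CROSS JOIN dbo.ListTwo b;

-- 2. Fill in one piece of the calculation per set-based UPDATE.
UPDATE dbo.DistanceWork SET DeltaLat = RADIANS(Lat2 - Lat1);
UPDATE dbo.DistanceWork SET DeltaLon = RADIANS(Lon2 - Lon1);

-- 3. Final step: equirectangular approximation (close enough over short
--    distances); the real code used more steps and a more exact formula.
UPDATE dbo.DistanceWork
SET DistanceMiles = 3959.0 * SQRT(
        SQUARE(DeltaLat) +
        SQUARE(COS(RADIANS((Lat1 + Lat2) / 2.0)) * DeltaLon));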
Test various solutions. Try things that may not be obvious. Test them under load and with realistic record set sizes. What works in one case may not work well in another.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
June 26, 2008 at 5:07 pm
The other option was taking each line of the calculation, adding a column to a join table between the two sets, and calculating one column at a time on the whole set of the join table. This was ugly code, and nowhere near so simple to maintain. But it was amazingly fast.
How many rows did you end up with in that table, Gus? Was it a full Cartesian product or just a triangular join ((2-1)=ABS(1-2))
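(By triangular join, I mean something along these lines - each pair shows up only once, since the distance from A to B is the same as from B to A:)

-- Hypothetical sketch of a triangular join: the b.ID > a.ID predicate
-- generates each pair once instead of twice.
SELECT a.ID AS ID1, b.ID AS ID2
FROM dbo.Points a
INNER JOIN dbo.Points b ON b.ID > a.ID;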
--Jeff Moden
Change is inevitable... Change for the better is not.
June 27, 2008 at 7:30 am
At each of my last 3 SQL 2005 performance tuning/analysis client gigs I have addressed tremendous performance issues by removing the use of TVFs (and also table variables). It is amazing how inefficient these features can make queries and how frequently they can lead the optimizer to implement an inefficient query! Nested loops on M+ row tables is soooo ugly. :w00t:
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 27, 2008 at 8:09 am
TheSQLGuru (6/27/2008)
At each of my last 3 SQL 2005 performance tuning/analysis client gigs I have addressed tremendous performance issues by removing the use of TVFs (and also table variables). It is amazing how inefficient these features can make queries and how frequently they can lead the optimizer to implement an inefficient query! Nested loops on M+ row tables is soooo ugly. :w00t:
I absolutely agree, Kevin... not so hidden RBAR on steroids!
--Jeff Moden
Change is inevitable... Change for the better is not.
June 27, 2008 at 8:11 am
Would you guys share some real examples where you stopped using functions in order to increase performance? I mean, could you share the function code and the SQL statement in the proc where the function was used?
I ask this because I also had bad performance using functions, and so I stopped using them. But after more study, I realized I didn't know how to use them properly. I also know a couple of good and experienced DBAs who were making the same mistake. After learning the right way to use functions, I saw no drop in performance. I'm not saying that the performance dropped just a little bit and it was acceptable. I'm saying that I experienced no drop in performance at all.
Thanks a lot,
Luiz