Choosing between a View and Table-valued function

  • I advise my clients to avoid UDFs (other than very simple, non-table-accessing scalar ones) like the plague. They FAR too often lead to row-by-row processing under the covers. Just incredibly bad for performance. And the worst part is that you often don't see it in the graphical query plan, so without SET STATISTICS IO ON you think your UDF solution rocks.
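
    (For the curious, here's a rough way to compare a scalar UDF against the same logic written inline. This is only a sketch; every table and function name below is invented:)

        -- Turn on I/O and timing statistics and compare what comes back in the
        -- Messages tab; the graphical plan alone can make the UDF version look cheap.
        SET STATISTICS IO ON;
        SET STATISTICS TIME ON;

        -- Version 1: scalar UDF in the select list (typically executed once per row).
        SELECT o.OrderID, dbo.fnExpensiveScalar(o.CustomerID) AS SomeValue
        FROM dbo.Orders AS o;

        -- Version 2: the same lookup written inline as a join.
        SELECT o.OrderID, x.SomeValue
        FROM dbo.Orders AS o
        INNER JOIN dbo.SomeLookup AS x ON x.CustomerID = o.CustomerID;

        SET STATISTICS TIME OFF;
        SET STATISTICS IO OFF;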

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I've used UDFs in cases where the select logic was so insanely complex that a single query just couldn't do it (which leaves out views) and it needed to be joined to other queries (which pretty much leaves out procs unless you want to do the create-temp-table-insert-proc-join-to-temp route).

    The thing to watch out for on views is lazy developers who will build a single view with a dozen or more tables in it, hundreds of columns, complex joins, and then use it everywhere. I've seen a view with 3 sub-views and 19 tables, pulling over 200 columns of data, be used in procs that needed two columns from two tables with one join.
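
    (To make that concrete, a hypothetical before-and-after; the view and table names are invented:)

        -- The anti-pattern: a proc that needs two columns reaching through a
        -- mega-view that joins a dozen or more tables behind the scenes.
        SELECT v.CustomerName, v.OrderTotal
        FROM dbo.vwEverything AS v
        WHERE v.OrderID = @OrderID;

        -- What it actually needed: one join between two tables.
        SELECT c.CustomerName, o.OrderTotal
        FROM dbo.Orders AS o
        INNER JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
        WHERE o.OrderID = @OrderID;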

    So, yeah, use views to shorthand your common columns and joins, but do so intelligently and use some common sense. There's no reason to reference a view that forces a dozen joins when you need data from one or two of the tables.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (3/31/2008)


    I've used UDFs in cases where the select logic was so insanely complex that a single query just couldn't do it (which leaves out views) and it needed to be joined to other queries (which pretty much leaves out procs unless you want to do the create-temp-table-insert-proc-join-to-temp route).

    I have personally yet to see a situation where doing the latter wasn't more efficient (often WAYYYY more efficient) than the former. Spooling under the covers by the optimizer due to UDF usage is just a killer.
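
    (For anyone who hasn't seen it, the "create-temp-table-insert-proc-join-to-temp" route being compared here looks roughly like this; the proc and table names are invented:)

        -- Capture the proc's result set in a temp table, then join to it.
        CREATE TABLE #Results
        (
            CustomerID int NOT NULL,
            SomeValue  decimal(18, 2) NOT NULL
        );

        INSERT INTO #Results (CustomerID, SomeValue)
        EXEC dbo.GetComplexResults;

        SELECT o.OrderID, r.SomeValue
        FROM dbo.Orders AS o
        INNER JOIN #Results AS r ON r.CustomerID = o.CustomerID;

        DROP TABLE #Results;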

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Lots of great ideas on this thread...

    At this point, I'd have to say "it depends"... as has already been said, both UDFs and Views can be written in horribly incorrect manners, as can any code. Attempts to "Encapsulate" with views will many times result in very wide, complicated views when only a couple of columns are really needed. They can also be written so badly that a SELECT TOP 10 from a view will take hours to resolve.

    My recommendation is to 1) decide if you really need either... 2) test which one is better for your particular situation... and 3) make sure they're written correctly. UDFs and Views that do too much compared to what's required at the time will certainly impart some form of performance problem.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TheSQLGuru (3/31/2008)


    GSquared (3/31/2008)


    I've used UDFs in cases where the select logic was so insanely complex that a single query just couldn't do it (which leaves out views) and it needed to be joined to other queries (which pretty much leaves out procs unless you want to do the create-temp-table-insert-proc-join-to-temp route).

    I have personally yet to see a situation where doing the latter wasn't more efficient (often WAYYYY more efficient) than the former. Spooling under the covers by the optimizer due to UDF usage is just a killer.

    Yep. There are times when a temp table that can be populated by a proc is the better route. And there are times when a UDF is better, mainly in terms of re-usability.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Here is my take on reusability:

    1) Most code doesn't actually GET reused because a) not all devs know about it, b) devs don't bother checking to see if their particular need for a 'wheel' has already been invented, and/or c) the code isn't exactly what they need.

    2) A template library of good T-SQL queries/snippets giving equivalent functionality to a function would serve the same purpose (and would probably be just as likely not to get reused) 🙂

    3) Saving a few minutes or hours of dev time is nothing compared to the wads of money companies throw at hardware trying to solve performance problems, which doesn't work and which leads to unhappy clients. Bad data structures and/or bad code can easily swamp just about any hardware you can reasonably purchase with your company's resources.

    Sorry, but I have seen way too many cases where UDFs are one of (if not the) leading causes of horrid SQL Server performance and advise my clients strongly to not use them except in very isolated specific situations.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (4/1/2008)


    Here is my take on reusability:

    1) Most code doesn't actually GET reused because a) not all devs know about it, b) devs don't bother checking to see if their particular need for a 'wheel' has already been invented, and/or c) the code isn't exactly what they need.

    2) A template library of good T-SQL queries/snippets giving equivalent functionality to a function would serve the same purpose (and would probably be just as likely not to get reused) 🙂

    3) Saving a few minutes or hours of dev time is nothing compared to the wads of money companies throw at hardware trying to solve performance problems, which doesn't work and which leads to unhappy clients. Bad data structures and/or bad code can easily swamp just about any hardware you can reasonably purchase with your company's resources.

    Sorry, but I have seen way too many cases where UDFs are one of (if not the) leading causes of horrid SQL Server performance and advise my clients strongly to not use them except in very isolated specific situations.

    Hear, hear! {insert sound of clapping and uncontrolled cheering here!}

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TheSQLGuru (4/1/2008)


    Here is my take on reusability:

    1) Most code doesn't actually GET reused because a) not all devs know about it, b) devs don't bother checking to see if their particular need for a 'wheel' has already been invented, and/or c) the code isn't exactly what they need.

    2) A template library of good T-SQL queries/snippets giving equivalent functionality to a function would serve the same purpose (and would probably be just as likely not to get reused) 🙂

    3) Saving a few minutes or hours of dev time is nothing compared to the wads of money companies throw at hardware trying to solve performance problems, which doesn't work and which leads to unhappy clients. Bad data structures and/or bad code can easily swamp just about any hardware you can reasonably purchase with your company's resources.

    Sorry, but I have seen way too many cases where UDFs are one of (if not the) leading causes of horrid SQL Server performance and advise my clients strongly to not use them except in very isolated specific situations.

    I haven't been in that situation yet, so can't speak from experience on it.

    Here's a specific situation where I used a UDF. You tell me if you think it was a good/bad/indifferent idea:

    One of the products that we sell involves hierarchies of customer groups who can see data at levels less than or equal to their own level, but only on their branch of the hierarchy, and definitely not data from levels above their level. The data mostly concerns results from various marketing campaigns.

    There are several reports, a dozen or more web pages, several automated processes (SSIS packages, scheduled procs, etc.), etc., that all have to resolve the customer hierarchies. This involves at least 20 or 30 procs, since different things are needed in each case.

    So, I set up a table-valued UDF that resolves the hierarchy, given the top level, and a couple of scalar UDFs that will resolve the top level given various input parameters (there are five or six ways to find it out).

    Each proc that needs to resolve the hierarchy has a simple join to the UDF, instead of having to resolve the hierarchy itself. This has been very convenient, because I've had to change the resolution code three times in the last six months to accommodate changing customer needs. (This is for a new product line that nobody's ever done before and the needs are evolving as customers realize what they actually need vs what they thought they needed last summer when the whole thing started.)
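
    (For illustration, here's roughly the shape of such a function; the table, column, and function names are all invented, and the real logic is considerably more involved:)

        -- Adjacency-list source table: each group points at its parent.
        -- dbo.CustomerGroup (GroupID, ParentGroupID, GroupName)
        CREATE FUNCTION dbo.fnResolveHierarchy (@TopGroupID int)
        RETURNS TABLE
        AS
        RETURN
        (
            WITH Groups AS
            (
                SELECT cg.GroupID, cg.ParentGroupID, cg.GroupName, 0 AS HLevel
                FROM dbo.CustomerGroup AS cg
                WHERE cg.GroupID = @TopGroupID

                UNION ALL

                SELECT c.GroupID, c.ParentGroupID, c.GroupName, g.HLevel + 1
                FROM dbo.CustomerGroup AS c
                INNER JOIN Groups AS g ON c.ParentGroupID = g.GroupID
            )
            SELECT GroupID, ParentGroupID, GroupName, HLevel
            FROM Groups
        );
        GO

        -- A consuming proc just joins to it:
        -- SELECT r.CampaignID, r.Responses
        -- FROM dbo.CampaignResults AS r
        -- INNER JOIN dbo.fnResolveHierarchy(@TopGroupID) AS h ON r.GroupID = h.GroupID;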

    Because of the UDF, when there have been changes to the hierarchy code, I've only had to change it one place.

    In the production database, with actual customers using it, 6,000-node hierarchies resolve in under a second and the more usual 200-300-node ones in only a few milliseconds. The pages and reports run quickly enough for the users, and there's been no need to throw huge amounts of extra hardware at it.

    So, bad decision on my part? Should I have duplicated the hierarchy resolution code in every proc that needs it?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thinking about another post on this same thread, there is another solution, which would be to have every proc that needs the hierarchy create a temp table around a specific template, then call a proc that populates the table, then join to the temp table in the primary proc.

    This would have the modularity advantage (changes in hierarchy code would only have to be made in one place - the proc that populates the table), and would avoid the UDF solution. It would have the disadvantage of requiring more lines of code in each proc that needs the hierarchy, but not too horribly many.

    But what advantage would this gain in real-world use? It's more work to set up, slightly more work to maintain, and I'm not clear on the ROI on it. What am I missing? Since most of the hierarchies are only 200-300 rows, statistics/indexes won't matter in those cases.
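
    (Roughly, each consuming proc would look like this; the names are invented:)

        -- Each caller creates the temp table from the agreed template...
        CREATE TABLE #Hierarchy
        (
            GroupID       int NOT NULL PRIMARY KEY,
            ParentGroupID int NULL,
            HLevel        int NOT NULL
        );

        -- ...then calls the shared proc, which fills it. A proc can reference a temp
        -- table created by its caller, so the hierarchy logic still lives in one place.
        EXEC dbo.PopulateHierarchy @TopGroupID = @TopGroupID;

        SELECT r.CampaignID, r.Responses
        FROM dbo.CampaignResults AS r
        INNER JOIN #Hierarchy AS h ON r.GroupID = h.GroupID;

        DROP TABLE #Hierarchy;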

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I have a client with a similar situation. On their system we developed an orghierarchy table that gave both the up and down listing for each org, and we used that. With a function in play, some orgs that had lots of levels (or lots of associated data) would wind up with millions or even billions of logical reads, because the estimated 1 row returned from the function in the join caused the optimizer to use nested loop joins. Pre-staging the org info into a temp table and joining to that let the optimizer use a nested loop join for small sets and something much more efficient for large sets.

    Have you examined the estimated rows vs actual rows for some queries? It can be quite a shock to also turn on SET STATISTICS IO and see all of those reads that are happening. The graphical plan doesn't report the function cost correctly, however, at least from what I can tell.

    NOTE: 1M logical reads due to the nested loop joins doesn't translate directly into extremely long run times. Most of the looping will be in-cache hits and thus quite fast. But still, going from 1M reads over 10 sec to 17K reads in 3 or 4 sec with hash joins does ease up on the system quite a bit, especially if it is busy with other work.
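
    (A sketch of that pre-staging pattern; the function and table names are invented:)

        -- Materialize the function's output first so the optimizer sees real row
        -- counts (and statistics) instead of a tiny fixed estimate for the function.
        SELECT h.GroupID, h.HLevel
        INTO #OrgHierarchy
        FROM dbo.fnResolveHierarchy(@TopGroupID) AS h;

        CREATE UNIQUE CLUSTERED INDEX IX_OrgHierarchy ON #OrgHierarchy (GroupID);

        -- Now a large set can get a hash or merge join instead of nested loops.
        SELECT r.CampaignID, SUM(r.Responses) AS TotalResponses
        FROM dbo.CampaignResults AS r
        INNER JOIN #OrgHierarchy AS oh ON oh.GroupID = r.GroupID
        GROUP BY r.CampaignID;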

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • GSquared (4/1/2008)


    But what advantage would this gain in real-world use? It's more work to set up, slightly more work to maintain, and I'm not clear on the ROI on it. What am I missing? Since most of the hierarchies are only 200-300 rows, statistics/indexes won't matter in those cases.

    There is no ROI on it in the next month, or maybe six. But as the data grows, there will be an increasing ROI that will eventually become paramount. Dunno for sure how or why, but unless you plan for unknown large-scale growth and the related performance, it'll bite you when you least expect it.

    The real thing is, I don't think I'd do it in any of the ways mentioned. I'd probably use some form of Celko's "Nested Set" hierarchy instead of an "adjacency model" such as what you are currently using. See the following URL... it's an oldie but a goody... 😉

    http://www.intelligententerprise.com/001020/celko.jhtml?_requestid=1266295

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/1/2008)


    GSquared (4/1/2008)


    But what advantage would this gain in real-world use? It's more work to set up, slightly more work to maintain, and I'm not clear on the ROI on it. What am I missing? Since most of the hierarchies are only 200-300 rows, statistics/indexes won't matter in those cases.

    There is no ROI on it in the next month, or maybe six. But as the data grows, there will be an increasing ROI that will eventually become paramount. Dunno for sure how or why, but unless you plan for unknown large-scale growth and the related performance, it'll bite you when you least expect it.

    The real thing is, I don't think I'd do it in any of the ways mentioned. I'd probably use some form of Celko's "Nested Set" hierarchy instead of an "adjacency model" such as what you are currently using. See the following URL... it's an oldie but a goody... 😉

    http://www.intelligententerprise.com/001020/celko.jhtml?_requestid=1266295

    I looked into the nested sets solution. Not workable in this situation because of the number of updates and changes that occur in the hierarchies on a regular basis.

    I've never yet gotten an answer on the nested sets hierarchies on how to handle this kind of situation:

    Executive A has 6 managers, each manager has 1 or more offices, each office has an unlimited number of levels of divisions, separated into departments, separated into units, and so on. Then, office 126 is moved from manager 5 to manager 2 in the same week (but not at the same time) as office 13 is moved from manager 1 to manager 3, and a week later, unit 5 in department 31 of division 2 under office 12 (manager 1) is moved to division 6 under manager 2.

    The nested sets version is very fast on queries, but essentially requires either the ability to see the future, or immensely complex rebuilds to deal with that kind of reorganization. If I'm wrong, and there's a simple solution for that, please point me to it. I've looked, but I've yet to see an answer on that point.

    The same sort of difficulty applies to creating a pre-populated table of all possible hierarchies. It would require, in this business case, an incredible level of complexity of code to deal with the constant updates.

    An adjacency model requires that one row be updated with a new value in one column for each move.
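
    (For instance, the office-126 move described above would be nothing more than this; the names are invented:)

        UPDATE dbo.CustomerGroup
        SET ParentGroupID = @NewManagerID  -- e.g. manager 2's GroupID
        WHERE GroupID = @OfficeID;         -- office 126's GroupID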

    So I have to balance Select performance vs Update/Insert/Delete performance in this case.

    I have load tested the solution I came up with. Thousands of simultaneous queries on hierarchies of up to 50,000 nodes (average of 300). At that level, yeah, the current hardware begins to crap out and you get about a 1% time-out rate. Always on the biggest hierarchies. To achieve that level of activity, the business will have to increase sales volume by about 10,000%. Not gonna say it won't happen, but that also means that updates/inserts/etc., will increase by about the same volume. So, if I go with a solution that will never, ever time out, even with 10k times current business, I end up with a solution that will almost certainly time out nearly every time on updates.

    Please tell me what I'm missing here.

    And Jeff, what I was talking about was the ROI on switching to procs and temp tables instead of a UDF. Since most of the hierarchies are in the 200-300 rows range, there isn't any performance gain from index/statistic use on those. That means the complexity of the code would increase, but there would be 0 performance gain in 99% of use. That's why I say no ROI. And no future ROI from it. Not in this case.

    This weekend, when I can run the test servers up to the "there's smoke coming out of the hard drives" level of load test, I'll try some temp table vs UDF runs. Maybe I'm wrong about how that will work. And it'll be fun either way. 🙂

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • TheSQLGuru (4/1/2008)


    I have a client with a similar situation. On their system we developed an orghierarchy table that gave both the up and down listing for each org, and we used that. With a function in play, some orgs that had lots of levels (or lots of associated data) would wind up with millions or even billions of logical reads, because the estimated 1 row returned from the function in the join caused the optimizer to use nested loop joins. Pre-staging the org info into a temp table and joining to that let the optimizer use a nested loop join for small sets and something much more efficient for large sets.

    Have you examined the estimated rows vs actual rows for some queries? It can be quite a shock to also turn on SET STATISTICS IO and see all of those reads that are happening. The graphical plan doesn't report the function cost correctly, however, at least from what I can tell.

    NOTE: 1M logical reads due to the nested loop joins doesn't translate directly into extremely long run times. Most of the looping will be in-cache hits and thus quite fast. But still, going from 1M reads over 10 sec to 17K reads in 3 or 4 sec with hash joins does ease up on the system quite a bit, especially if it is busy with other work.

    Oh yeah, I've definitely seen the LIES in estimated rows vs actual rows, and the lies in estimated costs, in execution plans.

    If the data was a little more static, a pre-loaded table would definitely be the way to go. SQL 2008 does something similar to that with the hierarchy data type. Unfortunately, the data I'm dealing with changes constantly, so the overhead of maintaining pre-queried data is higher than the cost of querying it real-time from the raw data.
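
    (For reference, the SQL 2008 hierarchy data type mentioned above looks something like this; the table and names are invented:)

        CREATE TABLE dbo.CustomerGroupPath
        (
            GroupNode  hierarchyid   NOT NULL PRIMARY KEY,
            GroupLevel AS GroupNode.GetLevel() PERSISTED,
            GroupName  nvarchar(100) NOT NULL
        );

        -- All descendants of a given node:
        -- SELECT GroupName
        -- FROM dbo.CustomerGroupPath
        -- WHERE GroupNode.IsDescendantOf(@TopNode) = 1;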

    My point in all of this is simply that a UDF solved a real problem very efficiently in this particular case. I certainly don't recommend them for general use. 99% of the time, they actually slow things down too much.

    (Early on, I built everything with views. Later, I realized I could use UDFs to "make them compiled" and thought that was the way to go. Now, it's almost all highly specialized procs. But I do still find the occasional view to be useful to support extremely common joins. And I still find the occasional UDF useful for more complex common selects. I'm still learning every day and discarding bad habits on a routine basis.)

    If you have a better solution for a volatile hierarchy than what I set up, if I'm mistaken about the maintenance overhead on updates in your solution, if I'm misreading something here, please let me know.

    I already took this hierarchy from crashing if one customer wanted anything with more than 100 nodes (the prior developer used recursive functions with multiple cursors), to hundreds of concurrent users having no problem with hierarchies up to thousands of nodes, and if I can make it even better than that, I'm very, very interested.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (4/1/2008)


    The nested sets version is very fast on queries, but essentially requires either the ability to see the future, or immensely complex rebuilds to deal with that kind of reorganization. If I'm wrong, and there's a simple solution for that, please point me to it. I've looked, but I've yet to see an answer on that point.

    The same sort of difficulty applies to creating a pre-populated table of all possible hierarchies. It would require, in this business case, an incredible level of complexity of code to deal with the constant updates.

    An adjacency model requires that one row be updated with a new value in one column for each move.

    Heh... I know I'm not supposed to cheat, but I do... I maintain the adjacency model... when there's a move (not just an insert), the whole bloody world changes along with the position being moved, but as you said, those types of changes are very easy to make in an adjacency model. Right after those changes are done, BOOM, I run the code to rebuild the nested set model. When it's done building, I simply "flip" tables (views in 2k and synonyms in 2k5) so the web services and the users see about a 10ms delay, if they see it at all. Properly written, the changes don't take that long to run.
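
    (A rough sketch of that "flip" using a 2k5 synonym; the proc and table names are invented:)

        -- Rebuild the nested set copy into an offline table from the adjacency
        -- model, then repoint the synonym so readers switch almost instantly.
        EXEC dbo.RebuildNestedSet @TargetTable = N'dbo.HierarchyNestedSet_B';

        BEGIN TRANSACTION;
            IF OBJECT_ID(N'dbo.HierarchyNestedSet', N'SN') IS NOT NULL
                DROP SYNONYM dbo.HierarchyNestedSet;
            CREATE SYNONYM dbo.HierarchyNestedSet FOR dbo.HierarchyNestedSet_B;
        COMMIT;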

    However... if you already have a working system, the ROI you spoke of becomes incredibly important in weighing what the long-term tradeoffs caused by growth actually are. I like to anticipate growth, but if you've already done that and are a bazillion percent sure that virtually no growth will occur, then why spend the money? Ya just gotta be sure, that's all.

    By the way, there is another alternative to both the adjacency and nested set models. How many total "nodes" do you have, and what datatype do you use as the PK ("ChildID" in this case)? I need to make sure we can even do it before I spend any time suggesting it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • G2, your issue seems to be different in nature from my client's. They took the hierarchy as an initial set of data and then joined it to between 3 and 7 other tables to extract reporting data. It was the nested loop joins picked by the optimizer that were the killer, due to the sometimes very large numbers of rows that came out of the joins.

    I am actually impressed that you were able to develop a system as functional as it is for such a diverse and complex dataset.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
