Performance of Joins vs. Functions

  • I have a 1 million row table with lookup values in 7 columns, and I need to get the description for each of those 7 columns from the corresponding lookup tables.

    In this case, is it better to join the core table to all the lookup tables, or is it better to create a function for each lookup column that returns the description for that lookup value?

    The lookup tables will not have more than 100 rows at any given point.

    Any inputs/suggestions would help.

    Thanks

  • I would join the lookup tables to the core table.  Using a function in this case could cause a performance problem.
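    Something along these lines (a sketch only, with made-up table and column names since we don't have your DDL):

    select  c.Col1, l1.Description as Col1Description,
            c.Col2, l2.Description as Col2Description
            -- ...and so on for the remaining lookup columns
    from    dbo.CoreTable c
    join    dbo.Lookup1 l1 on l1.LookupValue = c.Col1
    join    dbo.Lookup2 l2 on l2.LookupValue = c.Col2
    -- join dbo.Lookup3 l3 on l3.LookupValue = c.Col3, etc. for all 7 lookup columns

    With lookup tables that small, the optimizer should handle seven joins like this without any trouble.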

  • I agree with Lynn: there is no value in using a function in this case. Also, I'd suggest creating an index on those 7 columns; it would most likely help performance as they get joined to their description tables.
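    For example (hypothetical index and table names; whether single-column indexes or a wider covering index is better depends on your other queries):

    create index IX_CoreTable_Col1 on dbo.CoreTable (Col1)
    create index IX_CoreTable_Col2 on dbo.CoreTable (Col2)
    -- ...repeat for the remaining lookup columns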

    Lowell


    --Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement for that table to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • It is probably not a good idea to do this with functions for a table with that many rows.  You should read the topic below.

    Demo Performance Penalty of User Defined Functions

    http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=78601

     

  • Agreed as to accessing the base tables directly.  A function is overkill for this purpose.

    And now - veering off-topic for a minute...

    <off topic>

    Michael - you only tested non-deterministic UDFs, so they're going to force what is essentially a table scan (so yes, performance will suck).  I don't think that's necessarily a "fair test" of all UDFs.

    That's one of those ugly things: if you can make your UDF deterministic, it will tend to play somewhat well with larger data sets, meaning it will cooperate with being fired set-based.  Non-deterministic means you're up the creek without a paddle: each row will be fired through the function one at a time.

    Given how hard it is at times to figure out just what will bring it back in line, I'd tend to agree about shying away from any process firing a function 3M times.  Of course, damn near anything that does the SAME thing 3M times would be RBAR, which in itself should be avoided like the bubonic plague when possible.  It's a lot like views: in a lot of circumstances they tend to "obscure" perfectly good indexes, so they essentially "confuse" the optimizer when used to encapsulate a sub-query.

    However, applying that to a UDF fired fewer than 100 times, I'm not so sure.  Sometimes there's "fast enough", especially when the function is written in such a way that it will "play well".

    </off topic>

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Umm ... Matt, I think you are using the terms deterministic / non-deterministic incorrectly for the functions that Michael wrote. They are all deterministic functions - i.e., given identical input, they return identical output. A non-deterministic function is one where you can NOT determine the output from a consistent input.

    His first function -

    create function DateTime1
    (@Year int, @Month int, @Day int, @Hour int, @Minute int, @Second int)
    -- returns a datetime value for the date and time specified
    returns datetime
    as
    begin
    return dateadd(month, ((@Year - 1900) * 12) + @Month - 1,
        dateadd(ss, (@Hour * 3600) + (@Minute * 60) + @Second, @Day - 1))
    end

    will return the exact same result every time it is run with the same inputs. A good example of a non-deterministic function is the GETDATE() function.

  • I'm using the terms the way SQL Server does: when you create that as a SQL UDF, SQL Server considers the UDF non-deterministic.  Try it and see.

    SELECT OBJECTPROPERTY(OBJECT_ID('DateTime1'), 'IsDeterministic')

    SELECT OBJECTPROPERTY(OBJECT_ID('Date'), 'IsDeterministic')

    SELECT OBJECTPROPERTY(OBJECT_ID('Time'), 'IsDeterministic')

    SELECT OBJECTPROPERTY(OBJECT_ID('DateTime2'), 'IsDeterministic')

    Not that I "agree" with SQL Server - they SHOULD be deterministic.  But if SQL Server doesn't agree, it ain't going to use them as deterministic.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt: Technically, the functions in my post are non-deterministic from the SQL Server point of view, but only because they were not created with the "with schemabinding" option.  However, as Richard pointed out, they are deterministic in the sense that they will produce identical output every time they are run with the same input.
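    For reference, here is roughly what I mean (a sketch only - I have not re-run the timing tests this way): recreating the same function with schemabinding should make SQL Server report it as deterministic, since the body only calls deterministic built-ins.

    alter function dbo.DateTime1
    (@Year int, @Month int, @Day int, @Hour int, @Minute int, @Second int)
    returns datetime
    with schemabinding
    as
    begin
    return dateadd(month, ((@Year - 1900) * 12) + @Month - 1,
        dateadd(ss, (@Hour * 3600) + (@Minute * 60) + @Second, @Day - 1))
    end
    go

    -- expected to return 1 once the function is schemabound
    select objectproperty(object_id('dbo.DateTime1'), 'IsDeterministic')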

    Since the discussion on my post is entirely about performance, I think any valid criticism has to demonstrate counter results and the code used to get those results.  If you want to post some results that show there is a significant difference in performance when the function is created with and without the "with schemabinding" option, I would be interested in seeing that.  That is why I posted all the code that I used for the tests, so that other people could examine and verify the results themselves. 

     

  • Here's why I don't think it's a fair test: by running it at a single size, with a single operation, and using it only in an exclusionary way, I don't think you get the full picture.  Here's my test app:

    set nocount on
    go

    select count(date) from T_DATE
    go

    declare @count int
    declare @st datetime

    select @st = getdate()

    select --top 800000
        a.[DATE], 0 + a.[DATE]
    from T_DATE a

    select [MS No Action] = datediff(ms, 0, getdate() - @st)
    go

    declare @count int
    declare @st datetime

    select @st = getdate()

    select --top 800000
        a.date,
        dateadd(month, ((a.YEAR - 1900) * 12) + a.MONTH - 1,
            dateadd(ss, (a.HOUR * 3600) + (a.MINUTE * 60) + a.SECOND, a.DAY - 1))
    from T_DATE a

    select [MS No Function] = datediff(ms, 0, getdate() - @st)
    go

    declare @count int
    declare @st datetime

    select @st = getdate()

    select --top 800000
        a.date, dbo.DateTime1(a.YEAR, a.MONTH, a.DAY, a.HOUR, a.MINUTE, a.SECOND)
    from T_DATE a

    select [MS DateTime1] = datediff(ms, 0, getdate() - @st)
    go

    declare @count int
    declare @st datetime

    select @st = getdate()

    select --top 800000
        a.date, dbo.DateTime2(a.YEAR, a.MONTH, a.DAY, a.HOUR, a.MINUTE, a.SECOND)
    from T_DATE a

    select [MS DateTime2] = datediff(ms, 0, getdate() - @st)
    go

    Now I go through and uncomment the TOP predicates, vary the sample sizes, and run this a few times at each size, forcing a new execution plan at each sample size, and I get:

    Sample size       1,000       2,000       5,000       10,000      250,000        400,000        3,000,000

    MS No Action      110         126         156         233         4,093          6,266          54,500
    MS No Function    93   (85%)  126  (100%) 203  (130%) 250  (107%) 4,003    (98%) 6,263   (100%) 49,640   (91%)
    MS DateTime1      76   (69%)  93    (74%) 186  (119%) 263  (113%) 4,953   (121%) 8,030   (128%) 54,826  (101%)
    MS DateTime2      80   (73%)  140  (111%) 313  (201%) 580  (249%) 14,720  (360%) 22,890  (365%) 96,386  (177%)

    (Times are in milliseconds; the percentages are relative to the "MS No Action" baseline for the same sample size.)

    Which tells me that sample size seems to have a LOT to do with this, and that the function is not nearly as bad a choice as your experiment would suggest.  If I had a bigger machine to spend all day hammering on, we could certainly try bigger samples, but it looks like the difference goes right back down.  I still don't know why, but what I do see is that this is anything BUT a clear-cut case for ANYTHING.

    If anything, I see that the function (the one behaving well) might actually outperform the inline calculations under certain circumstances, and otherwise keeps up pretty closely with the inline code.  I can't honestly tell WHAT the big drop at the end tells me, since it looks like they're running neck and neck at that point.

    I know I must be missing something horribly obvious, and I'd be curious to know what it is.  I just don't think there is any clear answer as to whether a scalar function running on a 100-record sample would perform worse than something inline.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • I don’t see how your last post is relevant to this discussion.  The testing you did completely fails to address the points you raised in your prior post about the performance of deterministic vs. non-deterministic functions, since you didn’t do any tests of deterministic vs. non-deterministic functions.

    The points you brought up about sample size don't seem to be relevant.  My testing was focused on large sample sizes because that is where I would expect to see the biggest impact.  I conducted my tests on three different idle servers, on two different versions of SQL Server 2000 and one version of SQL Server 2005, and all of my test results were consistent in showing big differences between the performance of inline code vs. functions.  I ran my tests many times on each server, and saw extremely consistent results from one run to the next.

    Also, you changed the testing methodology.  I was careful to isolate the results to just the impact of inline code vs. a function.  You chose to return result sets to the client, introducing the impact of network and client performance.  Both can be highly variable, can have a big impact on run time, and can introduce enough run-time variation to overwhelm the difference between inline code and functions.  I specifically coded my test scripts not to return result sets to the client, to better isolate the impact of what I was testing.
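    For example (a sketch of the general approach, not the exact script from my post): assigning the value to a local variable makes the server do all the per-row work without sending a result set back to the client.

    declare @dt datetime

    select @dt = dbo.DateTime1(a.YEAR, a.MONTH, a.DAY, a.HOUR, a.MINUTE, a.SECOND)
    from T_DATE a

    -- @dt ends up holding only the value from the last row processed; the point is
    -- that every row is evaluated while nothing is returned to the client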

     

  • I was addressing your initial assessment that it was probably NOT a good idea to use a function (especially one that the OP wanted to return results from) in a scenario with lots of rows, not the "deterministic versus not" part of the discussion.  That's an optimizer screwup at best, and it doesn't seem to have much impact on performance IN THIS CASE.  I'm not going to generalize, since I'm not convinced that it is in fact true; I remember coming across some discussion outlining how much harder it is for the optimizer to handle non-deterministic functions "well", and it made a lot of sense at the time.  I'm actually pleased that the optimizer did seem to see right through the fake "non-deterministic" property (less than 1% difference in all samples).

    Looking only at a very large sample is arbitrary, and it's dangerous to assume that just because something works well at a large scale it will work well at a small scale.  Comparing performance results based on 3M records to what happens on a 100-record set (the size the OP stated for the number of results to pull) seemed arbitrary to me, so I took issue with it.

    Your test seems to highlight a very specific circumstance where a function does behave poorly.  I just don't think you can make the case that it can be generalized in any way.  Creating a circumstance where there is no output is, to my mind, artificial to begin with, and certainly doesn't represent the majority of circumstances.  The fact that the two versions are outputting the SAME data still provides a basis for comparison, even if there is a margin of error.

    I also can't help but think that there is something else afoot when there is that much of a difference between a function performing poorly in one query (with essentially no output) and the SAME one performing fairly close to par in a different circumstance (with output).  And other than saying "there are times when functions might not work all that well, but I don't have a specific circumstance to describe why or how", I'm not sure what else we've proven.

    Anyway - I'm sorry to have hijacked the thread.  We should probably come up with somewhere more appropriate to continue this if necessary.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • You are mistaken about the OP's requirements.  He said his table had 1 million rows and that the lookup tables had around 100 rows.  If he is scanning that table, the 7 different functions would be called for each of the 1 million rows.  Since each function would be doing a table lookup query, my guess is that the performance would be terrible compared to a multi-table join, so my post was addressing exactly the situation he posted.  In fact, I believe that using functions for table lookups on each row would result in horrible performance, much worse than the effects of the simple scalar functions used in my tests.
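    To be concrete about the kind of function I mean (a hypothetical sketch with made-up names, not code from the OP), each call hides a single-row query against a lookup table:

    create function dbo.GetLookup1Description (@LookupValue int)
    returns varchar(100)
    as
    begin
        declare @Description varchar(100)

        -- one hidden single-row query per call
        select @Description = Description
        from dbo.Lookup1
        where LookupValue = @LookupValue

        return @Description
    end

    Call this (and six more like it) once per row of a 1 million row table and you get roughly 7 million hidden single-row queries, instead of seven set-based joins against 100-row tables.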

    Looking at a small sample is irrelevant, because it is not the situation under discussion, and even very inefficient function calls will have such a short run-time that the difference will not be noticeable.  My post on SQL Team was about situations where a function is being used in queries where there are a large number of rows.

    I explained the reasons why I chose a test that produces no rowset.  It is not meant to represent a typical query, but to isolate the effects of the particular thing being tested without letting the results get overwhelmed by the effects of network and client performance.  The point I am getting at is that the difference in performance between the function and the inline code can be orders of magnitude, but in a query that returns the output to the client, that difference may represent only a tiny fraction of the total query run time, and network and client performance can be so variable that they completely mask the effect of the thing I was actually trying to test.

     

     
