October 5, 2010 at 7:39 pm
I have two tables: a basic data table (8,000 records, 24 columns), nothing fancy, which I join to a "date" table with 80,000 records (one record for each date between 1900 and 2199), a standard data warehouse date dimension table. The primary key on the date table is the "date" column, a datetime field with no time portion recorded. When I join to it from my data table on a datetime column (which also does not record the time portion) I get extremely poor performance. Retrieving 70 rows from the basic table using its primary key takes 1 second; when I join it to the date table (same select statement) it takes 3 minutes.
Here is the select statement:
select cb.day_number_, cb.begin_date_, cdd.dayofweek_, cdd.dayofmonth_, cdd.MonthName_
from cdDate_ cdd join skClassBlock_ cb on (cdd.date_ = cb.begin_date_)
where cb.class_id_ = 3147
Sorry, I can't provide the table definitions; they are on my other computer. If needed I'll post them in a follow-up. The query plan shows "Nested Loops" as the physical operation and Inner Join as the logical operation, with an estimated number of rows of 4,455,990 and an estimated row size of 40 B.
From the information provided, can someone explain why the join is so costly/slow? I would think it would grab the matching 70 rows in the data table and then join to the date table one for one; why the nested loop/table scan?
I don't know if I've provided enough information; if not, I can post table/index specifics tomorrow.
Thanks for any help or ideas,
James.
October 5, 2010 at 8:32 pm
The most useful thing at this stage would probably be an actual execution plan. Run the problem query with 'Actual Execution Plan' on, right-click the resulting graphical plan, save as, and attach to a post.
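If the GUI route is inconvenient, the actual plan can also be captured directly in a query window. A sketch, using the query from the first post:

```sql
-- Return the actual execution plan as an XML column alongside the results.
-- The XML can be saved with a .sqlplan extension and opened graphically.
SET STATISTICS XML ON;
GO

SELECT cb.day_number_, cb.begin_date_,
       cdd.dayofweek_, cdd.dayofmonth_, cdd.MonthName_
FROM cdDate_ AS cdd
JOIN skClassBlock_ AS cb ON cdd.date_ = cb.begin_date_
WHERE cb.class_id_ = 3147;
GO

SET STATISTICS XML OFF;
GO
```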
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
October 6, 2010 at 3:00 am
Please post table definitions, index definitions and execution plan as per http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 6, 2010 at 7:08 am
I appreciate you all taking a look. Sorry I couldn't post this last night (technical issues 🙂).
I'm attaching the table/function definitions along with the query plan.
I had forgotten that the begin_date_ column I'm linking on is a computed column that uses a function. I would have thought it wouldn't have a huge impact, since that function should be executed only once for each of the 70 rows being retrieved, with the result of the computation then used to link to the date dimension via its primary key.
I suppose it is likely I just can't see the forest for the trees, especially since the computed column's functions deal directly with the date table as well. Probably something in there is preventing the straightforward join I would expect if these were all regular columns.
Any suggestions will be greatly appreciated, even design changes if there is poorly constructed code. This stuff has evolved over the course of several years, and it's easy to get twisted into knots without realizing it.
Again thanks for taking a look.
James.
October 6, 2010 at 2:46 pm
Gail, I did read your link on how to post performance problems. Thanks for that article; it was very informative. Fortunately I've not had too many performance issues, but I've bookmarked it for future reference. Anyway, I was unable to get an actual execution plan because the query doesn't seem to complete when that option is on. It ran for about 30 minutes and then ran my test computer out of disk space (tempdb exploded).
Thanks,
James.
October 6, 2010 at 2:53 pm
Then post the estimated plan. Is that what you posted earlier?
We prefer actual plans, because there's far more info in there, but if the query won't finish, the estimated plan's an adequate alternative.
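For reference, the estimated plan can also be captured without executing the query at all (a sketch; statements compiled while SHOWPLAN_XML is on are not actually run):

```sql
-- Compile only: returns the estimated plan as XML without running the query.
SET SHOWPLAN_XML ON;
GO

SELECT cb.day_number_, cb.begin_date_,
       cdd.dayofweek_, cdd.dayofmonth_, cdd.MonthName_
FROM cdDate_ AS cdd
JOIN skClassBlock_ AS cb ON cdd.date_ = cb.begin_date_
WHERE cb.class_id_ = 3147;
GO

SET SHOWPLAN_XML OFF;
GO
```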
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 6, 2010 at 3:04 pm
Yes, my original post was the estimated plan. I'll try to clean up some disk space (I've only got 15 GB free), see if I can get the query to finish, and see what the plan looks like (hopefully the plan doesn't take that much space 🙂).
Thanks,
James
October 7, 2010 at 7:03 am
I have a workaround using a temp table that returns the results I require in less than a second, though I still don't understand what is causing the problem. Here is the working code:
select * into #t1
from skClassBlock_ where Class_id_ = 3147;
select cb.Day_Number_, cb.Begin_Date_, cdd.DayOfWeek_, cdd.DayOfMonth_, cdd.MonthName_
from cdDate_ cdd Join #t1 cb on (cdd.Date_ = cb.Begin_Date_)
where cb.Class_id_ = 3147;
I have to believe that storing the computed datetime column in the temp table is what eliminates the problem, because joining the temp table to the date dimension causes no problems. Something about using the computed field in a join back to the table used in the computations seems to tie it in knots.
If anyone has any ideas on what to change to eliminate the problem, it would be appreciated, as both of these tables are key in this database and I'm likely to run into this problem often as we develop reports.
Thanks,
James.
October 9, 2010 at 11:54 am
James,
I've written a couple of articles here on using calendar tables to simplify code and improve performance. I use them extensively when applicable. One of the keys to using them is to have an index on the date column in the data table you join to the calendar table.
The begin_date in skClassBlock is computed and this is then joined to the calendar table. I would make it a persisted computed column and then create an index on it.
Todd Fifield
October 9, 2010 at 7:45 pm
tfifield (10/9/2010)
The begin_date in skClassBlock is computed and this is then joined to the calendar table. I would make it a persisted computed column and then create an index on it.
Sadly, you can't persist a computed column based on a function that performs data access (even if the function is otherwise deterministic).
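A minimal sketch of what happens if you try (table and function names here are illustrative, not the OP's real schema, and the error text is paraphrased):

```sql
CREATE TABLE dbo.Parent
(
    id integer PRIMARY KEY,
    start_date date NOT NULL
);
GO
CREATE FUNCTION dbo.GetStartDate (@id integer)
RETURNS date
AS
BEGIN
    -- Performs data access: reads dbo.Parent
    RETURN (SELECT start_date FROM dbo.Parent WHERE id = @id);
END;
GO
-- The non-persisted computed column is accepted...
CREATE TABLE dbo.Child
(
    id integer PRIMARY KEY,
    parent_id integer NOT NULL,
    start_date AS dbo.GetStartDate(parent_id)
);
GO
-- ...but attempting to persist it fails, with an error along the lines of
-- "Computed column cannot be persisted because the column does user or
-- system data access."
ALTER TABLE dbo.Child
    ADD start_date_persisted AS dbo.GetStartDate(parent_id) PERSISTED;
```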
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
October 9, 2010 at 8:08 pm
For anyone else who would like to look at this and see why T-SQL scalar functions that do data access are such a pig, here's a cut-down repro script:
USE Sandpit;
GO
DROP TABLE
dbo.ClassBlock,
cdDate;
GO
DROP FUNCTION
dbo.GetBeginDate,
dbo.GetBusinessDayDate;
GO
DROP TABLE
dbo.skClass;
GO
CREATE FUNCTION
dbo.GetBusinessDayDate
(
@BeginDate DATE,
@BusinessDays INTEGER
)
RETURNS DATE
AS
BEGIN
-- Not the real calculation, but enough for the demo
RETURN DATEADD(DAY, @BusinessDays, @BeginDate);
END;
GO
CREATE TABLE
dbo.skClass
(
id BIGINT PRIMARY KEY,
begin_date DATE NOT NULL,
);
GO
CREATE FUNCTION
dbo.GetBeginDate
(
@class_id BIGINT,
@day_number INTEGER
)
RETURNS DATE
AS
BEGIN
DECLARE @Date DATE;
SELECT @Date = begin_date
FROM dbo.skClass
WHERE id = @class_id;
RETURN dbo.GetBusinessDayDate(@Date, @day_number);
END
GO
CREATE TABLE
dbo.ClassBlock
(
id BIGINT IDENTITY PRIMARY KEY,
class_id BIGINT NOT NULL,
day_number INTEGER NOT NULL,
begin_date AS dbo.GetBeginDate(class_id, day_number),
);
GO
CREATE TABLE
dbo.cdDate
(
id BIGINT IDENTITY NOT NULL,
base_date DATETIME NOT NULL PRIMARY KEY,
day_of_week TINYINT NOT NULL,
day_of_month TINYINT NOT NULL,
);
GO
--
-- Add some data to cdDate
--
WITH Dates (the_date)
AS (
SELECT TOP (1000)
DATEADD(
DAY,
ROW_NUMBER() OVER (ORDER BY (SELECT 0)),
GETDATE())
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3
)
INSERT dbo.cdDate
(base_date, day_of_week, day_of_month)
SELECT D.the_date,
DATEPART(WEEKDAY, D.the_date),
DATEPART(DAY, D.the_date)
FROM Dates D;
GO
--
-- Add some data to ClassBlock
--
WITH Numbers (n)
AS (
SELECT TOP (100)
ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3
)
INSERT dbo.ClassBlock
(class_id, day_number)
SELECT 3147, N.n
FROM Numbers N;
GO
--
-- One demo row for skClass
--
INSERT dbo.skClass
(id, begin_date)
VALUES (1, GETDATE());
GO
--
-- The problematic query
--
SELECT cb.day_number,
cb.begin_date,
cdd.day_of_week,
cdd.day_of_month
FROM dbo.cdDate cdd
JOIN dbo.ClassBlock cb
ON (cb.begin_date = cdd.base_date)
WHERE cb.class_id = 3147
OPTION (RECOMPILE);
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
October 9, 2010 at 8:31 pm
Modified version that writes the result of the function call to an ordinary column:
USE Sandpit;
GO
DROP TABLE
dbo.ClassBlock,
cdDate;
GO
DROP FUNCTION
dbo.GetBeginDate,
dbo.GetBusinessDayDate;
GO
DROP TABLE
dbo.skClass;
GO
CREATE FUNCTION
dbo.GetBusinessDayDate
(
@BeginDate DATE,
@BusinessDays INTEGER
)
RETURNS DATE
AS
BEGIN
-- Not the real calculation, but enough for the demo
RETURN DATEADD(DAY, @BusinessDays, @BeginDate);
END;
GO
CREATE TABLE
dbo.skClass
(
id BIGINT PRIMARY KEY,
begin_date DATE NOT NULL,
);
GO
CREATE FUNCTION
dbo.GetBeginDate
(
@class_id BIGINT,
@day_number INTEGER
)
RETURNS DATE
AS
BEGIN
DECLARE @Date DATE;
SELECT @Date = begin_date
FROM dbo.skClass
WHERE id = @class_id;
RETURN dbo.GetBusinessDayDate(@Date, @day_number);
END
GO
CREATE TABLE
dbo.ClassBlock
(
id BIGINT IDENTITY PRIMARY KEY,
class_id BIGINT NOT NULL,
day_number INTEGER NOT NULL,
begin_date DATE NOT NULL,
);
GO
CREATE TABLE
dbo.cdDate
(
id BIGINT IDENTITY NOT NULL,
base_date DATE NOT NULL PRIMARY KEY,
day_of_week TINYINT NOT NULL,
day_of_month TINYINT NOT NULL,
);
GO
--
-- One demo row for skClass
--
INSERT dbo.skClass
(id, begin_date)
VALUES (3147, GETDATE());
GO
--
-- Add some data to cdDate
--
WITH Dates (the_date)
AS (
SELECT TOP (1000)
DATEADD(
DAY,
ROW_NUMBER() OVER (ORDER BY (SELECT 0)),
GETDATE())
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3
)
INSERT dbo.cdDate
(base_date, day_of_week, day_of_month)
SELECT D.the_date,
DATEPART(WEEKDAY, D.the_date),
DATEPART(DAY, D.the_date)
FROM Dates D;
GO
--
-- Add some data to ClassBlock
--
WITH Numbers (n)
AS (
SELECT TOP (100)
ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3
)
INSERT dbo.ClassBlock
(class_id, day_number, begin_date)
SELECT 3147,
N.n,
dbo.GetBeginDate(3147, N.n)
FROM Numbers N;
GO
--
-- The problematic query
--
SELECT cb.day_number,
cb.begin_date,
cdd.day_of_week,
cdd.day_of_month
FROM dbo.cdDate cdd
JOIN dbo.ClassBlock cb
ON (cb.begin_date = cdd.base_date)
WHERE cb.class_id = 3147
OPTION (RECOMPILE);
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
October 11, 2010 at 8:40 am
Paul, thanks for the example that recreates the problem in miniature and with my own tables. It was quite helpful in seeing that the problem lies with the function and not something I was overlooking elsewhere. I really appreciate you taking the time to throw that together.
I guess there is no way to convince the engine to perform the function call last, only for the result set (probably some mathematics over my head that dictates it doing it the way it does).
Your last example definitely shows the speed improvement, though of course it wouldn't work in my case, as the class block begin date has to stay dynamic based on the parent table's begin date, so it has to be calculated on each call since I don't know when the class begin date might change. Though I suppose I could manage that through a trigger that updates the child table appropriately. I suppose that might be more efficient.
Does anyone have any thoughts from a design perspective on which would be the better solution (trigger or function)? I know the drawbacks of the function (performance), but are there any hidden gotchas with the trigger?
Thanks
October 11, 2010 at 8:17 pm
-=JLK=- (10/11/2010)
Paul, thanks for the example that recreates the problem in miniature and with my own tables. It was quite helpful in seeing that the problem lies with the function and not something I was overlooking elsewhere. I really appreciate you taking the time to throw that together.
No worries.
-=JLK=- (10/11/2010)
I guess there is no way to convince the engine to perform the function call last, only for the result set (probably some mathematics over my head that dictates it doing it the way it does).
There is a way, using the rig with begin_date as a computed column. It's based on the same idea as using a temporary table:
WITH PreCalc
AS (
SELECT TOP (9223372036854775807)
*
FROM dbo.ClassBlock CB
WHERE CB.class_id = 3147
ORDER BY
CB.begin_date
)
SELECT PC.day_number,
PC.begin_date,
CDD.day_of_week,
CDD.day_of_month
FROM dbo.cdDate CDD
JOIN PreCalc PC
ON (PC.begin_date = CDD.base_date)
OPTION (RECOMPILE);
I'm including that code for interest's sake more than anything. It uses a trick called 'intermediate materialization' - you can find more details in this blog entry from Adam Machanic: http://sqlblog.com/blogs/adam_machanic/archive/2006/10/03/exploring-the-secrets-of-intermediate-materialization.aspx
The reason you get such a poor plan with the function is interesting. Essentially it is a limitation of the optimizer, since in this particular case, with this particular function, it is quite possible to reorder things to produce a fast plan. The key, as you have realized, is to compute the current value for begin_date from the function, before performing the join. The optimizer is very conservative when it comes to dealing with non-deterministic functions (yours is non-deterministic since it does data access). Rather than evaluate the content of the function (which might have changed at query run time) to see if it is safe to reorder in the plan or not, the optimizer treats it as a 'black box' and does not reorder its position in the query plan. This is a safe approach, but does have the potential to produce deeply sub-optimal plans 🙂
The deeper reason is that your function might (and does!) call another function, which might call another function, and so on and so on. Each of these calls would have to be checked at run time to see if the definition had changed, and if so, fully evaluate all possible code paths to see if it produces a deterministic result. Even if this were possible, the overhead of such a recursive evaluation on every call to the function would likely be unacceptable. For all these reasons and more, I generally strongly encourage people to avoid T-SQL scalar functions, and scalar functions that perform data access in particular.
If you must use T-SQL scalar functions, consider adding the WITH SCHEMABINDING clause to the function definition. With this in place, SQL Server will evaluate the function for determinism (assuming no data access!) which can produce much better plans in many scenarios (a function that is known to produce deterministic results can be reordered in the query plan, giving the optimizer more options). See http://blogs.msdn.com/b/sqlprogrammability/archive/2006/05/12/596424.aspx for a fuller discussion of the benefits of schemabinding UDFs.
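A sketch of the syntax (the function body here is the simplified stand-in from the repro script, not the real business-day calculation):

```sql
CREATE FUNCTION dbo.GetBusinessDayDate
(
    @BeginDate DATE,
    @BusinessDays INTEGER
)
RETURNS DATE
WITH SCHEMABINDING  -- lets SQL Server verify the function is deterministic
AS
BEGIN
    -- Not the real calculation, but enough for the demo
    RETURN DATEADD(DAY, @BusinessDays, @BeginDate);
END;
GO

-- Returns 1 once SQL Server has verified the function as deterministic
SELECT OBJECTPROPERTY(OBJECT_ID(N'dbo.GetBusinessDayDate'), 'IsDeterministic');
```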
-=JLK=- (10/11/2010)
Your last example definitely shows the speed improvement, though of course it wouldn't work in my case, as the class block begin date has to stay dynamic based on the parent table's begin date, so it has to be calculated on each call since I don't know when the class begin date might change. Though I suppose I could manage that through a trigger that updates the child table appropriately. I suppose that might be more efficient. Does anyone have any thoughts from a design perspective on which would be the better solution (trigger or function)? I know the drawbacks of the function (performance), but are there any hidden gotchas with the trigger?
I would definitely use a (correctly written) trigger.
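As a rough sketch only, using the simplified demo tables from the second script (where ClassBlock.begin_date is an ordinary column); note it is written set-based against the inserted/deleted pseudo-tables, so multi-row updates are handled correctly:

```sql
CREATE TRIGGER dbo.trg_skClass_SyncBeginDate
ON dbo.skClass
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Recompute child begin dates only for parents whose begin_date changed
    UPDATE cb
    SET cb.begin_date = dbo.GetBusinessDayDate(i.begin_date, cb.day_number)
    FROM dbo.ClassBlock AS cb
    JOIN inserted AS i
        ON i.id = cb.class_id
    JOIN deleted AS d
        ON d.id = i.id
    WHERE d.begin_date <> i.begin_date;
END;
```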
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
October 12, 2010 at 1:22 pm
Paul,
That's what I get for going through the forum at 2 AM. Ooops!
Todd Fifield