October 5, 2010 at 7:39 pm
I have two tables, one a basic data table (8000 records, 24 columns) nothing fancy which I join to a "date" table, which has 80,000 records (one record for each date between 1900 and 2199) a standard data warehouse date dimension table. the primary key on the date table is the "date" column which is a datetime field which has no time recorded. When I join to it from my data table on a datetime column (which also does not record the time portion) I get extremely poor performance. to retrieve 70 rows from the basic table using it's primary key takes 1 second, when I join it to the date table (same select statement) it takes 3 minutes.
Here is the select statement:
select cb.day_number_, cb.begin_date_, cdd.dayofwekk_, cdd.dayofmonth_, cdd.MonthName_
from cdDate_ cdd join skClassBlock_ cb on (cdd.date_ = db.begin_date_)
where cb.class_id_ = 3147
Sorry I can't provide the table definitions they are on my other computer. If needed I'll post in a followup. The query plan shows "nested Loops" as the physical operations and inner join as the logical operation with an estimated number of rows of 4455990 and estimated row size 4o B
From the information provided can someone explain why the join is so costly/slow? I would think it would grab the matching 70 rows in the data table and then join to the date table one for one, why the nested loop/table scan?
I don't know if I've provided enough information, if not I can post table/index specifics tomorrow.
Thanks for any help or ideas,
James.
October 5, 2010 at 8:32 pm
The most useful thing at this stage would probably be an actual execution plan. Run the problem query with 'Actual Execution Plan' on, right-click the resulting graphical plan, save as, and attach to a post.
October 6, 2010 at 3:00 am
Please post table definitions, index definitions and execution plan as per http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 6, 2010 at 7:08 am
I appreciate you all taking a look. Sorry I couldn't post this last night (technical issues 🙂
I'm attaching The table/function definitions along with the query plan
I had forgotten that the begin_date_ column I'm linking on is a computed column using a function but I would have thought that it wouldn't be a huge impact since that function should be executed only once for each of the 70 rows being retrieved and then the result of the computation used to link to the date dimension using the date dimension's primary key.
I suppose it is likely I just can't see the forest for the trees, especially since the computed columns functions deal directly with the date table also. Probably something in there is what is preventing the straight forward join I would expect if these were all regular columns.
Any suggestions will be greatly appreciated, even design changes if there is poorly constructed code. This stuff has evolved over the course of several years and it's easy to get twisted into knots without realizing it.
Again thanks for taking a look.
James.
October 6, 2010 at 2:46 pm
Gail, I did read your link on how to post performance problems. Thanks for that article it was very informative, fortunately I've not had to many performance issues but I've bookmarked your article for future reference. Anyway I was unable to get an Actual Performance plan because it doesn't seem to complete when that option is on. It ran for about 30 minutes and then ran my test computer out of disk space (tempdb exploded).
Thanks,
James.
October 6, 2010 at 2:53 pm
Then post the estimated plan. Is that what you posted earlier?
We prefer the actual plans, because there's far more info in there, but if the query won't finish the estimated plan's an adequate alternative.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 6, 2010 at 3:04 pm
Yes my original post was the estimate. I'll try to clean up some disk space (only got 15 gb free) and see if I can get it to finish and see what the plan looks like (hopefully the plan doesn't take that much space 🙂
Thanks,
James
October 7, 2010 at 7:03 am
I have a work around using a temp table the returns the results I require in less than a second, though I still don't understand what is causing the problem. Here is the working code:
select * into #t1
from skClassBlock_ where Class_id_ = 3147;
select cb.Day_Number_, cb.Begin_Date_, cdd.DayOfWeek_, cdd.DayOfMonth_, cdd.MonthName_
from cdDate_ cdd Join #t1 cb on (cdd.Date_ = cb.Begin_Date_)
where cb.Class_id_ = 3147;
I have to believe that storing the computed datetime column into the temp table is what eliminates the problem because then joining it to the date dimension causes no problems so something about using the computed field in a join back to the table used in the computations seems to tie it in knots.
If anyone has any ideas on what to change to eliminate the problem it will be appreciated as both of these tables are key in this database and I'm likely to run into this problem often as we develop reports.
Thanks,
James.
October 9, 2010 at 11:54 am
James,
I've written a couple of articles here on using calendar tables to simplify code and better performance. I use them extensively when applicable. One of the keys to using them is to have an index on the date column in the data table you join to the calendar table.
The begin_date in skClassBlock is computed and this is then joined to the calendar table. I would make it a persisted computed column and then create an index on it.
Todd Fifield
October 9, 2010 at 7:45 pm
tfifield (10/9/2010)
The begin_date in skClassBlock is computed and this is then joined to the calendar table. I would make it a persisted computed column and then create an index on it.
Sadly, you can't persist a computed column based on a function (even if deterministic) that performs data access.
October 9, 2010 at 8:08 pm
For anyone else that would like to look at this and see why T-SQL scalar functions that do data access are such a pig, here's a cut-down repro script:
USE Sandpit;
GO
DROP TABLE
dbo.ClassBlock,
cdDate;
GO
DROP FUNCTION
dbo.GetBeginDate,
dbo.GetBusinessDayDate;
GO
DROP TABLE
dbo.skClass;
GO
CREATE FUNCTION
dbo.GetBusinessDayDate
(
@BeginDate DATE,
@BusinessDays INTEGER
)
RETURNS DATE
AS
BEGIN
-- Not the real calculation, but enough for the demo
RETURN DATEADD(DAY, @BusinessDays, @BeginDate);
END;
GO
CREATE TABLE
dbo.skClass
(
id BIGINT PRIMARY KEY,
begin_date DATE NOT NULL,
);
GO
CREATE FUNCTION
dbo.GetBeginDate
(
@class_id BIGINT,
@day_number INTEGER
)
RETURNS DATE
AS
BEGIN
DECLARE @Date DATE;
SELECT @Date = begin_date
FROM dbo.skClass
WHERE id = @Class_id;
RETURN dbo.GetBusinessDayDate(@Date, @Day_Number);
END
GO
CREATE TABLE
dbo.ClassBlock
(
id BIGINT IDENTITY PRIMARY KEY,
class_id BIGINT NOT NULL,
day_number INTEGER NOT NULL,
begin_date AS dbo.GetBeginDate(class_id, day_number),
);
GO
CREATE TABLE
dbo.cdDate
(
id BIGINT IDENTITY NOT NULL,
base_date DATETIME NOT NULL PRIMARY KEY,
day_of_week TINYINT NOT NULL,
day_of_month TINYINT NOT NULL,
);
GO
--
-- Add some data to cdDate
--
WITH Dates (the_date)
AS (
SELECT TOP (1000)
DATEADD(
DAY,
ROW_NUMBER() OVER (ORDER BY (SELECT 0)),
GETDATE())
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3
)
INSERT dbo.cdDate
(base_date, day_of_week, day_of_month)
SELECT D.the_date,
DATEPART(WEEKDAY, D.the_date),
DATEPART(DAY, D.the_date)
FROM Dates D;
GO
--
-- Add some data to ClassBlock
--
WITH Numbers (n)
AS (
SELECT TOP (100)
ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3
)
INSERT dbo.ClassBlock
(class_id, day_number)
SELECT 3147, N.n
FROM Numbers N;
GO
--
-- One demo row for skClass
--
INSERT dbo.skClass
(id, begin_date)
VALUES (1, GETDATE());
GO
--
-- The problematic query
--
SELECT cb.day_number,
cb.begin_date,
cdd.day_of_week,
cdd.day_of_month
FROM dbo.cdDate cdd
JOIN dbo.ClassBlock cb
ON (cb.begin_date = cdd.base_date)
WHERE cb.class_id = 3147
OPTION (RECOMPILE);
October 9, 2010 at 8:31 pm
Modified version that writes the result of the function call to an ordinary column:
USE Sandpit;
GO
DROP TABLE
dbo.ClassBlock,
cdDate;
GO
DROP FUNCTION
dbo.GetBeginDate,
dbo.GetBusinessDayDate;
GO
DROP TABLE
dbo.skClass;
GO
CREATE FUNCTION
dbo.GetBusinessDayDate
(
@BeginDate DATE,
@BusinessDays INTEGER
)
RETURNS DATE
AS
BEGIN
-- Not the real calculation, but enough for the demo
RETURN DATEADD(DAY, @BusinessDays, @BeginDate);
END;
GO
CREATE TABLE
dbo.skClass
(
id BIGINT PRIMARY KEY,
begin_date DATE NOT NULL,
);
GO
CREATE FUNCTION
dbo.GetBeginDate
(
@class_id BIGINT,
@day_number INTEGER
)
RETURNS DATE
AS
BEGIN
DECLARE @Date DATE;
SELECT @Date = begin_date
FROM dbo.skClass
WHERE id = @Class_id;
RETURN dbo.GetBusinessDayDate(@Date, @Day_Number);
END
GO
CREATE TABLE
dbo.ClassBlock
(
id BIGINT IDENTITY PRIMARY KEY,
class_id BIGINT NOT NULL,
day_number INTEGER NOT NULL,
begin_date DATE NOT NULL,
);
GO
CREATE TABLE
dbo.cdDate
(
id BIGINT IDENTITY NOT NULL,
base_date DATE NOT NULL PRIMARY KEY,
day_of_week TINYINT NOT NULL,
day_of_month TINYINT NOT NULL,
);
GO
--
-- One demo row for skClass
--
INSERT dbo.skClass
(id, begin_date)
VALUES (3147, GETDATE());
GO
--
-- Add some data to cdDate
--
WITH Dates (the_date)
AS (
SELECT TOP (1000)
DATEADD(
DAY,
ROW_NUMBER() OVER (ORDER BY (SELECT 0)),
GETDATE())
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3
)
INSERT dbo.cdDate
(base_date, day_of_week, day_of_month)
SELECT D.the_date,
DATEPART(WEEKDAY, D.the_date),
DATEPART(DAY, D.the_date)
FROM Dates D;
GO
--
-- Add some data to ClassBlock
--
WITH Numbers (n)
AS (
SELECT TOP (100)
ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3
)
INSERT dbo.ClassBlock
(class_id, day_number, begin_date)
SELECT 3147,
N.n,
dbo.GetBeginDate(3147, N.n)
FROM Numbers N;
GO
--
-- The problematic query
--
SELECT cb.day_number,
cb.begin_date,
cdd.day_of_week,
cdd.day_of_month
FROM dbo.cdDate cdd
JOIN dbo.ClassBlock cb
ON (cb.begin_date = cdd.base_date)
WHERE cb.class_id = 3147
OPTION (RECOMPILE);
October 11, 2010 at 8:40 am
Paul, Thanks for the example that recreates the problem in miniture and with my own tables. Quite helpful in seeing that the problem lies with the function and not something I was overlooking else where. I really appreciate you taking the time to throw that together.
I guess there is no way to convince the engine to perform the function call last, only for the result set (probably some mathmatics over my head the dictate it doing it the way it does).
You last example definately shows the speed improvement, though of course it wouldn't work in my case as the class block begin date has to stay dynamic based on the parent tables begin date so has to be calculated each call as I don't know when the class begin date might change. Though I suppose I could manage that through a trigger that then updates the child table appropriately. I suppose that might be more efficient.
Anyone have any thoughts from a design perspective on what would be the better solution (Trigger or function)? I know the drawbacks of the function (performance) but are there any hidden gotcha's from the trigger?
Thanks
October 11, 2010 at 8:17 pm
-=JLK=- (10/11/2010)
Paul, Thanks for the example that recreates the problem in miniture and with my own tables. Quite helpful in seeing that the problem lies with the function and not something I was overlooking else where. I really appreciate you taking the time to throw that together.
No worries.
-=JLK=- (10/11/2010)
I guess there is no way to convince the engine to perform the function call last, only for the result set (probably some mathmatics over my head the dictate it doing it the way it does).
There is a way, using the rig with begin_date as a computed column. It's based on the same idea as using a temporary table:
WITH PreCalc
AS (
SELECT TOP (9223372036854775807)
*
FROM dbo.ClassBlock CB
WHERE CB.class_id = 3147
ORDER BY
CB.begin_date
)
SELECT PC.day_number,
PC.begin_date,
CDD.day_of_week,
CDD.day_of_month
FROM dbo.cdDate CDD
JOIN PreCalc PC
ON (PC.begin_date = CDD.base_date)
OPTION (RECOMPILE);
I'm including that code for interest's sake more than anything. It uses a trick called 'intermediate materialization' - you can find more details in this blog entry from Adam Machanic: http://sqlblog.com/blogs/adam_machanic/archive/2006/10/03/exploring-the-secrets-of-intermediate-materialization.aspx
The reason you get such a poor plan with the function is interesting. Essentially it is a limitation of the optimizer, since in this particular case, with this particular function, it is quite possible to reorder things to produce a fast plan. The key, as you have realized, is to compute the current value for begin_date from the function, before performing the join. The optimizer is very conservative when it comes to dealing with non-deterministic functions (yours is non-deterministic since it does data access). Rather than evaluate the content of the function (which might have changed at query run time) to see if it is safe to reorder in the plan or not, the optimizer treats it as a 'black box' and does not reorder its position in the query plan. This is a safe approach, but does have the potential to produce deeply sub-optimal plans 🙂
The deeper reason is that your function might (and does!) call another function, which might call another function, and so on and so on. Each of these calls would have to be checked at run time to see if the definition had changed, and if so, fully evaluate all possible code paths to see if it produces a deterministic result. Even if this were possible, the overhead of such a recursive evaluation on every call to the function would likely be unacceptable. For all these reasons and more, I generally strongly encourage people to avoid T-SQL scalar functions, and scalar functions that perform data access in particular.
If you must use T-SQL scalar functions, consider adding the WITH SCHEMABINDING clause to the function definition. With this in place, SQL Server will evaluate the function for determinism (assuming no data access!) which can produce much better plans in many scenarios (a function that is known to produce deterministic results can be reordered in the query plan, giving the optimizer more options). See http://blogs.msdn.com/b/sqlprogrammability/archive/2006/05/12/596424.aspx for a fuller discussion of the benefits of schemabinding UDFs.
-=JLK=- (10/11/2010)
You last example definately shows the speed improvement, though of course it wouldn't work in my case as the class block begin date has to stay dynamic based on the parent tables begin date so has to be calculated each call as I don't know when the class begin date might change. Though I suppose I could manage that through a trigger that then updates the child table appropriately. I suppose that might be more efficient.Anyone have any thoughts from a design perspective on what would be the better solution (Trigger or function)? I know the drawbacks of the function (performance) but are there any hidden gotcha's from the trigger?
I would definitely use a (correctly written) trigger.
October 12, 2010 at 1:22 pm
Paul,
That's what I get for going through the forum at 2 AM. Ooops!
Todd Fifield
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply