October 6, 2011 at 2:25 pm
A former developer created a wayward T-SQL and it's causing an extraction forever in a 2 million row table called wt
WT Table simplified
create table wt(a1 int
,clsdt_date datetime
,wrkcpl_date datetime
);
I think it is in these clause that's causing the problem
select a1, clsdt_date,wrkcpl_date
FROM wt
WHERE
(
(wt.clsdt_date is NOT NULL AND wt.wrkcpl_date is NOT NULL
AND wt.clsdt_date = wt.wrkcpl_date
)
OR ( wt.clsdt_date is NOT NULL AND wt.wrkcpl_date is NULL )
OR ( wt.wrkcpl_date is NOT NULL AND wt.clsdt_date is NULL)
OR ( wt.clsdt_date is NULL AND wt.wrkcpl_date is NULL)
)
For some reason the optimizer is confused with the developer's logic.
I can remove the redundant clause and reduce it to
select a1, clsdt_date,wrkcpl_date
FROM wt
WHERE
( clsdt_date = wrkcpl_date
OR clsdt_date is null
OR wrkcpl_date is null
)
But it is still slow..because the clauses cause a table scan.
Is there any other way to simplify the SQL?
Thanks a lot,
Johnny
October 6, 2011 at 2:36 pm
If you have indexes on the columns in the where clause, this might be faster:
select a1, clsdt_date,wrkcpl_date
FROM wt
WHERE
clsdt_date = wrkcpl_date
UNION ALL
select a1, clsdt_date,wrkcpl_date
FROM wt
WHERE
clsdt_date is null
UNION ALL
select a1, clsdt_date,wrkcpl_date
FROM wt
WHERE
wrkcpl_date is null
October 6, 2011 at 3:05 pm
I don't think so, because of the following condition "clsdt_date = wrkcpl_date".
You can't do an index seek on cssdt_date until you know the values for wrkcpl_date. You can't do an index seek on wrkcpl_date until you know the values of clsdt_date. Since you can't do an index seek on either of these fields, you will always be forced to do a scan.
You can try adding a calculated column and indexing the calculated column.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 6, 2011 at 3:11 pm
drew.allen (10/6/2011)
I don't think so, because of the following condition "clsdt_date = wrkcpl_date".You can't do an index seek on cssdt_date until you know the values for wrkcpl_date. You can't do an index seek on wrkcpl_date until you know the values of clsdt_date. Since you can't do an index seek on either of these fields, you will always be forced to do a scan.
You can try adding a calculated column and indexing the calculated column.
Drew
Well, I did say might in my original reply. I didn't make any promises.
October 7, 2011 at 7:35 am
October 7, 2011 at 10:09 am
By this way , you have concluded the coding enhancement part , so what is next is the indexing enhancement part by implementing the below index :
Create nonclustered index wt_index1 on wt
(clsdt_date asc,
wrkcpl_date asc)
include (a1)
If much data entity there , you could apply page compression for that index only where you might find significant performance results
[font="Arial Black"]Performance Guard[/font]
[font="Courier New"]Just ask me to get 0 sec for all queries[/font]
October 7, 2011 at 10:16 am
Performace Guard (Shehap) (10/7/2011)
By this way , you have concluded the coding enhancement part , so what is next is the indexing enhancement part by implementing the below index :Create nonclustered index wt_index1 on wt
(clsdt_date asc,
wrkcpl_date asc)
include (a1)
If much data entity there , you could apply page compression for that index only where you might find significant performance results
That's what I thought, UNTIL I TESTED IT. I've already explained why creating this index will not help improve the performance of this particular query. (It might help with other queries, though.)
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 7, 2011 at 4:04 pm
Since no enough information about how data entity size is there , we could come up with an oustanding wokable solution to perform it perfectly as below :
1-Breaking down into subqueries as below :
select a1, clsdt_date,wrkcpl_date
FROM wt WHERE clsdt_date = wrkcpl_date
union
select a1, clsdt_date,wrkcpl_date
FROM wt where clsdt_date is null
UNION
select a1, clsdt_date,wrkcpl_date
FROM wt WHERE wrkcpl_date is null
2-Then create 3 indexes ( 2 filtered index + one index ) :
Create nonclustered index wt_index1 on wt
(clsdt_date asc,
wrkcpl_date asc)include (a1) where clsdt_date is null
Create nonclustered index wt_index2 on wt
(clsdt_date asc,
wrkcpl_date asc)include (a1) where wrkcpl_date is null
Create nonclustered index wt_index3 on wt
(clsdt_date asc,
wrkcpl_date asc)include (a1)
3-Then apply page compression for the 3 indexes to afford sufficiently more data entity size
[font="Arial Black"]Performance Guard[/font]
[font="Courier New"]Just ask me to get 0 sec for all queries[/font]
October 8, 2011 at 6:28 pm
drew.allen (10/6/2011)
I don't think so, because of the following condition "clsdt_date = wrkcpl_date".You can't do an index seek on cssdt_date until you know the values for wrkcpl_date. You can't do an index seek on wrkcpl_date until you know the values of clsdt_date. Since you can't do an index seek on either of these fields, you will always be forced to do a scan.
That's really interesting. I'm going to have to start looking for this at work! However, it seems to me that it should be possible to leverage indexes on those fields by rewriting the query. Let's play around with this a bit! 😀
Okay. First of all, I am going to set this up so that I can do my own testing. (Good suggestion, Drew!)
Test Environment Setup
CREATE DATABASE [Test];
USE [Test];
CREATE TABLE WT
( a1 INT IDENTITY NOT NULL
,clsdt_date DATETIME NULL
,wrkcpl_date DATETIME NULL
);
Now, we need some test data...
WITH N0 AS
( SELECT 1 AS N
UNION ALL
SELECT 1
) /*Returns 2 Records*/
,N1 AS
( SELECT 1 AS N
FROM N0 AS T1
CROSS JOIN N0 AS T2
) /*Returns 4 Records (2 x 2)*/
,N2 AS
( SELECT 1 AS N
FROM N1 AS T1
CROSS JOIN N1 AS T2
) /*Returns 16 Records (4 x 4)*/
,N3 AS
( SELECT 1 AS N
FROM N2 AS T1
CROSS JOIN N2 AS T2
) /*Returns 256 Records (16 x 16)*/
,N4 AS
( SELECT 1 AS N
FROM N3 AS T1
CROSS JOIN N3 AS T2
) /*Returns 65,536 Records (256 x 256)*/
,Numbers AS
( SELECT ROW_NUMBER() OVER (ORDER BY N) AS Number
FROM N4
) /*Returns an ordered list of all the integers from 1 to 65,536*/
,Dates AS
( SELECT DATEADD(Day, Number - 1, '01/01/2001') AS TheDate
FROM Numbers
WHERE Number <= DATEDIFF(Day, '01/01/2001', '12/31/2009') + 1
UNION
SELECT NULL /*The test case requires at least one NULL value in here*/
/*Returns one record for each day between Jan 1, 2001 and Dec 31, 2009*/
)
INSERT INTO WT
( clsdt_date
,wrkcpl_date
)
SELECT Date1.TheDate AS clsdt_date
,Date2.TheDate AS wkcpl_date
FROM Dates AS Date1
CROSS JOIN Dates AS Date2;
/*Now we have a combination of every date betwwen 2001 and 2009 (nearly 11 million records!)*/
Note: This query is a variation on a function created by Rob Farley of Lobster Pot Solutions. Here's a good article for those who are interested!
Now we are ready to play around with this. I'll start with the improved query that jaqs2001 already created:
select a1, clsdt_date,wrkcpl_date
FROM wt
WHERE
( clsdt_date = wrkcpl_date
OR clsdt_date is null
OR wrkcpl_date is null
);
With our test table, this query returns 9,862 records, and the execution plan shows that 96% of the processing time is consumed by table scans.
So, let's now add the index that Performance Guard suggested:
Create nonclustered index wt_index1 on wt
(clsdt_date asc,
wrkcpl_date asc)
include (a1)
CREATE INDEX [IX_Clsdt_Wrkcpl_Date] ON WT(clsdt_date, wrkcpl_date) include (a1);
When we run the original query again, we actually do get a performance improvement of about 30%, but the execution plan still shows that 94% of the processing time is taken up in table scans. So, Drew is indeed correct. This is interesting all by itself.
I tried a number of combinations. I tried adding a separate index for each of the date fields. I then tried adding an include on each of those separate indexes to reference the other date field. No matter what, however, the query optimizer has to do at least one index or table scan, because - as Drew pointed out -
You can't do an index seek on cssdt_date until you know the values for wrkcpl_date. You can't do an index seek on wrkcpl_date until you know the values of clsdt_date. Since you can't do an index seek on either of these fields, you will always be forced to do a scan.
Before we give up, though, let's take a step back.
The original question was "How do I improve the performance of this query?" I created the table, WT, as it was presented in the original question, and I know that this is a simplified version of the production table. I couldn't help noticing, however, that it has no primary key. This is a religious issue for me. i believe that tables without primary keys are an abomination before Codd. (Historical note: The relational database concept was originally developed by E.F. Codd in the early 70's.)
So, let's recreate the table with a primary key, add the data back in, and see what kind of performance we get.
CREATE TABLE WT
( a1 INT IDENTITY PRIMARY KEY
,clsdt_date DATETIME NULL
,wrkcpl_date DATETIME NULL
);
When I run the query again (now against a table with a clustered primary key, but no other indexes). The performance is identical to a table with no primary key. The only difference is that now instead of doing table scans, we are doing clustered index scans.
Let's add Performance Guard's index back in:
CREATE INDEX [IX_Clsdt_Wrkcpl_Date] ON WT(clsdt_date, wrkcpl_date);
Note that we have left off the INCLUDE statement, because the field a1 is now in the clustered index (and is thus at the leaf level of all non-clustered indexes).
When we run the query again, 94% of the processing time is still taken up in clustered index scans, but the overall performance is 10% better than with our original INCLUDES index. This is a 40% improvement over the original query with no indexes.
I then tried adding back in our two separate indexes with the INCLUDE statements:
CREATE INDEX [IX_CLSDT_Date] ON WT(clsdt_date) INCLUDE (wrkcpl_date);
CREATE INDEX [IX_WRKCPL_Date] ON WT(wrkcpl_date) INCLUDE (clsdt_date);
Still, 92% of the processing time is taken up by an index scan, but overall performance has improved by an additional 25%. We are now down to 30% of the original run time. In fact, the run time is now only 45% of the best run time we had when we weren't using a clustered primary key.
So, it is possible to improve the performance - depending on your original configuration in production - but even in a best case scenario, you are only going to reduce the run time to about 30% of the original. I think, however, that there are a couple of good "takeaways" from this exercise:
1. It is really important to test your assertions. I was sure that I could prove Drew wrong, but he had tested his ideas and I had not.
2. Everything works better with a clustered primary key. I know some people will disagree (violently) with this assertion, but I will still hold by it. I think this is true even if your tables are not a part of a transactional database. At work, I take care of a complex data warehouse database environment. Most of the tables don't have primary keys, and it really does make a difference in a lot of ways. At bottom, a primary key is intended to uniquely identify a record. Presumably every record in every table of your database is unique in some way. So, why not let the optimizer in on the secret? You'll be glad you did! 🙂
October 8, 2011 at 9:10 pm
Maybe add a computed column for the datediff between the dates and index that?
no scans then...
October 8, 2011 at 9:42 pm
I certainly could be wrong (seen stranger stuff) but your tests may have a flaw, dmoutray... you're not clearing cache between tests and that may make it look like your additional indexes are actually contributing to performance. If they don't show up in the execution plan, it's a fair bet they're not contributing to performance. They can, however, slow down a query even if they don't show up in the execution plan because it's extra work for the cost based optimizer to see if it can use them.
While you're testing, dmoutray, please try something for me so that we get all the tests in on the same machine...
Try the following absolutely with no indexes on the table you created...
DBCC FREEPROCCACHE; --DON'T DO EITHER OF THESE DBCC COMMANDS ON A PRODUCTION BOX!!!!
DBCC DROPCLEANBUFFERS;
SELECT a1, clsdt_date, wrkcpl_date
FROM wt
WHERE ISNULL(DATEDIFF(dd,clsdt_date,wrkcpl_date),0) = 0
The real problem isn't in trying to index something that will always suffer a full table scan in one form or another... when a table scan will happen no matter what, the real problem is cutting down on the cost of the code... 3 conditions separated by ORs is much more expensive than one simple formula in an equation. Yes, the formula isn't SARGable... but since you're looking at the whole table anyway, it just doesn't matter.
[font="Arial Black"]{Edit} Heh... Yep... I was wrong... maybe not about the testing but about the indexes themselves. I learned something new based on the testing below... I'm apparently not as smart about indexing as I once thought I was. I need to do the same testing that dmoutray did below because Shehap and Lynn came up with some things that I didn't expect.[/font]
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2011 at 10:11 pm
I like the isnull datediff. Very elegant.
two questions:
Would that work in a filtered index?
"SARGable"?
Thanks.
October 8, 2011 at 10:13 pm
One other thing (because I see some good folks who GET the idea that "one test is worth a thousand expert opinions" and I want to make life easier for both of them), try the following to build your test table. I believe you'll be pleasantly surprised especially since the code is so comparatively easy to type and is fairly short...
USE TempDB
;
WITH
cteBuildDates AS
(
SELECT TOP (DATEDIFF(dd,'2001','2010'))
SomeDate = DATEADD(dd,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1,'2001')
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
)
SELECT A1 = IDENTITY(INT,1,1), --Makes a NOT NULL column
clsdt_date = bd1.SomeDate,
wrkcpl_date = bd2.SomeDate
INTO dbo.Wt
FROM cteBuildDates bd1
CROSS JOIN cteBuildDates bd2
;
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2011 at 10:16 pm
SpringTownDBA (10/8/2011)
I like the isnull datediff. Very elegant.two questions:
Would that work in a filtered index?
"SARGable"?
Thanks.
I don't know if it would work in a filtered index because, oddly enough, I never use them.
"SARGable" comes from the term "Search ARGument" and has come to mean "will allow an index seek to occur if the correct index is available".
Thanks for the kudo on the ISNULL/DATEDIFF thing but... keep in mind that it ISN'T SARGable and should only be used in places where a table scan (same as a clustered index scan) is already guaranteed.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2011 at 10:27 pm
We've only had 2008 boxes at work for a couple of weeks now and haven't explored all of the new joys of SQL Server 2008... that includes filtered indexes. However, after reading about them in 2008 BOL, it seems that the ISNULL/DATEDIFF trick should work just fine for a filtered index.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 27 total)
You must be logged in to reply to this topic. Login to reply