SQL Improvement

  • A former developer created a wayward T-SQL and it's causing an extraction forever in a 2 million row table called wt

    WT Table simplified

    create table wt(a1 int

    ,clsdt_date datetime

    ,wrkcpl_date datetime

    );

    I think it is in these clause that's causing the problem

    select a1, clsdt_date,wrkcpl_date

    FROM wt

    WHERE

    (

    (wt.clsdt_date is NOT NULL AND wt.wrkcpl_date is NOT NULL

    AND wt.clsdt_date = wt.wrkcpl_date

    )

    OR ( wt.clsdt_date is NOT NULL AND wt.wrkcpl_date is NULL )

    OR ( wt.wrkcpl_date is NOT NULL AND wt.clsdt_date is NULL)

    OR ( wt.clsdt_date is NULL AND wt.wrkcpl_date is NULL)

    )

    For some reason the optimizer is confused with the developer's logic.

    I can remove the redundant clause and reduce it to

    select a1, clsdt_date,wrkcpl_date

    FROM wt

    WHERE

    ( clsdt_date = wrkcpl_date

    OR clsdt_date is null

    OR wrkcpl_date is null

    )

    But it is still slow..because the clauses cause a table scan.

    Is there any other way to simplify the SQL?

    Thanks a lot,

    Johnny

  • If you have indexes on the columns in the where clause, this might be faster:

    select a1, clsdt_date,wrkcpl_date

    FROM wt

    WHERE

    clsdt_date = wrkcpl_date

    UNION ALL

    select a1, clsdt_date,wrkcpl_date

    FROM wt

    WHERE

    clsdt_date is null

    UNION ALL

    select a1, clsdt_date,wrkcpl_date

    FROM wt

    WHERE

    wrkcpl_date is null

  • I don't think so, because of the following condition "clsdt_date = wrkcpl_date".

    You can't do an index seek on cssdt_date until you know the values for wrkcpl_date. You can't do an index seek on wrkcpl_date until you know the values of clsdt_date. Since you can't do an index seek on either of these fields, you will always be forced to do a scan.

    You can try adding a calculated column and indexing the calculated column.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (10/6/2011)


    I don't think so, because of the following condition "clsdt_date = wrkcpl_date".

    You can't do an index seek on cssdt_date until you know the values for wrkcpl_date. You can't do an index seek on wrkcpl_date until you know the values of clsdt_date. Since you can't do an index seek on either of these fields, you will always be forced to do a scan.

    You can try adding a calculated column and indexing the calculated column.

    Drew

    Well, I did say might in my original reply. I didn't make any promises.

  • Take a look at this article on how to post performace problems[/url].

    If you post all the information required, I'll be glad to help.

    -- Gianluca Sartori

  • By this way , you have concluded the coding enhancement part , so what is next is the indexing enhancement part by implementing the below index :

    Create nonclustered index wt_index1 on wt

    (clsdt_date asc,

    wrkcpl_date asc)

    include (a1)

    If much data entity there , you could apply page compression for that index only where you might find significant performance results

    [font="Arial Black"]Performance Guard[/font]

    [font="Courier New"]Just ask me to get 0 sec for all queries[/font]

  • Performace Guard (Shehap) (10/7/2011)


    By this way , you have concluded the coding enhancement part , so what is next is the indexing enhancement part by implementing the below index :

    Create nonclustered index wt_index1 on wt

    (clsdt_date asc,

    wrkcpl_date asc)

    include (a1)

    If much data entity there , you could apply page compression for that index only where you might find significant performance results

    That's what I thought, UNTIL I TESTED IT. I've already explained why creating this index will not help improve the performance of this particular query. (It might help with other queries, though.)

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Since no enough information about how data entity size is there , we could come up with an oustanding wokable solution to perform it perfectly as below :

    1-Breaking down into subqueries as below :

    select a1, clsdt_date,wrkcpl_date

    FROM wt WHERE clsdt_date = wrkcpl_date

    union

    select a1, clsdt_date,wrkcpl_date

    FROM wt where clsdt_date is null

    UNION

    select a1, clsdt_date,wrkcpl_date

    FROM wt WHERE wrkcpl_date is null

    2-Then create 3 indexes ( 2 filtered index + one index ) :

    Create nonclustered index wt_index1 on wt

    (clsdt_date asc,

    wrkcpl_date asc)include (a1) where clsdt_date is null

    Create nonclustered index wt_index2 on wt

    (clsdt_date asc,

    wrkcpl_date asc)include (a1) where wrkcpl_date is null

    Create nonclustered index wt_index3 on wt

    (clsdt_date asc,

    wrkcpl_date asc)include (a1)

    3-Then apply page compression for the 3 indexes to afford sufficiently more data entity size

    [font="Arial Black"]Performance Guard[/font]

    [font="Courier New"]Just ask me to get 0 sec for all queries[/font]

  • drew.allen (10/6/2011)


    I don't think so, because of the following condition "clsdt_date = wrkcpl_date".

    You can't do an index seek on cssdt_date until you know the values for wrkcpl_date. You can't do an index seek on wrkcpl_date until you know the values of clsdt_date. Since you can't do an index seek on either of these fields, you will always be forced to do a scan.

    That's really interesting. I'm going to have to start looking for this at work! However, it seems to me that it should be possible to leverage indexes on those fields by rewriting the query. Let's play around with this a bit! 😀

    Okay. First of all, I am going to set this up so that I can do my own testing. (Good suggestion, Drew!)

    Test Environment Setup

    CREATE DATABASE [Test];

    USE [Test];

    CREATE TABLE WT

    ( a1 INT IDENTITY NOT NULL

    ,clsdt_date DATETIME NULL

    ,wrkcpl_date DATETIME NULL

    );

    Now, we need some test data...

    WITH N0 AS

    ( SELECT 1 AS N

    UNION ALL

    SELECT 1

    ) /*Returns 2 Records*/

    ,N1 AS

    ( SELECT 1 AS N

    FROM N0 AS T1

    CROSS JOIN N0 AS T2

    ) /*Returns 4 Records (2 x 2)*/

    ,N2 AS

    ( SELECT 1 AS N

    FROM N1 AS T1

    CROSS JOIN N1 AS T2

    ) /*Returns 16 Records (4 x 4)*/

    ,N3 AS

    ( SELECT 1 AS N

    FROM N2 AS T1

    CROSS JOIN N2 AS T2

    ) /*Returns 256 Records (16 x 16)*/

    ,N4 AS

    ( SELECT 1 AS N

    FROM N3 AS T1

    CROSS JOIN N3 AS T2

    ) /*Returns 65,536 Records (256 x 256)*/

    ,Numbers AS

    ( SELECT ROW_NUMBER() OVER (ORDER BY N) AS Number

    FROM N4

    ) /*Returns an ordered list of all the integers from 1 to 65,536*/

    ,Dates AS

    ( SELECT DATEADD(Day, Number - 1, '01/01/2001') AS TheDate

    FROM Numbers

    WHERE Number <= DATEDIFF(Day, '01/01/2001', '12/31/2009') + 1

    UNION

    SELECT NULL /*The test case requires at least one NULL value in here*/

    /*Returns one record for each day between Jan 1, 2001 and Dec 31, 2009*/

    )

    INSERT INTO WT

    ( clsdt_date

    ,wrkcpl_date

    )

    SELECT Date1.TheDate AS clsdt_date

    ,Date2.TheDate AS wkcpl_date

    FROM Dates AS Date1

    CROSS JOIN Dates AS Date2;

    /*Now we have a combination of every date betwwen 2001 and 2009 (nearly 11 million records!)*/

    Note: This query is a variation on a function created by Rob Farley of Lobster Pot Solutions. Here's a good article for those who are interested!

    http://stackoverflow.com/questions/1378593/get-a-list-of-dates-between-two-dates-using-a-function/1378788#1378788

    Now we are ready to play around with this. I'll start with the improved query that jaqs2001 already created:

    select a1, clsdt_date,wrkcpl_date

    FROM wt

    WHERE

    ( clsdt_date = wrkcpl_date

    OR clsdt_date is null

    OR wrkcpl_date is null

    );

    With our test table, this query returns 9,862 records, and the execution plan shows that 96% of the processing time is consumed by table scans.

    So, let's now add the index that Performance Guard suggested:

    Create nonclustered index wt_index1 on wt

    (clsdt_date asc,

    wrkcpl_date asc)

    include (a1)

    CREATE INDEX [IX_Clsdt_Wrkcpl_Date] ON WT(clsdt_date, wrkcpl_date) include (a1);

    When we run the original query again, we actually do get a performance improvement of about 30%, but the execution plan still shows that 94% of the processing time is taken up in table scans. So, Drew is indeed correct. This is interesting all by itself.

    I tried a number of combinations. I tried adding a separate index for each of the date fields. I then tried adding an include on each of those separate indexes to reference the other date field. No matter what, however, the query optimizer has to do at least one index or table scan, because - as Drew pointed out -

    You can't do an index seek on cssdt_date until you know the values for wrkcpl_date. You can't do an index seek on wrkcpl_date until you know the values of clsdt_date. Since you can't do an index seek on either of these fields, you will always be forced to do a scan.

    Before we give up, though, let's take a step back.

    The original question was "How do I improve the performance of this query?" I created the table, WT, as it was presented in the original question, and I know that this is a simplified version of the production table. I couldn't help noticing, however, that it has no primary key. This is a religious issue for me. i believe that tables without primary keys are an abomination before Codd. (Historical note: The relational database concept was originally developed by E.F. Codd in the early 70's.)

    So, let's recreate the table with a primary key, add the data back in, and see what kind of performance we get.

    CREATE TABLE WT

    ( a1 INT IDENTITY PRIMARY KEY

    ,clsdt_date DATETIME NULL

    ,wrkcpl_date DATETIME NULL

    );

    When I run the query again (now against a table with a clustered primary key, but no other indexes). The performance is identical to a table with no primary key. The only difference is that now instead of doing table scans, we are doing clustered index scans.

    Let's add Performance Guard's index back in:

    CREATE INDEX [IX_Clsdt_Wrkcpl_Date] ON WT(clsdt_date, wrkcpl_date);

    Note that we have left off the INCLUDE statement, because the field a1 is now in the clustered index (and is thus at the leaf level of all non-clustered indexes).

    When we run the query again, 94% of the processing time is still taken up in clustered index scans, but the overall performance is 10% better than with our original INCLUDES index. This is a 40% improvement over the original query with no indexes.

    I then tried adding back in our two separate indexes with the INCLUDE statements:

    CREATE INDEX [IX_CLSDT_Date] ON WT(clsdt_date) INCLUDE (wrkcpl_date);

    CREATE INDEX [IX_WRKCPL_Date] ON WT(wrkcpl_date) INCLUDE (clsdt_date);

    Still, 92% of the processing time is taken up by an index scan, but overall performance has improved by an additional 25%. We are now down to 30% of the original run time. In fact, the run time is now only 45% of the best run time we had when we weren't using a clustered primary key.

    So, it is possible to improve the performance - depending on your original configuration in production - but even in a best case scenario, you are only going to reduce the run time to about 30% of the original. I think, however, that there are a couple of good "takeaways" from this exercise:

    1. It is really important to test your assertions. I was sure that I could prove Drew wrong, but he had tested his ideas and I had not.

    2. Everything works better with a clustered primary key. I know some people will disagree (violently) with this assertion, but I will still hold by it. I think this is true even if your tables are not a part of a transactional database. At work, I take care of a complex data warehouse database environment. Most of the tables don't have primary keys, and it really does make a difference in a lot of ways. At bottom, a primary key is intended to uniquely identify a record. Presumably every record in every table of your database is unique in some way. So, why not let the optimizer in on the secret? You'll be glad you did! 🙂

  • Maybe add a computed column for the datediff between the dates and index that?

    no scans then...

  • I certainly could be wrong (seen stranger stuff) but your tests may have a flaw, dmoutray... you're not clearing cache between tests and that may make it look like your additional indexes are actually contributing to performance. If they don't show up in the execution plan, it's a fair bet they're not contributing to performance. They can, however, slow down a query even if they don't show up in the execution plan because it's extra work for the cost based optimizer to see if it can use them.

    While you're testing, dmoutray, please try something for me so that we get all the tests in on the same machine...

    Try the following absolutely with no indexes on the table you created...

    DBCC FREEPROCCACHE; --DON'T DO EITHER OF THESE DBCC COMMANDS ON A PRODUCTION BOX!!!!

    DBCC DROPCLEANBUFFERS;

    SELECT a1, clsdt_date, wrkcpl_date

    FROM wt

    WHERE ISNULL(DATEDIFF(dd,clsdt_date,wrkcpl_date),0) = 0

    The real problem isn't in trying to index something that will always suffer a full table scan in one form or another... when a table scan will happen no matter what, the real problem is cutting down on the cost of the code... 3 conditions separated by ORs is much more expensive than one simple formula in an equation. Yes, the formula isn't SARGable... but since you're looking at the whole table anyway, it just doesn't matter.

    [font="Arial Black"]{Edit} Heh... Yep... I was wrong... maybe not about the testing but about the indexes themselves. I learned something new based on the testing below... I'm apparently not as smart about indexing as I once thought I was. I need to do the same testing that dmoutray did below because Shehap and Lynn came up with some things that I didn't expect.[/font]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I like the isnull datediff. Very elegant.

    two questions:

    Would that work in a filtered index?

    "SARGable"?

    Thanks.

  • One other thing (because I see some good folks who GET the idea that "one test is worth a thousand expert opinions" and I want to make life easier for both of them), try the following to build your test table. I believe you'll be pleasantly surprised especially since the code is so comparatively easy to type and is fairly short...

    USE TempDB

    ;

    WITH

    cteBuildDates AS

    (

    SELECT TOP (DATEDIFF(dd,'2001','2010'))

    SomeDate = DATEADD(dd,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1,'2001')

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    )

    SELECT A1 = IDENTITY(INT,1,1), --Makes a NOT NULL column

    clsdt_date = bd1.SomeDate,

    wrkcpl_date = bd2.SomeDate

    INTO dbo.Wt

    FROM cteBuildDates bd1

    CROSS JOIN cteBuildDates bd2

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SpringTownDBA (10/8/2011)


    I like the isnull datediff. Very elegant.

    two questions:

    Would that work in a filtered index?

    "SARGable"?

    Thanks.

    I don't know if it would work in a filtered index because, oddly enough, I never use them.

    "SARGable" comes from the term "Search ARGument" and has come to mean "will allow an index seek to occur if the correct index is available".

    Thanks for the kudo on the ISNULL/DATEDIFF thing but... keep in mind that it ISN'T SARGable and should only be used in places where a table scan (same as a clustered index scan) is already guaranteed.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • We've only had 2008 boxes at work for a couple of weeks now and haven't explored all of the new joys of SQL Server 2008... that includes filtered indexes. However, after reading about them in 2008 BOL, it seems that the ISNULL/DATEDIFF trick should work just fine for a filtered index.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply