How does one improve the performance of a "colA > colB" query?

  • Hi all

    I've got a performance problem with a query that uses a comparison between two columns to pull out some data. Rather than bore you with the details of the system, I've created some SQL that mirrors the issue (acknowledgments to Itzik Ben Gan for the numbers CTEs):

    CREATE TABLE #Dates

    (

    Number int NOT NULL PRIMARY KEY,

    Date1 datetime NOT NULL,

    Date2 datetime NOT NULL

    )

    GO

    DECLARE @number_of_numbers INT

    SET @number_of_numbers = 100000;

    ;WITH

    a AS (SELECT 1 AS i UNION ALL SELECT 1),

    b AS (SELECT 1 AS i FROM a AS x, a AS y),

    c AS (SELECT 1 AS i FROM b AS x, b AS y),

    d AS (SELECT 1 AS i FROM c AS x, c AS y),

    e AS (SELECT 1 AS i FROM d AS x, d AS y),

    f AS (SELECT 1 AS i FROM e AS x, e AS y),

    numbers AS

    (

    SELECT TOP(@number_of_numbers)

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number

    FROM f

    )

    INSERT INTO #Dates (Number, Date1, Date2)

    SELECT number, CAST(RAND(number) * 100000 AS Datetime) AS Date1,

    CAST(RAND(number+1) * 100000 AS Datetime) AS Date2

    FROM numbers;

    GO

    -- This is the bit to optimise if possible

    SELECT * FROM #Dates WHERE Date1 > Date2

    GO

    DROP TABLE #Dates

    GO

    On my box, the select statement only returns two rows, which is fairly representative of the real system. The real table also has about 100000 rows.

    Try as I might with adding other indexes I can't get the select to do anything other than index scans. This does make sense, but doesn't help with the performance.

    So - what does everyone else do when faced with such a query? Are there any indexing possibilities that I've missed, or is it time to look at other options, for example an indexed computed column?

    Any help appreciated.

    Duncan

    Edit: Forgot to add - it's a SQL 2005 box, if that makes any difference.

  • a query with Column1 > Column2 best plan is going to be a index scan, right? it has to use the index to get all values greater than to match the query;

    if you were after a single value, you might get an index seek , but i'm pretty sure that the best you can do is an index scan.

    now, from there, your query might benefit with some include columns so the index is more efficient, and doesn't have to go to the leaf data to get the remaining columns.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (6/26/2012)


    a query with Column1 > Column2 best plan is going to be a index scan, right? it has to use the index to get all values greater than to match the query;

    if you were after a single value, you might get an index seek , but i'm pretty sure that the best you can do is an index scan.

    now, from there, your query might benefit with some include columns so the index is more efficient, and doesn't have to go to the leaf data to get the remaining columns.

    I'd absolutely agree with this - you will always get an index scan.

    Currently the temp table has a unique clustered index on number which offers an advantage to scanning an index containing date1 and date2.

    If you change the clustered index to date1, date2 then the clustered index (the table) is scanned, which is slightly more expensive.

    Depending on the granularity of the difference between the two dates, you might consider a computed column; the difference between the two dates, in sensible units.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks both. I didn't think there would be any way to improve it much with an index alone, but it's good to get confirmation I'm not missing anything obvious. I did manage to make a smallish (perhaps 20%) improvement using a nonclustered index on the two columns and including all other columns in the query, but that was as much as I could get.

    Thanks again.

  • Duncan Pryde (6/26/2012)


    Thanks both. I didn't think there would be any way to improve it much with an index alone, but it's good to get confirmation I'm not missing anything obvious. I did manage to make a smallish (perhaps 20%) improvement using a nonclustered index on the two columns and including all other columns in the query, but that was as much as I could get.

    Thanks again.

    Did you try a computed column? The performance lift is dramatic:

    CREATE TABLE #Dates

    (

    Number int NOT NULL,

    Date1 datetime NOT NULL,

    Date2 datetime NOT NULL,

    delta AS DATEDIFF(mi,Date1, Date2)

    )

    GO

    DECLARE @number_of_numbers INT

    SET @number_of_numbers = 1000000;

    ;WITH

    a AS (SELECT 1 AS i UNION ALL SELECT 1),

    b AS (SELECT 1 AS i FROM a AS x, a AS y),

    c AS (SELECT 1 AS i FROM b AS x, b AS y),

    d AS (SELECT 1 AS i FROM c AS x, c AS y),

    e AS (SELECT 1 AS i FROM d AS x, d AS y),

    f AS (SELECT 1 AS i FROM e AS x, e AS y)

    INSERT INTO #Dates (Number, Date1, Date2)

    SELECT

    number,

    CAST(RAND(number) * 100000 AS Datetime) AS Date1,

    CAST(RAND(number+1) * 100000 AS Datetime) AS Date2

    FROM (SELECT TOP(@number_of_numbers) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number FROM f) d

    GO

    -- This is the bit to optimise if possible

    CREATE CLUSTERED INDEX [number] on [#Dates] (number)

    CREATE INDEX [ix_delta] on [#Dates] (delta)

    set statistics io,time on

    SELECT delta FROM #Dates

    WHERE delta < 0 --Date1 > Date2

    set statistics io,time off

    --------------------------------------------------

    DROP TABLE #Dates

    GO

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (6/26/2012)


    Did you try a computed column? The performance lift is dramatic:

    I must confess I hadn't had a chance yet! Thanks for the feedback: the difference is, as you say, dramatic. It certainly gives me an option anyway.

    Thanks again.

  • why isn't there just a simple index on both date columns in the order you need them ?

    create index X_yourtable_dates on yourschema.yourtable (date1, date2);

    might be another solution to avoid side effects for 'select *' queries suddenly showing the computed column and still provide a solution that performs well enough.

    I'm just guessing your 3 column table actually holds more columns ...

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • ALZDBA (6/26/2012)


    why isn't there just a simple index on both date columns in the order you need them ?

    create index X_yourtable_dates on yourschema.yourtable (date1, date2);

    might be another solution to avoid side effects for 'select *' queries suddenly showing the computed column and still provide a solution that performs well enough.

    I'm just guessing your 3 column table actually holds more columns ...

    Thanks for your reply. I tried the index you describe before posting here, with the other columns in the query as included columns. The result was a scan of the index - albeit slightly more efficient than scanning the clustered primary key (about 20% better).

    The table does hold many more columns in reality, but this question was more about investigating the general principle.

    You're also right in that the "select *" problem you mention is one reason why I'd have to tread quite carefully if I added the computed column.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply