Improve Between performance

  • Excellent! I'm now getting a duration of 1 ms and with only 3 reads!!

    Could you explain what's going on here to make it so much faster?

    Thanks!

  • cgreathouse (4/2/2010)


    Excellent! I'm now getting a duration of 1 ms and with only 3 reads!!

    Could you explain what's going on here to make it so much faster?

    Thanks!

    Which script?

    The original or the one that Lutz provided?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • this one

    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279

  • It uses better search arguments and also limits the result set. The better search argument allows better index use. If you include your graphical execution plan - you will probably see that the query is using an index now.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events
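[Editor's note] The effect described above can be illustrated outside SQL Server. Below is a minimal Python sketch (hypothetical data, not the poster's table) of why the extra lower bound on LOW matters: with only LOW <= q seekable, every row up to q is a candidate that must be checked against HIGH >= q; the added LOW > q - 1000 bound (valid only while HIGH - LOW stays under 1000) shrinks the candidate slice to a handful of rows.

```python
import bisect
import random

# Hypothetical data: 100k rows with random LOW values and
# HIGH = LOW + a span under 1000, kept sorted by LOW to mimic
# an index on LOW.
random.seed(1)
lows = sorted(random.randrange(0, 100_000_000) for _ in range(100_000))
rows = [(lo, lo + random.randrange(1, 1000)) for lo in lows]

q = 50_000_000  # the value being searched for

# With only LOW <= q as a seekable predicate, every row up to q is a
# candidate that must be examined against HIGH >= q.
unbounded = bisect.bisect_right(lows, q)

# Adding LOW > q - 1000 (safe here because HIGH - LOW < 1000 by
# construction) narrows the seek to a tiny slice of the index.
start = bisect.bisect_right(lows, q - 1000)
bounded = unbounded - start

hits = [(lo, hi) for lo, hi in rows[start:unbounded] if hi >= q]
print(f"candidates without bound: {unbounded}, with bound: {bounded}, hits: {len(hits)}")
```

The slice `[start:unbounded]` is the analogue of the narrow index range seek the optimizer can do once both bounds on LOW are present.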

  • cgreathouse (4/2/2010)


    Excellent! I'm now getting a duration of 1 ms and with only 3 reads!!

    Could you explain what's going on here to make it so much faster?

    Thanks!

    ....

    this one

    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279

    It is important to verify whether the range value I used (based on a guess) will always be larger than the HIGH-LOW difference you actually have and expect in the future. I don't know how much those differences and the actual range can vary, though... If required, you might need to store that range value as a parameter in a separate table and maintain it with an AFTER INSERT/UPDATE/DELETE trigger.

    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • Very nice, Lutz. I always have trouble with this one myself.

    -- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • Well, this was working very well for a while (almost 2 months!). However I hit the potential problem that was mentioned earlier with needing to adjust the range.

    Just a quick recap on the query...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279

    The -1000 works the majority of the time. However, I hit this one just recently (using a value of 986879999)...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 986879999
    AND LOW > 986879999 - 1000
    AND HIGH >= 986879999

    This wasn't returning any results. It turns out I had to change the 1000 to 1130001 to get results returned. So now querying for the value 986879999 works and is still fast (~3 reads and 1 ms). However, if I go back and query for 982827279, the performance is horrible (~45,000 reads and ~2 seconds).

    Any ideas on how to get better performance that works in both cases? As I mentioned before this table has a decent number of rows (~11 million) and for the most part is read only (updated once every 3 months).

    Thanks!

  • cgreathouse (5/30/2010)


    Well, this was working very well for a while (almost 2 months!). However I hit the potential problem that was mentioned earlier with needing to adjust the range.

    Just a quick recap on the query...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279

    The -1000 works the majority of the time. However, I hit this one just recently (using a value of 986879999)...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 986879999
    AND LOW > 986879999 - 1000
    AND HIGH >= 986879999

    This wasn't returning any results. It turns out I had to change the 1000 to 1130001 to get results returned. So now querying for the value 986879999 works and is still fast (~3 reads and 1 ms). However, if I go back and query for 982827279, the performance is horrible (~45,000 reads and ~2 seconds).

    Any ideas on how to get better performance that works in both cases? As I mentioned before this table has a decent number of rows (~11 million) and for the most part is read only (updated once every 3 months).

    Thanks!

    How many rows are returned for each of these?

    Have you tried OPTION (RECOMPILE)?

    I would also set MAXDOP 1 as an option. Queries such as this should not be parallelized, I think (assuming the new query is parallelized).

    Did you update all statistics with a full scan on the table? I would do this after each load period.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Only 1 or 2 rows are going to be returned. In the actual query there is also an additional column I'm filtering on. Here's the query I originally started with...

    SELECT *
    FROM EffectiveItems
    WHERE 982827279 between LOW and HIGH
    and 20100505 between [BEGIN] and [END]

    Looking at the execution plan I was getting a clustered index seek, but it was still taking a long time to complete.

    After posting here and trying out some of the suggestions, the query then looked like...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279
    AND 20100505 between [BEGIN] and [END]

    This was working well until I hit a value (i.e. 986879999) where no rows were being returned. That was fixed once I changed "-1000" to "-1130001". However, the performance went downhill.

    I haven't tried OPTION (RECOMPILE) because this isn't in a stored procedure.

    The MAXDOP 1 option didn't do much. It still takes about 64,000 reads and 1.3 secs.

    It seems like there's got to be a way to speed this up. The combination of the columns LOW/HIGH/BEGIN/END is very selective; only 1 or 2 rows are returned. I created a clustered index on the LOW/HIGH/BEGIN/END columns, expecting that to help speed things up because they're very selective when combined. However, it's still taking about 1.5 secs and 65,000 reads. I can also see that the clustered index is being used and it's doing an index seek.

    Any other suggestions?

    Thanks for the help!

  • First I think we should make it clear why this query is so hard for the SQL server.

    The original query looks like this:

    SELECT *
    FROM EffectiveItems
    WHERE 982827279 between LOW and HIGH
    and 20100505 between [BEGIN] and [END]

    This is equivalent to

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 982827279
    and HIGH >= 982827279
    and [BEGIN] <= 20100505
    and [END] >= 20100505

    This is a query on the combination of four separate columns.

    It is impossible to use more than one index for a query like this.

    If you have an index on low, high, begin, end, the SQL server will only use the index to find all the rows with low<=982827279 - it will then scan all those rows to find the rows that satisfy the rest of the conditions.

    So, depending on the value you are searching for the SQL server might have to scan all the rows in the table to find the very few rows that satisfy all the parameters.

    One way to speed up a query like this is to picture this as a geometric problem. Imagine that you plot a point for each item in a diagram using low as the x-coordinate and high as the y-coordinate. All the items will now be represented as a large number of points that are positioned in a fairly narrow band above the line where y=x.

    To find the items where low <= q and high >= q, you will have to look in the region above and to the left of the point (q,q).

    To do this efficiently you can divide the total set of items into a grid of squares. If you have 10 million points you can divide them into about 3000 squares with about 3000 points in each square.

    You can then perform the query by first finding the set of squares that touch the wanted area, and then scan the points in these squares to find the points that actually satisfy the condition.

    This is very efficient because there is such a small number of squares and such a small number of points in each square.

    The following script demonstrates the method:

    -- let us play somewhere safe
    use tempdb

    -- Create some test data
    -- I assume a uniform random distribution of values for low
    -- I assume a cubic distribution of values for high-low
    -- (this gives many rows with a small difference, but some rows with a large difference)
    -- The maximum value of high-low = 1 million
    -- all values are between 0 and 100 million
    if object_id('dbo.items') is not null drop table dbo.items

    select identity(int, 1, 1) as id, low, cast(low + d*d*d as int) as high
    into dbo.items
    from (
        select top 10000000
            ABS(CHECKSUM(newid())) % 99000000 as low,
            (ABS(CHECKSUM(newid())) % 100000) * 0.001 + 1 as d
        from sys.all_columns c1, sys.all_columns c2
    ) t

    -------------------------------------------------------------------
    -- Now preprocess the test data to increase query performance
    -------------------------------------------------------------------

    -- calculate a table of grid squares
    -- the idea here is to create a grid so that the number of used squares
    -- and the maximum number of items in a square are of the same order of magnitude
    -- adjust @d to achieve this balance with your data
    declare @d int = 100000

    if object_id('dbo.grid') is not null drop table dbo.grid

    select low*100000 + high as grid, low*@d as low, high*@d + @d as high, rows
    into dbo.grid
    from (
        select low/@d as low, high/@d as high, count(*) as rows
        from dbo.items
        group by low/@d, high/@d
    ) t

    -- display the resulting number of grid rows and items per grid
    select count(*) as gridrows from dbo.grid
    select top 10 * from dbo.grid order by rows desc

    -- Add a computed column with the grid number to the original table
    -- (note the explicit cast: REPLACE requires a string argument)
    declare @sql as varchar(max)
    set @sql = 'alter table dbo.items add grid as low/@d*100000 + high/@d'
    set @sql = replace(@sql, '@d', cast(@d as varchar(20)))
    exec (@sql)

    -- Add a non-clustered covering index to help query performance
    create index ix1 on dbo.items(grid, low) include(id, high)
    create unique index ix1 on dbo.grid(grid) include(low, high)

    -- Add another covering index to test performance without using the grid column
    create index ix2 on dbo.items(low) include(id, high)
    go

    ----------------------------------------------------------
    -- preprocessing is now finished - test query performance
    ----------------------------------------------------------
    set statistics time on
    set statistics io on

    dbcc dropcleanbuffers

    declare @q as int = 50000000 -- the value to query for

    select id, low, high
    from dbo.items
    where grid in (
        -- find all grid squares that include the desired value
        select grid from dbo.grid
        where low <= @q and high > @q
    )
    and low <= @q and high >= @q
    and (id % 1000) = 0 -- just to avoid getting too big a result set

    --Table 'items'. Scan count 78, logical reads 622, physical reads 3, read-ahead reads 168, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    --Table 'grid'. Scan count 1, logical reads 41, physical reads 2, read-ahead reads 39, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    -- SQL Server Execution Times:
    -- CPU time = 0 ms, elapsed time = 18 ms.
    go

    -- perform a query without using the grid just to compare
    dbcc dropcleanbuffers

    declare @q as int = 50000000 -- the value to query for

    select id, low, high
    from dbo.items
    where low <= @q and high >= @q
    and (id % 1000) = 0 -- just to avoid getting too big a result set

    --Table 'items'. Scan count 3, logical reads 11483, physical reads 165, read-ahead reads 11418, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    -- SQL Server Execution Times:
    -- CPU time = 857 ms, elapsed time = 2646 ms.
    go

    As you can see the performance difference between the grid method and a normal query is quite dramatic.
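[Editor's note] For readers who want the idea without running the T-SQL, here is a minimal Python sketch of the same grid/bucket technique (hypothetical data and bucket width, mirroring the shape of the generated test data above): intervals are bucketed by their (low cell, high cell) pair, and a stabbing query only has to look in buckets whose cell pair can contain the query value.

```python
import random
from collections import defaultdict

random.seed(2)
D = 100_000  # bucket width; plays the role of @d in the T-SQL script

# Hypothetical intervals: (id, low, high), low uniform in [0, 99M),
# span up to 1 million.
items = [(i, lo, lo + random.randrange(1, 1_000_000))
         for i, lo in enumerate(random.randrange(0, 99_000_000)
                                for _ in range(50_000))]

# The "grid": bucket every interval by its (low cell, high cell) pair,
# the analogue of the computed grid column.
grid = defaultdict(list)
for item in items:
    _, lo, hi = item
    grid[(lo // D, hi // D)].append(item)

def stab(q):
    """All intervals with low <= q <= high, examining only buckets
    whose cell pair can contain q (lc <= q // D <= hc)."""
    qc = q // D
    return [item
            for (lc, hc), bucket in grid.items() if lc <= qc <= hc
            for item in bucket if item[1] <= q <= item[2]]

q = 50_000_000
hits = stab(q)
print(f"buckets: {len(grid)}, hits: {len(hits)}")
```

Only a small fraction of the buckets can satisfy `lc <= q//D <= hc`, which is why the SQL version touches so few pages: the IN-subquery against dbo.grid selects those few buckets, and the covering index on (grid, low) seeks into each one.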

  • Just curious: Did you test my last suggestion? How did it work for you?

  • I haven't tried it yet. I'll let you know once I have.

    Thanks for the idea!

  • OK, I gave this a try on my data and it works pretty well. With a table of about 3 million rows, the query executes in about 2 ms. However, this only works when querying on the low & high columns. I also have begin & end columns that need to be referenced. Once I add those into the query, my response times go back to ~500 ms.

    Just to recap, here's what the table being queried looks like...

    CREATE TABLE [dbo].[EffectiveItems](
        [ItemID] [int] NOT NULL, -- foreign key
        [BEGIN] [int] NOT NULL,
        [END] [int] NOT NULL,
        [LOW] [int] NOT NULL,
        [HIGH] [int] NOT NULL,
        [Val1] [int] NOT NULL,
        [Val2] [int] NULL,
        [Val3] [int] NOT NULL,
        [Val4] [int] NULL
    )

    The query I need to run looks something like this...

    select ItemID, Val1, Val2, Val3, Val4
    from EffectiveItems
    where 982827279 between low and high
    and 20090103 between [begin] and [end]

    Any suggestions on how to get this query to use your technique?

    Thanks!

  • This should work just fine:

    -- Add a non-clustered covering index to help query performance
    create index ix1 on dbo.EffectiveItems(grid, low)
    include(ItemID, high, [begin], [end], Val1, Val2, Val3, Val4)

    select ItemID, Val1, Val2, Val3, Val4
    from dbo.EffectiveItems
    where grid in (
        -- find all grid squares that include the desired value
        select grid from dbo.grid
        where low <= 982827279 and high > 982827279
    )
    and 982827279 between low and high
    and 20090103 between [begin] and [end]

    If the above code is still too slow, please post the actual execution plan.

    Also, please post the result of this:

    -- display the resulting number of grid rows and items per grid
    select count(*) as gridrows from dbo.grid
    select top 10 * from dbo.grid order by rows desc

  • Wow, that did it. The query now takes about 1 ms to complete. Nice!!

    Here are the results that you asked for...

    gridrows
    1810

    grid         low          high         rows
    840408404    840400000    840500000    37696
    840908409    840900000    841000000    37366
    0            0            100000       37162
    992009920    992000000    992100000    33111
    841108411    841100000    841200000    31619
    980009800    980000000    980100000    28291
    844008440    844000000    844100000    28048
    840708407    840700000    840800000    27147
    840608406    840600000    840700000    26154
    980509805    980500000    980600000    25176
    Thanks for the help!!

