Improve Between performance

  • Excellent! I'm now getting a duration of 1 ms and with only 3 reads!!

    Could you explain what's going on here to make it so much faster?

    Thanks!

  • cgreathouse (4/2/2010)


    Excellent! I'm now getting a duration of 1 ms and with only 3 reads!!

    Could you explain what's going on here to make it so much faster?

    Thanks!

    Which script?

    The original or the one that Lutz provided?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • this one

    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279

  • It uses better search arguments and also limits the result set. The better search argument allows better index use. If you include your graphical execution plan - you will probably see that the query is using an index now.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events
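[Editor's note] The effect described above can be illustrated outside SQL Server. Below is a minimal Python sketch (hypothetical data, not the poster's table) of why the extra lower bound on LOW matters: with only LOW <= q seekable, every row up to q is a candidate that must be checked against HIGH >= q; the added LOW > q - 1000 bound (valid only while HIGH - LOW stays under 1000) shrinks the candidate slice to a handful of rows.

```python
import bisect
import random

# Hypothetical data: 100k rows with random LOW values and
# HIGH = LOW + a span under 1000, kept sorted by LOW to mimic
# an index on LOW.
random.seed(1)
lows = sorted(random.randrange(0, 100_000_000) for _ in range(100_000))
rows = [(lo, lo + random.randrange(1, 1000)) for lo in lows]

q = 50_000_000  # the value being searched for

# With only LOW <= q as a seekable predicate, every row up to q is a
# candidate that must be examined against HIGH >= q.
unbounded = bisect.bisect_right(lows, q)

# Adding LOW > q - 1000 (safe here because HIGH - LOW < 1000 by
# construction) narrows the seek to a tiny slice of the index.
start = bisect.bisect_right(lows, q - 1000)
bounded = unbounded - start

hits = [(lo, hi) for lo, hi in rows[start:unbounded] if hi >= q]
print(f"candidates without bound: {unbounded}, with bound: {bounded}, hits: {len(hits)}")
```

The slice `[start:unbounded]` is the analogue of the narrow index range seek the optimizer can do once both bounds on LOW are present.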

  • cgreathouse (4/2/2010)


    Excellent! I'm now getting a duration of 1 ms and with only 3 reads!!

    Could you explain what's going on here to make it so much faster?

    Thanks!

    ....

    this one

    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279

    It is important to verify whether the range value I used (based on a guess) will always be larger than the HIGH-LOW difference you actually have and expect in the future. I don't know how much those differences and the actual range can vary, though... If required, you might need to store that range value as a parameter in a separate table and maintain it with an AFTER INSERT/UPDATE/DELETE trigger.

    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • Very nice, Lutz. I always have trouble with this one myself.

    -- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • Well, this was working very well for a while (almost 2 months!). However I hit the potential problem that was mentioned earlier with needing to adjust the range.

    Just a quick recap on the query...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279

    The -1000 works the majority of the time. However, I hit this one just recently (using a value of 986879999)...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 986879999
    AND LOW > 986879999 - 1000
    AND HIGH >= 986879999

    This wasn't returning any results. It turns out I had to change the 1000 to 1130001 to get results returned. So now querying for the value 986879999 works and is still fast (~3 reads and 1 ms). However, if I go back and query for 982827279, the performance is horrible (~45,000 reads and ~2 seconds).

    Any ideas on how to get better performance that works in both cases? As I mentioned before this table has a decent number of rows (~11 million) and for the most part is read only (updated once every 3 months).

    Thanks!

  • cgreathouse (5/30/2010)


    Well, this was working very well for a while (almost 2 months!). However I hit the potential problem that was mentioned earlier with needing to adjust the range.

    Just a quick recap on the query...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279

    The -1000 works the majority of the time. However, I hit this one just recently (using a value of 986879999)...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 986879999
    AND LOW > 986879999 - 1000
    AND HIGH >= 986879999

    This wasn't returning any results. It turns out I had to change the 1000 to 1130001 to get results returned. So now querying for the value 986879999 works and is still fast (~3 reads and 1 ms). However, if I go back and query for 982827279, the performance is horrible (~45,000 reads and ~2 seconds).

    Any ideas on how to get better performance that works in both cases? As I mentioned before this table has a decent number of rows (~11 million) and for the most part is read only (updated once every 3 months).

    Thanks!

    How many rows are returned for each of these?

    Have you tried OPTION (RECOMPILE)?

    I would also set MAXDOP 1 as an option. Queries such as this should not be parallelized, I think (assuming the new query is parallelized).

    Did you update all statistics with a full scan on the table? I would do this after each load period.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Only 1 or 2 rows are going to be returned. In the actual query there is also an additional column I'm filtering on. Here's the query I originally started with...

    SELECT *
    FROM EffectiveItems
    WHERE 982827279 between LOW and HIGH
    and 20100505 between [BEGIN] and [END]

    Looking at the execution plan I was getting a clustered index seek, but it was still taking a long time to complete.

    After posting here and trying out some of the suggestions, the query then looked like...

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 982827279
    AND LOW > 982827279 - 1000
    AND HIGH >= 982827279
    AND 20100505 between [BEGIN] and [END]

    This was working well until I hit a value (i.e. 986879999) where no rows were being returned. That was fixed once I changed "-1000" to "-1130001". However, the performance went downhill.

    I haven't tried OPTION (RECOMPILE) because this isn't in a stored procedure.

    The MAXDOP 1 option didn't do much. It still takes about 64,000 reads and 1.3 secs.

    It seems like there's got to be a way to speed this up. The combination of the columns LOW/HIGH/BEGIN/END is very selective; only 1 or 2 rows are returned. I created a clustered index on the LOW/HIGH/BEGIN/END columns, expecting that to help speed things up because they're very selective when combined. However, it's still taking about 1.5 secs and 65,000 reads. I can also see that the clustered index is being used and it's doing an index seek.

    Any other suggestions?

    Thanks for the help!

  • First I think we should make it clear why this query is so hard for the SQL server.

    The original query looks like this:

    SELECT *
    FROM EffectiveItems
    WHERE 982827279 between LOW and HIGH
    and 20100505 between [BEGIN] and [END]

    This is equivalent to

    SELECT *
    FROM EffectiveItems
    WHERE LOW <= 982827279
    and HIGH >= 982827279
    and [BEGIN] <= 20100505
    and [END] >= 20100505

    This is a query on the combination of four separate columns.

    It is impossible to use more than one index for a query like this.

    If you have an index on low, high, begin, end, the SQL server will only use the index to find all the rows with low<=982827279 - it will then scan all those rows to find the rows that satisfy the rest of the conditions.

    So, depending on the value you are searching for the SQL server might have to scan all the rows in the table to find the very few rows that satisfy all the parameters.

    One way to speed up a query like this is to picture this as a geometric problem. Imagine that you plot a point for each item in a diagram using low as the x-coordinate and high as the y-coordinate. All the items will now be represented as a large number of points that are positioned in a fairly narrow band above the line where y=x.

    To find the items where low <= q and high >= q, you will have to look in the region above and to the left of the point (q,q).

    To do this efficiently you can divide the total set of items into a grid of squares. If you have 10 million points you can divide them into about 3000 squares with about 3000 points in each square.

    You can then perform the query by first finding the set of squares that touch the wanted area, and then scan the points in these squares to find the points that actually satisfy the condition.

    This is very efficient because there is such a small number of squares and such a small number of points in each square.

    The following script demonstrates the method:

    -- let us play somewhere safe
    use tempdb

    -- Create some test data
    -- I assume a uniform random distribution of values for low
    -- I assume a cubic distribution of values for high-low
    -- (this gives many rows with a small difference, but some rows with a large difference)
    -- The maximum value of high-low = 1 million
    -- all values are between 0 and 100 million
    if object_id('dbo.items') is not null drop table dbo.items

    select identity(int, 1, 1) as id, low, cast(low + d*d*d as int) as high
    into dbo.items
    from (
        select top 10000000
            ABS(CHECKSUM(newid())) % 99000000 as low,
            (ABS(CHECKSUM(newid())) % 100000) * 0.001 + 1 as d
        from sys.all_columns c1, sys.all_columns c2
    ) t

    -------------------------------------------------------------------
    -- Now preprocess the test data to increase query performance
    -------------------------------------------------------------------

    -- calculate a table of grid squares
    -- the idea here is to create a grid so that the number of used squares
    -- and the maximum number of items in a square are of the same order of magnitude
    -- adjust @d to achieve this balance with your data
    declare @d int = 100000

    if object_id('dbo.grid') is not null drop table dbo.grid

    select low*100000 + high as grid, low*@d as low, high*@d + @d as high, rows
    into dbo.grid
    from (
        select low/@d as low, high/@d as high, count(*) as rows
        from dbo.items
        group by low/@d, high/@d
    ) t

    -- display the resulting number of grid rows and items per grid
    select count(*) as gridrows from dbo.grid
    select top 10 * from dbo.grid order by rows desc

    -- Add a computed column with the grid number to the original table
    -- (note the explicit cast: REPLACE requires a string argument)
    declare @sql as varchar(max)
    set @sql = 'alter table dbo.items add grid as low/@d*100000 + high/@d'
    set @sql = replace(@sql, '@d', cast(@d as varchar(20)))
    exec (@sql)

    -- Add a non-clustered covering index to help query performance
    create index ix1 on dbo.items(grid, low) include(id, high)
    create unique index ix1 on dbo.grid(grid) include(low, high)

    -- Add another covering index to test performance without using the grid column
    create index ix2 on dbo.items(low) include(id, high)
    go

    ----------------------------------------------------------
    -- preprocessing is now finished - test query performance
    ----------------------------------------------------------
    set statistics time on
    set statistics io on

    dbcc dropcleanbuffers

    declare @q as int = 50000000 -- the value to query for

    select id, low, high
    from dbo.items
    where grid in (
        -- find all grid squares that include the desired value
        select grid from dbo.grid
        where low <= @q and high > @q
    )
    and low <= @q and high >= @q
    and (id % 1000) = 0 -- just to avoid getting too big a result set

    --Table 'items'. Scan count 78, logical reads 622, physical reads 3, read-ahead reads 168, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    --Table 'grid'. Scan count 1, logical reads 41, physical reads 2, read-ahead reads 39, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    -- SQL Server Execution Times:
    -- CPU time = 0 ms, elapsed time = 18 ms.
    go

    -- perform a query without using the grid just to compare
    dbcc dropcleanbuffers

    declare @q as int = 50000000 -- the value to query for

    select id, low, high
    from dbo.items
    where low <= @q and high >= @q
    and (id % 1000) = 0 -- just to avoid getting too big a result set

    --Table 'items'. Scan count 3, logical reads 11483, physical reads 165, read-ahead reads 11418, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    -- SQL Server Execution Times:
    -- CPU time = 857 ms, elapsed time = 2646 ms.
    go

    As you can see the performance difference between the grid method and a normal query is quite dramatic.
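[Editor's note] For readers who want the idea without running the T-SQL, here is a minimal Python sketch of the same grid/bucket technique (hypothetical data and bucket width, mirroring the shape of the generated test data above): intervals are bucketed by their (low cell, high cell) pair, and a stabbing query only has to look in buckets whose cell pair can contain the query value.

```python
import random
from collections import defaultdict

random.seed(2)
D = 100_000  # bucket width; plays the role of @d in the T-SQL script

# Hypothetical intervals: (id, low, high), low uniform in [0, 99M),
# span up to 1 million.
items = [(i, lo, lo + random.randrange(1, 1_000_000))
         for i, lo in enumerate(random.randrange(0, 99_000_000)
                                for _ in range(50_000))]

# The "grid": bucket every interval by its (low cell, high cell) pair,
# the analogue of the computed grid column.
grid = defaultdict(list)
for item in items:
    _, lo, hi = item
    grid[(lo // D, hi // D)].append(item)

def stab(q):
    """All intervals with low <= q <= high, examining only buckets
    whose cell pair can contain q (lc <= q // D <= hc)."""
    qc = q // D
    return [item
            for (lc, hc), bucket in grid.items() if lc <= qc <= hc
            for item in bucket if item[1] <= q <= item[2]]

q = 50_000_000
hits = stab(q)
print(f"buckets: {len(grid)}, hits: {len(hits)}")
```

Only a small fraction of the buckets can satisfy `lc <= q//D <= hc`, which is why the SQL version touches so few pages: the IN-subquery against dbo.grid selects those few buckets, and the covering index on (grid, low) seeks into each one.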

  • Just curious: Did you test my last suggestion? How did it work for you?

  • I haven't tried it yet. I'll let you know once I have.

    Thanks for the idea!

  • OK, I gave this a try on my data and it works pretty well. With a table of about 3 million rows, the query executes in about 2 ms. However, this only works when querying on the low & high columns. I also have begin & end columns that need to be referenced. Once I add those into the query, my response times go back to ~500 ms.

    Just to recap, here's what the table being queried looks like...

    CREATE TABLE [dbo].[EffectiveItems](
        [ItemID] [int] NOT NULL, -- foreign key
        [BEGIN] [int] NOT NULL,
        [END] [int] NOT NULL,
        [LOW] [int] NOT NULL,
        [HIGH] [int] NOT NULL,
        [Val1] [int] NOT NULL,
        [Val2] [int] NULL,
        [Val3] [int] NOT NULL,
        [Val4] [int] NULL
    )

    The query I need to run looks something like this...

    select ItemID, Val1, Val2, Val3, Val4
    from EffectiveItems
    where 982827279 between low and high
    and 20090103 between [begin] and [end]

    Any suggestions on how to get this query to use your technique?

    Thanks!

  • This should work just fine:

    -- Add a non-clustered covering index to help query performance
    create index ix1 on dbo.EffectiveItems(grid, low)
    include(ItemID, high, [begin], [end], Val1, Val2, Val3, Val4)

    select ItemID, Val1, Val2, Val3, Val4
    from dbo.EffectiveItems
    where grid in (
        -- find all grid squares that include the desired value
        select grid from dbo.grid
        where low <= 982827279 and high > 982827279
    )
    and 982827279 between low and high
    and 20090103 between [begin] and [end]

    If the above code is still too slow, please post the actual execution plan.

    Also, please post the result of this:

    -- display the resulting number of grid rows and items per grid
    select count(*) as gridrows from dbo.grid
    select top 10 * from dbo.grid order by rows desc

  • Wow, that did it. The query now takes about 1 ms to complete. Nice!!

    Here are the results that you asked for...

    gridrows
    1810

    grid         low          high         rows
    840408404    840400000    840500000    37696
    840908409    840900000    841000000    37366
    0            0            100000       37162
    992009920    992000000    992100000    33111
    841108411    841100000    841200000    31619
    980009800    980000000    980100000    28291
    844008440    844000000    844100000    28048
    840708407    840700000    840800000    27147
    840608406    840600000    840700000    26154
    980509805    980500000    980600000    25176
    Thanks for the help!!

