I see a lot of advice that talks about the clustered index being the best index for range queries, that is, queries with inequality filters that retrieve ranges of rows, as opposed to singleton queries that retrieve single rows (advice that, unfortunately, includes a TechNet article).
I suspect the reasoning behind this advice is the idea that the clustered index stores the data in order of the clustering key (ack) and hence it’s ‘logical’ that such a structure would be best for range scans as SQL can simply start at the beginning of the range and read sequentially to the end.
Question is, is that really the case?
Let’s do some experiments and find out.
CREATE TABLE TestingRangeQueries (
  ID INT IDENTITY,
  SomeValue NUMERIC(7,2),
  Filler CHAR(500) DEFAULT ''
)

-- 1 million rows
INSERT INTO TestingRangeQueries (SomeValue)
SELECT TOP (1000000) RAND(CAST(a.object_id AS BIGINT) + b.column_id*2511)
FROM msdb.sys.columns a CROSS JOIN msdb.sys.columns b

-- One clustered and two nonclustered indexes on the column that will be used for the range filter
CREATE CLUSTERED INDEX idx_RangeQueries_Cluster
  ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC1
  ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC2
  ON TestingRangeQueries (ID)
  INCLUDE (SomeValue)
GO
The query that I’ll be testing with will do a sum of the SomeValue column for a large range of ID values. That means that of the three indexes that I’m testing, one is clustered, one is a nonclustered that does not cover the query and the third is a covering nonclustered index.
SELECT SUM(SomeValue)
FROM TestingRangeQueries
WHERE ID BETWEEN 20000 AND 200000 -- 180 001 rows, 18% of the table
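As a quick sanity check on the figures in that comment (assuming the table was populated exactly as above, with the IDENTITY starting at 1), the row count and percentage can be verified directly:

SELECT COUNT(*) AS RowsInRange,
       COUNT(*) * 100.0 / 1000000 AS PercentOfTable
FROM TestingRangeQueries
WHERE ID BETWEEN 20000 AND 200000
-- 180001 rows, 18% of the 1 million row table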
I’m going to run the same range scan query three times, each with an index hint so that SQL will use the three different indexes, regardless of which one it thinks is best.
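The exact syntax isn't shown in what follows, but the three forced-index runs would look something like this, with STATISTICS IO and TIME on to produce the numbers shown below each test:

SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Test 1: force the clustered index
SELECT SUM(SomeValue)
FROM TestingRangeQueries WITH (INDEX (idx_RangeQueries_Cluster))
WHERE ID BETWEEN 20000 AND 200000

-- Test 2: force the non-covering nonclustered index
SELECT SUM(SomeValue)
FROM TestingRangeQueries WITH (INDEX (idx_RangeQueries_NC1))
WHERE ID BETWEEN 20000 AND 200000

-- Test 3: force the covering nonclustered index
SELECT SUM(SomeValue)
FROM TestingRangeQueries WITH (INDEX (idx_RangeQueries_NC2))
WHERE ID BETWEEN 20000 AND 200000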
First up, the clustered index.
As expected, we get a clustered index seek (the predicate is SARGable) and a stream aggregate.
Table ‘TestingRangeQueries’. Scan count 1, logical reads 12023, physical reads 0.
SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 110 ms.
So if the advice is correct, this should be the best (lowest CPU, lowest IO). Let’s see…
The first nonclustered index does not cover the query. Hence, seeing as this query returns a substantial portion of the table, we can assume that the optimiser probably wouldn't choose to use it, because of the cost of the key lookups. If that is the case, then the query probably won't be very efficient if I force the use of that index.
Table ‘TestingRangeQueries’. Scan count 1, logical reads 551413, physical reads 0.
SQL Server Execution Times:
CPU time = 562 ms, elapsed time = 560 ms.
Ow. Not very efficient at all. Those key lookups hurt: roughly 180 000 of them, each one a separate seek down the clustered index's b-tree at around three reads apiece, which accounts for the bulk of those 551 413 reads.
One last index to test, the covering non-clustered index.
Table ‘TestingRangeQueries’. Scan count 1, logical reads 338, physical reads 0.
SQL Server Execution Times:
CPU time = 62 ms, elapsed time = 67 ms.
2/3 the CPU usage of the query using the clustered index and about 3% of the reads. No question about it, this one’s faster and less resource intensive. That pretty much invalidates the claim that the clustered index is best for range queries.
So what’s going on here?
The TechNet article I linked to at the beginning of this post states the following as reasoning for recommending a clustered index for range queries:
After the row with the first value is found by using the clustered index, rows with subsequent indexed values are guaranteed to be physically adjacent.
Um, well, ignoring that there’s no guarantee of physical adjacency with an index at all, how does this differ from a nonclustered index?
In a clustered index, the leaf pages are logically ordered by the clustered index key (meaning that SQL can follow a page’s next page pointer to get the next page in the key order). To do a range query using the clustered index, SQL will seek down the b-tree to the start of the range and then read along the leaf pages, following the next page pointers, until it reaches the end of the range.
In a nonclustered index, the leaf pages are logically ordered by the index key (just the same as in a cluster). To do a range query using the nonclustered index, SQL will seek down the b-tree to the start of the range and then read along the leaf pages, following the next page pointers, until it reaches the end of the range. If additional columns are needed, SQL will then do a key/RID lookup for each row to retrieve the additional columns.
Not much difference there, other than the key lookups. So 'physical adjacency' is pretty much ruled out as a reason for using the clustered index (if it was even true).
What is important, as we saw from the two tests of the nonclustered index, is that when the query is retrieving a significant portion of the table (and by ‘significant’ I mean more than about 1%), the index needs to be covering, or the cost of the key lookups becomes overwhelming. Hence, what we want for a range query is a covering index.
The clustered index is always covering, because it contains, at the leaf level, all the columns of the table. It is, however, the largest index on a table (1). The larger the index, the less efficient that index becomes. Hence while the clustered index is good for a range query, it's not the best possible index for a range query.
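The relative sizes are easy to confirm. Something along these lines (using sys.dm_db_partition_stats) shows the clustered index on this test table dwarfing both nonclustered indexes, since its leaf level carries the 500-byte Filler column:

SELECT i.name AS IndexName,
       SUM(ps.used_page_count) AS Pages
FROM sys.indexes i
INNER JOIN sys.dm_db_partition_stats ps
    ON ps.object_id = i.object_id
   AND ps.index_id = i.index_id
WHERE i.object_id = OBJECT_ID('TestingRangeQueries')
GROUP BY i.name

The smaller an index's rows, the more rows fit on a page, and hence the fewer pages a range scan of a given number of rows has to read. That's exactly why the covering nonclustered index needed 338 reads where the cluster needed 12 023.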
The best possible index for a range query is the smallest index that is seekable and covers the query (the same as for just about any other query).
Now it’s not always possible to cover a query, and some queries shouldn’t be covered. There will be times when the cluster is the best choice for range queries, either because of the number of columns required or because just about every query filters on a particular column and that column is a good choice for the cluster. Just don’t make the mistake of thinking it’s the only choice.
(1) It is possible to have a nonclustered index that’s larger than the clustered index. Takes some work though and is far from a usual case.