Introduction
When we run a query, the choice of whether to use an index or
not is decided by the query optimizer. Using appropriate indexes on tables so
that the query optimizer chooses the most efficient strategy can yield
significant performance gains . We will try to examine a few scenarios to
familiarize ourselves with the way indexes are used to retrieve data and also
try to figure out which index clustered or non clustered is good on each
occasion.
Let us create a sample table so that we can use it for our
purpose.
( member_number INT NOT NULL,
account_number INT NOT NULL,
member_SSN INT NOT NULL,
account_balance MONEY NOT NULL,
member_name VARCHAR(30) NOT NULL,
member_address VARCHAR(60) NOT NULL )
We will use a number of scenarios here.
Retrieving a single row from the members table
Select account_balance from members where member_number =
9000
Let us say there is a clustered index on the member_number
column.
SQL Server will first obtain the page number of the root page of
the index from the sysindexes table. Then it will traverse through the index
pages looking for a key value that is not greater than the current key value. In
a clustered index the index consists of index key and a pointer which is the
Page ID. So just one level above the leaf level it gets the page pointer to the
data page holding the particular row and it scans the data page for the row
containing the key.
Let us say there is a non clustered index on the
member_number column.
In case of a non clustered index too the traversal works in a
similar manner, the pointer to the key value we want to retrieve is found in the
leaf level which is the Row ID of the data row. Using the Row ID sql server
pulls the data row.
The difference here is, the non clustered index has one more
logical read. The performance difference is very less between the two indexes
here.
Retrieving a range of rows
Select account_balance from members where member_number
between 9000 and 10000
In case of a clustered index on member_number column sql
server will look again for the highest key value not greater than the lowest key
we wish to retrieve. It will traverse through the index and will find the data
page that contains the lowest key value which is 9000 in our case. It scans the
data page for 9000 and when it finds it, it then retrieves the rows sequentially
until it reaches 10000, as we know in a clustered index the data rows are in
arranged in the same order as the clustered index key.
In case of a non clustered index on member_number
column the traversal of the index happens in a similar way. But once the
leaf level is reached and the key value is found, the index entry contains the
Row ID of the data row to be found which in this case will be the Row ID for
member_number 9000. Using this Row ID, sql server will pull the data page
holding that row. In a similar way it has to pull all the other rows for
member_numbers between 9001 and 10000. We have to note a difference here. The
key values are in sequence in a non clustered index, but the data pages are not.
Let us say we have 50 rows satisfying the query and approximately 10 rows are
present in each data page. So sql server has to retrieve the 50 rows in 50/10 =
5 logical reads to the data pages using a clustered index and sql server will
need 50 logical reads to the data pages using non clustered index. This is a
significant performance difference.
So it is always better to create a clustered index when we are
accessing a range of rows.
Covered queries
Select account_balance from members where member_number
between 9000 and 10000
Same query as before , but this time let us say that the members
table has a composite non clustered index on member_number and
account_balance columns. The query is called a covered query since the items
in the where clause and select clause belong to the index. The query optimizer
will realize that this is a covered query. In this case the sql server does not
have to go to data level to work on the query. It can find the results in the
leaf level of the non clustered index. It traverses the index based on the
member_number and when it reaches the leaf level , it finds the account_balance
information next to it in the key itself with out having to access the data page
through the pointer.
Let us say there are 500 rows that satisfy this query and there
are 100 index entries per data page. So to get the results sql server has to do
5 logical reads of data pages. This is more efficient than having a clustered
index on the member_number column since we do not have to access data pages at
all.
So sometimes a covered non clustered index is more efficient
than an equivalent clustered index.
Retrieving rows with clustered index and non clustered index on
the table
Select account_balance from members where member_number
between 9000 and 10000
Here we will examine a different scenario where there is a
non clustered index on the member_number column and clustered index on
the account_number column. The main thing to note here is that the members
table will now has leaf level index entries containing the clustered index key
as a pointer instead of Row ID when there is no clustered index on the table. So
in order to access the data rows, sql server has to traverse through the
clustered index. In our example in order access row with member_number 9000, sql
server will traverse through the non clustered index and in the leaf level it
will find the index entry for 9000 having a clustered index key which will be
the account_number. Based on this account_number it traverses through the
clustered index and finds the data row. Suppose we have 100 rows to fetch, we
have to access the clustered index 100 times to fetch the data pages containing
the rows.
This is very inefficient and probably the query optimizer will
choose a table scan instead of an index scan.
In the same scenario let us change the query and execute the
following query.
Select account_number from members where member_number
between 9000 and 10000
So sql server will traverse through the non clustered index as
before until the leaf level where it finds the clustered index key which is the
account_number in this case. So sql server need not go fetch the data pages
traversing through the clustered index again, since it found the member_number
and account_number side by side in its index entries.
So the query optimizer will choose the index scan since it is
very efficient in this case.
Retrieving rows with multiple non clustered indexes on the
table
Select * from members where account_balance between 500 and
5000 and member_number between 9000 and 9500
Let us say there is a non clustered index on account_balance
column and another non clustered index on the member_number column with out
any clustered index on the table. In this case the query optimizer can use both
the indexes. Since there is no clustered index, the leaf level of the indexes
has Row IDs as pointers to the rows. So the query optimizer examines both
indexes and gets the sets of Row IDs that match the given selection.Then it gets
the intersection of both the sets of Row IDs. So instead of pulling the
individual data pages, the query optimizer first gets a result set of all the
RowIDs that match both the criteria. Based on how many rows satisfy the
condition it estimates whether to make use of the indexes or to do a table scan
to retrieve the rows. Let us say there are 100 rows that satisfy this query and
there are 10,000 rows spread over 500 data pages (20 rows per page) in the
table, it may be efficient to use the index scan and do 100 logical reads to get
the data pages than scanning all the 500 data pages.
Conclusion
The decision of the query optimizer depends on the cost it would
incur or in other words how many logical or physical reads it has to do to
retrieve the result of a query. Hopefully these different scenarios will help a
little bit in better understanding the usage of indexes by query optimizer.
Any comments are welcome.
References
Microsoft SQL Server 2000 Performance Optimization and
Tuning Hand Book by Ken England.