Introduction
A while back I wrote an article about clustered indexes (Cluster
that index!), where I stated that in my opinion all tables should have a
clustered index defined for them. Since I wanted to keep the focus specifically
on the advantages and disadvantages of clustered indexes I intentionally left
some information and discussion regarding indexes out of the article. In this
article I would like to elaborate on these matters as well as responding to some
comments by you readers to the previous article. If you haven't read that
article I recommend you to do that prior to reading this one as much of this
information relates to that of the other article.
Non-clustered indexes are not bad in any way!
First of all I want to clear any confusion regarding non-clustered indexes. I am
not in any way saying they are bad or should not be used. What I am saying is
that I think every table should have a clustered index, so if you're only going
to have one index on a table it should be a clustered one. The focus on the
prior article was on showing the differences between tables that have a
clustered index and those that don't (heap tables, or simply heaps). In this
article I will describe some more details regarding indexes and also try to give
some advice on where to use what index type.
Indexes and modifications
One question I was asked was how indexes and their effect on modifications
(INSERT, UPDATE and DELETE) fits in with the clustered/non-clustered pros and
cons. First of all, I don't really see this as something that should define
whether a table should have a clustered index or not. Consider table T1 below,
where we have a column id that we'll probably want to define as primary
key, and a second column name which is probably the argument used when
searching this table.
CREATE TABLE T1 (
id int identity(1,1) NOT NULL , name varchar(25) NOT NULL)
Even if id is sometimes specified as search argument (it might be an employee id
for instance), it will probably not be used for range searches, i.e. WHERE id
BETWEEN 1 AND 10. The second column, name, will be much more likely to do range
searches with. Remember that a right-truncated search (using % as a suffix in a
LIKE clause) is a range search, i.e. WHERE name LIKE 'a%'. In the prior article
I showed that clustered indexes really excel in range queries, while they won't
have any different effect (neither positive nor negative) than a non-clustered
index for 'normal' queries. So, we can easily see that name is a good candidate
for a clustered index, and id would not be a good choice. Remember however that
id is the primary key of the table, and when you define a primary key for a
table SQL Server will automatically create an index for that key. By default SQL
Server will create a clustered index for a primary key, but you can specify
explicitly that you want the index to be a non-clustered index. But as I showed
in the prior article all tables should have a clustered index defined for them
as that effectively changes the way the table is physically stored. As I
described, since heap tables can suffer from some very performance heavy
disadvantages, the biggest advantage of having a clustered index was not that it
was great for range searches but rather the disadvantages avoided by having it.
So, to get back to the question, how does clustered/non-clustered indexes fit in
with indexes and their disadvantages on modifications? As I said earlier, I
don't think it is relevant at all. If you have a table that is heavily modified
more or less all the time and you therefore don't want to have too many indexes
on it, then let your the id column (which is the table's primary key) have a
clustered index and be done with that. You probably won't be able to use the
nice features of range searches with it, but at least you'll avoid the troubles
of heap tables. However, since I work not only as a DBA but also as a search
engine developer I tend to favor quick responses when searching, and therefore I
like to index not only keys but also columns used in search arguments. And
that's my recommendation, if you are going to have just a single index, make it
a clustered index, but if you're going to have several indexes you should think
hard about which one to choose as a clustered index. But always have a clustered
index on every table.
Size
Another question I got was how much disk space a clustered index takes up
compared to the table it is indexing, and if this extra disk space is a
disadvantage of clustered indexes. Before I answer this, let me just say that if
I had to choose between the extra disk space a clustered index takes or leaving
a table as a heap table I would probably choose a clustered index (with the
extra disk space it takes) in 99% of all situations. Of course, every choice you
make regarding performance and configuration should always be thought through
and well tested, but disk space is so cheap today that I think it would be very
unusual if you had to make that choice.
It is still interesting to calculate the size of an index, and as I said in some
situations you might have to take this into consideration. Actually, a clustered
index does not take that much extra disk space. First of all, remember that the
leaves of the clustered index is the actual data, sorted in index order, so even
if you drop the clustered index these pages will be (more or less) the same. The
extra space usage instead comes from the levels above the leaf level in the
index B-tree. Exactly how much space these extra levels use depends on the
amount of data in the table and the size of the indexing key (the column(s) that
the index is defined on). Each data page, and remember that in a clustered index
this is the same as the leaf level pages of the index, requires one index row in
the index page on the level directly above it. To calculate the number of index
pages needed for the level directly above the leaf level we use the following
formula:
P / (8096 / K)
where P is the number of leaf level pages and K is the size (in bytes) of the
index key. 8096 is the maximum amount of bytes that can be stored per index
page. If this result is more than 1 we take this number and use it as P in the
same formula to find out the number of index pages in the next level. When we
end up with just one page we've found the root level. In Inside SQL Server
2000 Kalen Delaney shows an example of this. For a table with 10.000 data
pages (each page containing 8 KB data) there is a clustered index defined on a
fixed-length character column of bytes (char(5)). Each index key row in this
index use 12 bytes (key size + overhead). This means that 15 index pages (10000
/ (8096 / 12) = 15) is needed at the level directly above the leaf level. The
level above this will consist of a single page, the root level. The extra size
used by adding this clustered index is therefore 16/10000, i.e. less than 1%. As
you can see, a clustered index doesn't really use very much extra space. 1% is
normally a good estimate that you can use if you don't want to calculate it
exactly, it will even be a bit high (as in the example). As always it is of
course necessary to define the index key as small as possible.
FILL FACTOR & PAD INDEX
One thing that I intentionally left out of the prior article was a description
of fill factor and pad index, in order to keep focus on the problems of not
having a clustered index on a table. One thing I did mention was that one slight
disadvantage you might run into by having a clustered index on a table are the
page splits that can occur when adding data (or sometimes when changing existing
data) to the table. This is a part from that article:
"Because the data is stored in the order of the index, to insert a new row
SQL Server must find the page with the two rows between which the new row shall
be placed. Then, if there is not room to fit the row on that page, a split
occurs and some of the rows get moved from this page to a newly created one."
As some of you readers commented, one way to counter this (at least to a degree)
is to specify a value for the fill factor and possibly pad index for the
clustered index. Fill factor specifies the percentage that SQL Server should
fill every data page to when creating an index. Specifying a fill factor value
of 50 therefore means that each data page will only be half full, in effect
meaning that the storage space needed for the table is doubled (since there are
twice as many data pages). The good thing is that page splits are not as
necessary as with a higher fill factor, since there will usually be room on a
page to insert a new row without splitting the page into two and moving rows
between them. If fill factor is specified you can also specify the option to use
pad index (it uses the same value as fill factor), which works in the same way
as fill factor only it is used for index pages in the intermediate levels
(between the root level and leaf level).
Note that both options are only used when creating or rebuilding an index. They
are not maintained after initial 'filling' by SQL Server, as that would defeat
the original purpose to use them, i.e. avoiding page splits (as SQL Server would
have to use page splits to maintain them). Also note that these options apply
for non-clustered indexes as well, the index pages of non-clustered indexes use
page splits in the same way as clustered indexes of course. As always, think
carefully and test the effects of changing the fill factor and pad index values.