In our earlier post, we have discussed about the structure and storage of clustered index. In this post we will discuss about the design consideration of clustered index.There are couple of points that need to be considered while selecting clustered index key.There is no hard rule in selecting the clustering key . It is best practices and guidelines and the internal knowledge will help us to select right clustering key.
Uniqueness
SQL server will allow you to create a clustered index on non unique column, but uniqueness is one of the most desirable attribute for any indexes especially for a clustered index.Even if SQL server allow to create clustered index on a non unique column , internally SQL server add 4 bytes value for all duplicate instance of clustered key and this 4 bytes variable length column is known as uniquiifier .In this case SQL server consider the clustered key as the combination of non unique column on which clustered index is defined and internally generated uniquifier column.This value will be stored where ever the clustered key to be stored. For example in the leaf level of a non clustered index defined on a clustered table.
Let us see an example.I am going to create a copy of SalesOrderDetail table and define clustered index on the productid column which has duplicate values.
Use MyDb
GO
SELECT * INTO dbo.SalesOrderDetailDupCI FROM AdventureWorks2008.Sales.SalesOrderDetail
GO
CREATE CLUSTERED INDEX ix_SalesOrderDetailDupCI ON dbo.SalesOrderDetailDupCI(ProductId)
Now let us run the DBCC IND command
DBCC IND('mydb','SalesOrderDetailDupCI',1)
This returns me 1608 pages and the root page number is 3570(The value of PagePID column of the record having highest value for indexlevel column ) . Below is the output of DBCC Page command for the root page and one intermediate page
In the output, we can see one additional column called UNIQUIFIER which we did not seen when we created a unique clustered index in earlier post. When a clustered index created on a non unique column , SQL server add 4 bytes random value for all duplicate occurrence of clustered key. It will not generate the uniquifier value for the first occurrence. That is the reason there are 0 for uniquifier column for some records. So clustered key defined on the non unique column has an overhead of generating the uniquifier value and also widen the cluster index key.In our example we defined clustered index on a 4 bytes column but actual clustered key size become 8 bytes due to uniquifier.This combination has to replicated to leaf level of all non clustered index .The magnitude of the issue will increase when there is non clustered index defined on non unique column where this clustered index value need to be stored in the non leaf level pages also.(We will discuss about non clustered index on non unique column in later post).
If your table does not have a single unique key to define clustered index, try to make it unique by adding one or two narrow column to the clustering key. That will avoid the overhead of adding the uniquifier and reduce the bookmark look up operation as the non leaf level of non clustered index has more columns (The additional column added to clustered key to make it unique)
Static
Another desired property of the clustered index key is to be a Static. when we define clustered index on a non static column , that makes the update statement(updating the clustered index key) more costly as it has to move the record into different page to make sure the data is stored in the logical order of clustered index and leaf level of all non clustered index need to be updated. Let us see a sample
set statistics io on
GO
--Unique clustered index on SalesOrderDetailIdUPDATE SalesOrderDetail_StaticIndex SET ProductID =99 WHERE ProductID=707
GO--Nonunique clustered index on ProductIdUPDATE SalesOrderDetail SET ProductID =99 WHERE ProductID=707
In the below output you can see a huge change in IO for the second update statement.
Even in small table (Customer_id,Firstname.lastname) on which clustered index defined on a non static column (Lastname) and which has one non clustered index, any change in the clustered index key need to make changes in minimum of two pages.One data page and one leaf level page of non clustered index.
Size of the clustered index key
Size of the clustered index refers to the number of bytes requires to store the clustered index key.As the size of the clustered index increases, more IO required to fetch the records.This is happening because an index page can store lesser index row if the clustered index is wider. That increase number of pages in the intermediate level and depth of the indexes (Levels in the b tree structure) .For example, a table which contain millions of records might need only 3 level in b tree structure if the clustered index in defined on integer column. If we defined clustered index on wider column (say a uniqueidentifier column which need 16 bytes), the depth of index might increase to 4 (Level of index). Any clustered index seek requires 4 IO operation compared to 3( if the clustered index is narrow).
This issue propagate to non clustered index also as the clustered index keys are stored in the leaf level of all non clustered index as a pointer to the clustered index. If the non clustered index is defined on the non unique column , the clustered key need to be stored in non leaf level pages of the non clustered index . Which again might cause for more page in intermediate level and to increase the depth of the in non clustered index which in turn increase the IO operation in non clustered index seek/scan also. As the clustered index depth is increased to 4, each bookmark look up operation also need to do 4 IO operation.
Sequential
It is a best practice to define clustered index on a ever increasing(sequential) column. That is the reason we can commonly seen clustered index defined on identity column. Clustered index defined on a non sequential column cause for fragmentation . To read more on fragmentation refer this posts . A non sequential clustered index key force the SQL server to insert record in between to maintain the logical ordering of the data.This lead to page split which cause for external fragmentation and internal fragmentation.
Conclusion
We have discussed about the desired qualities of the clustered index key and the reason behind that.The above points that we discussed are general best practices to be considered while deciding the clustered index key.Apart from this, data access pattern also influence in deciding the clustered index key.If we do not have complete understand of the data access pattern , we might need to test the performance with different strategies.
If you liked this post, do like my page on FaceBook