Earlier this month, I presented a session for Pragmatic Works Training on the T's on selecting the clustered index for a table in a SQL Server database. In the session, titled Choosing Your Clustered Index, I discussed the patterns around choosing clustered indexes and why you might choose one pattern over another.
The abstract for the session is:
We often hear that you should always have clustered indexes on your tables. What's left out of this discussion is how to choose the columns for this index. Should it be the primary key? Or some other column or set of columns? Having the wrong key column(s) in the clustered index can lead to performance problems, possibly worse than having no clustered index on the table at all. In this session, we'll review common patterns for selecting clustered indexes and how to determine which pattern you need. Also, we'll look at how to analyze tables to help select the best clustered index for every table.
Webcast Recording
If you missed the webcast, a recording of the session is available through the Pragmatic Works Learning Center here. The presentation deck and demo code can be downloaded here (Choosing Your Clustered Index), or you can flip through the deck below. If you are interested in having this presentation, or others, delivered to your user group or company, feel free to reach out to me in the comments or through Twitter.
Session Questions and Answers
The questions below were asked during the session. I’ll keep the answers fairly brief, so if I miss a point and need to explain more, let me know.
Q: All of the examples relate to improving reads of the data. How do they affect the writes when you are loading many rows through ETL?
There are a few ways in which writes are affected by clustered indexes. First, if the write occurs as an insert, the clustered index determines where the write will occur. With an ever-increasing key, that write will occur at the right edge of the B-Tree for the clustered index. If the key is not ever-increasing and rows are sorted on a non-sequential value, the writes during the ETL process will generate page splits, and those page splits will negatively impact the performance of the writes. Also, depending on the number of rows being inserted, it is possible for the inserts to overload the system's ability to allocate pages to the index; this can be checked by monitoring PAGELATCH_* waits.
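As a quick sketch of that last point, a check like the following against sys.dm_os_wait_stats will show whether PAGELATCH_* waits are accumulating on the instance; treat it as a starting point rather than a full monitoring solution.

```sql
-- Sketch: PAGELATCH_* waits accumulated since the instance started
-- (or since the wait statistics were last cleared).
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'PAGELATCH%'
ORDER BY wait_time_ms DESC;
```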
Q: Are there any data types that you should avoid using?
Outside of int, I would avoid numeric and decimal data types with a size greater than 9 bytes, and char and varchar columns, and their Unicode counterparts, that are larger than 10 bytes. Of course, I usually try to avoid uniqueidentifier due to its size and the ever-increasing value issues mentioned in the presentation. The key with the clustering key is to choose the smallest data type possible to store the key. These sizes aren't hard and fast, but guidelines to help identify when to look closer at the data type and key column choice.
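To put that guideline to work, a rough query along these lines lists the data types and sizes of the existing clustered index key columns in a database, which helps spot keys that are wider than they need to be. This is just a sketch; adjust the filters for your environment.

```sql
-- Sketch: data types and maximum sizes of clustered index key columns.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       c.name                   AS column_name,
       t.name                   AS data_type,
       c.max_length             AS max_length_bytes
FROM sys.indexes i
JOIN sys.index_columns ic
    ON ic.object_id = i.object_id
   AND ic.index_id = i.index_id
JOIN sys.columns c
    ON c.object_id = ic.object_id
   AND c.column_id = ic.column_id
JOIN sys.types t
    ON t.user_type_id = c.user_type_id
WHERE i.type = 1                    -- clustered indexes only
ORDER BY table_name, ic.key_ordinal;
```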
Q: Is the foreign key as the clustered key better than using an identity field as the clustered index and the foreign key as a non-clustered index?
If the primary method for accessing the records is based on a foreign key, then it can be better. For instance, if the parent table is an order header table and the child table is order details, then it is likely that the order ID from the order header table will be used to access and retrieve rows from the order detail table. You can see this when querying, since with the foreign key as the clustering key, the order detail table will more often be accessed with index seeks rather than with index seeks coupled with lookup operations.
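As a rough sketch of the pattern, using hypothetical order header/detail tables (the table and column names are just for illustration), the primary key stays on the identity column as a non-clustered index while the foreign key becomes the clustering key:

```sql
-- Hypothetical example: cluster the child table on the foreign key so that
-- all detail rows for an order are stored together and retrieved with a seek.
CREATE TABLE dbo.OrderDetail
(
    OrderDetailID int IDENTITY(1,1) NOT NULL,
    OrderID       int               NOT NULL,   -- foreign key to the order header table
    ProductID     int               NOT NULL,
    Quantity      int               NOT NULL,
    -- Primary key kept non-clustered on the identity column.
    CONSTRAINT PK_OrderDetail PRIMARY KEY NONCLUSTERED (OrderDetailID)
);

-- Clustering key on the foreign key used for most lookups.
CREATE CLUSTERED INDEX CLUS_OrderDetail_OrderID
    ON dbo.OrderDetail (OrderID);
```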
Q: Do you have any white papers regarding the different patterns?
For a fuller explanation of the clustered index patterns, check out Expert Performance Indexing for SQL Server 2012. This is the book I wrote, on which the presentation is based.
Q: For a multi-column index, does the order of the columns matter?
Yes, the order of the columns does matter. The rows for the index are sorted in the order in which the columns are listed in the index. The leftmost column is the primary sort for the index, and the other columns are sorted left to right under that column. Because of this, it is important that the columns most often used to identify rows appear first in the column order.
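For illustration, with a hypothetical dbo.Orders table, the column order determines which predicates can seek on the index; this is a sketch, not demo code from the session.

```sql
-- Hypothetical multi-column clustered index: rows are sorted by CustomerID
-- first, then by OrderDate within each CustomerID.
CREATE CLUSTERED INDEX CLUS_Orders_CustomerID_OrderDate
    ON dbo.Orders (CustomerID, OrderDate);

-- Can seek: the leading column (CustomerID) is in the predicate.
SELECT OrderID FROM dbo.Orders WHERE CustomerID = 42;

-- Cannot seek on the key order alone: OrderDate is not the leading column,
-- so this query scans unless another index covers it.
SELECT OrderID FROM dbo.Orders WHERE OrderDate = '20240101';
```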
Q: Can you use columns of different types for a multi-column index?
Yes, different data types can be used from column to column within an index.
Q: How about a column with a lot of NULLs and an nvarchar data type? Would you add a clustered index, and also add a primary key to the table if there is no primary key on the table?
An nvarchar data type can be used for the clustering key without issue, as long as it's the right column for the table and how it will be used. There are three reasons not to use columns with NULL values for primary keys and clustered indexes. First, NULL values are not allowed in primary key columns. Second, if there are a large number of NULL values, the column is not ideal for identifying a row. Third, columns that allow NULL values have a good chance of being updated in the future, which goes against the guideline that clustering key columns should be static.
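As a quick illustration of the first point, with a throwaway table (the names here are hypothetical), trying to put a primary key on a nullable column fails outright:

```sql
-- Hypothetical example: a nullable column cannot be used in a primary key.
CREATE TABLE dbo.NullKeyDemo (SomeValue nvarchar(50) NULL);

-- This statement fails because primary key columns must not allow NULLs.
ALTER TABLE dbo.NullKeyDemo
    ADD CONSTRAINT PK_NullKeyDemo PRIMARY KEY CLUSTERED (SomeValue);
```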
Q: I am getting about 50 million rows; it's static data and will get some incremental data in a few months. The question came up whether we should use SQL Server or Oracle, and if SQL Server, how to design it so analysts can read the data without waiting for hours.
For a data warehouse of that size, both SQL Server and Oracle can easily handle the design. As a preference, I work with SQL Server, due to the ease of use of the platform and considerations around the TCO of a SQL Server platform. For building out the data warehouse, I would focus on Kimball design and leverage a star schema. Depending on the size of the data today, and into the future, you can look at Microsoft's reference architectures. For data warehouses that won't exceed 10 TB in size, I'd take a look at the Fast Track Reference Architecture. If the size will exceed 10 TB, I'd look to the Parallel Data Warehouse, with its MPP architecture.
Q: Is it the same performance if you add INCLUDE fields to the ICP index? It would eliminate the need for the key lookup, but it would still be more initial logical reads, correct?
Non-clustered indexes are ideal for improving the performance of secondary search patterns for data in a table. Adding INCLUDE columns can reduce the I/O and improve the CPU performance for execution plans that use non-clustered indexes and leverage key lookups to retrieve additional data. In most cases, an index with INCLUDE columns will not cause more logical reads than a non-clustered index seek coupled with a key lookup.
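As a sketch of the idea, using hypothetical table and column names, the INCLUDE columns ride along at the leaf level of the non-clustered index so the query below no longer needs a key lookup:

```sql
-- Hypothetical covering index: OrderDate is the key column, and the INCLUDE
-- columns are stored at the leaf level to cover the query.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    INCLUDE (CustomerID, TotalDue);

-- Satisfied entirely from IX_Orders_OrderDate, with no key lookup.
SELECT CustomerID, TotalDue
FROM dbo.Orders
WHERE OrderDate = '20240101';
```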
Q: Is there a way to access your test code on the foreign key and multi-column patterns to play with it more?
Yes, the download is listed in the previous section.
Q: Just a comment: page splits will still happen, even if the clustered index is ever increasing, won't they? When a new row needs to be entered into a page and there is no room, there will be a page split.
Correct, page splits occur whenever a new page is allocated to an index. The key is to look for page splits that occur in the middle of an index. Those that occur on the right edge of the B-Tree are usually not an issue.
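As a starting point, the Page Splits/sec performance counter can be checked with a quick query like the one below. Keep in mind it counts all page splits, including the harmless allocations at the right edge of an ever-increasing index, so it has to be read alongside the index designs.

```sql
-- Sketch: current value of the Page Splits/sec counter for the instance.
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page Splits/sec';
```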
Q: In case this is not covered: how do we choose a cl index based on usage patterns of indexes on a table (number of seeks, scans, lookups)?
The key is to look for the column that is most used to access rows within the table. It requires considering how applications will interact with the table. Both seeks and scans should be considered. The column(s) that is most often used for this activity is the most likely candidate for the clustered index key column(s).
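As a sketch, a query along these lines against sys.dm_db_index_usage_stats shows how each index in the current database is actually being used, which feeds directly into that decision:

```sql
-- Sketch: seeks, scans, lookups, and updates per index in the current database.
SELECT OBJECT_NAME(us.object_id) AS table_name,
       i.name                    AS index_name,
       us.user_seeks,
       us.user_scans,
       us.user_lookups,
       us.user_updates
FROM sys.dm_db_index_usage_stats us
JOIN sys.indexes i
    ON i.object_id = us.object_id
   AND i.index_id = us.index_id
WHERE us.database_id = DB_ID()
ORDER BY us.user_seeks + us.user_scans + us.user_lookups DESC;
```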
Q: Should the fill factor of identity-column clustered indexes be always the default 100%?
Usually, it should. If you are using a non-ever-increasing value for the clustered index keys, then there can be cases where other fill factors should be considered.
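If you do decide to lower the fill factor for a non-sequential key, it would look something like this; the table and index names here are hypothetical, and 90 is just a placeholder value to evaluate, not a recommendation.

```sql
-- Hypothetical example: rebuild a clustered index on a non-sequential key,
-- leaving some free space on each page to absorb mid-index inserts.
ALTER INDEX CLUS_Customers_LastName
    ON dbo.Customers
    REBUILD WITH (FILLFACTOR = 90);
```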
Q: Is this pattern mainly to pull less pages into the buffer cache?
Generally, the sign of a good clustered index key column selection is increased selectivity in queries. A side effect of increased selectivity is a reduction in the number of pages in memory. One of the key strategies for a well-performing system is to keep only the most required data in memory. Since memory is not limitless, focusing on finely tuned indexing can extend the life of a system and allow an expansion of the data that can be serviced by the SQL Server instance.
Q: Using many indexes also reduces performance, so how do we identify how many indexes we need to use?
To know how many indexes are required for a table, look at the index usage stats and missing index DMVs to start. These will help identify the main issues and usage patterns with indexes. Add in the indexes that are needed and then remove those that are not used, or not important. The other thing to look at is performance in general: if there are high levels of parallelism in use or more scans than seeks, there is a chance that additional indexes may be required.
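The usage-stats side of that is sketched in the earlier answer; for the missing index side, a rough query like this surfaces the optimizer's suggestions. Treat the output as hints to evaluate against your workload, not indexes to create blindly.

```sql
-- Sketch: missing index suggestions, weighted roughly by how much the
-- optimizer estimates they would have helped.
SELECT mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks,
       migs.avg_total_user_cost * migs.avg_user_impact * migs.user_seeks AS rough_score
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig
    ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats migs
    ON migs.group_handle = mig.index_group_handle
ORDER BY rough_score DESC;
```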
Q: Why is it called an Index Allocation Map when there’s no index? Does it have to have an empty IAM?
IAM is just the name of the page, since that is what it is primarily used for. The purpose of the IAM page with heaps is to identify all of the extents that are associated with the heap. Though there is no implied order to the heap, SQL Server still needs to identify all of the pages that are associated with it.
Q: A clustered index makes SELECT statements fast; what impact does a clustered index have on INSERT/UPDATE statements?
The same benefit that occurs with reading data can be leveraged with updates: the row that needs to be updated can be found quickly and the update can be completed. In cases where the row no longer fits on the page, a page split can occur. On inserts, heaps are usually quicker to insert into; the downside to that speed gain on insert is the loss of being able to directly access the row on the select. With inserts on clustered indexes, if the insert occurs at the end of the index, then the impact is minimal. If the insert occurs in the middle of the index, then page splits may occur.
Q: What is the difference between a clustered index and a non-clustered index?
A clustered index provides a sorting for the key columns within the table. At the leaf levels of the clustered index, all of the data for the table is stored. A non-clustered index is similar in that the key columns for the index are sorted. The difference is that at the leaf levels of the non-clustered index, instead of the data for the table there is a pointer back to the location in the clustered index where the data is stored.
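A small, hypothetical example of the two side by side (the table and index names are made up for illustration):

```sql
-- The clustered index (via the primary key here) stores the table's data
-- sorted by ProductID at its leaf level.
CREATE TABLE dbo.Product
(
    ProductID   int          NOT NULL,
    ProductName nvarchar(50) NOT NULL,
    ListPrice   money        NOT NULL,
    CONSTRAINT PK_Product PRIMARY KEY CLUSTERED (ProductID)
);

-- The non-clustered index stores only its key (ProductName) plus the
-- clustering key (ProductID) at its leaf level, as a pointer back to the row.
CREATE NONCLUSTERED INDEX IX_Product_ProductName
    ON dbo.Product (ProductName);

-- Seeks IX_Product_ProductName, then uses the stored clustering key to
-- retrieve ListPrice from the clustered index (a key lookup).
SELECT ProductID, ListPrice
FROM dbo.Product
WHERE ProductName = N'Widget';
```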
Thanks for Attending
Thanks to those that attended the webcast. The session ran pretty much to the end of the hour, so the questions that were asked are listed above. If you had a question that didn't get answered or asked during the webcast, feel free to leave it in the comments below.