November 11, 2014 at 2:20 am
Hi,
I know this subject has been done to death but i find so many conflicting views on picking the correct clustered index. In the past i have always followed the advise given by Michelle Ufford https://www.simple-talk.com/sql/learn-sql-server/effective-clustered-indexes/ . The points are all well constructed and make complete sense and are very hard to argue against.
Our development team always create tables with the generic ID int identity(1,1) clustered primary key constraints without even giving the tables use a seconds thought. On reflection though, following Michelle's advice you cant argue against this development approach. The generic ID column is narrow, unique, static and ever increasing.
Then there's the other side of the debate whereby the clustered index should be chosen by analyzing the queries and seeing which columns are accessed most, particularly ones which use ranges in the where clause. I cant disagree with this approach either other than the downsides what it leads to index fragmentation and SQL having to add a uniqueifier to the index.
A typical example case would be our CRMCases. A subset of the table would be as follows
Create table CRMCase
(
ID int identity(1,1) primary key,
EmployeeID int,
CreateDate datetime default getdate()
)
go
Create table CrmCaseNote
(
ID int identity(1,1) primary key,
CRMCaseID int,
Note varchar(200)
)
Now, i know that in the case of the CRMCase table that virtually all access will be done by EmployeeID. So, should ID be the clustered index or should EmployeeID? Typically we would just have a nonclustered index on EmployeeID and leave the CI on ID.
Again with the CrmCaseNote table, all the joins and all the lookups are done against the CRMCaseID column. However choosing this as the CI breaks most of Michelles guidelines. Narrow - Yes, Unique - No (there will likely be multiple notes against a case), Static (Yes), Ever increasing - No (Entries will be made against cases that could have been created at any point in time).
So my question is, what wins? Or is it a case of "it depends" for each situation?
Cheers,
John
November 11, 2014 at 3:50 am
It's a little more complex than that. You also have to take into account the logical, or business, identifier for each row. This is sometimes referred to as the natural key. Michele's choice for a clustered index is the best possible choice for performance, assuming the clustered key is going to be used to retrieve the data. But, if that key is not being used to retrieve the data, then you may need to look to other alternate keys. As much as possible, I'd strongly suggest keeping the clustered key unique. If it's not, you'll get an additional, internal, column added called a uniquifier. That column will be part of the key, even though you can't see it, and will be used throughout.
My approach is to use the most common access path to the data. In most cases this can be an artificial key as you described (you'll still need a unique constraint on the natural key). But, in other cases, the natural key may make a lot more sense, depending on the data access pattern. You may also see situations where the foreign key, or keys, are the best choice. I'd suggest a strong bias towards Michelle's advice, but don't get completely hung on it as a hard and fast, carved in stone, rule. Rely more on the fact that the clustered index stores the data, so, use the most common access path to the data as your driver for picking a good clustered index.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
November 11, 2014 at 4:32 am
Thanks for the reply Grant and may i also take the opportunity to say thank you for all your excellent work in the SQL field, your work has been a great source of education for me.
Your answer pretty much confirms what I thought, and whilst its good to have a natural clustered index which fits both Michelle's ideology and also the most common data access methods i constantly come across the scenario in the OP and can never make my mind up on what is best practice. I find myself constantly questioning my decisions when i go against the NUSE theory.
In the simplified scenario in the OP the only way i could make the clustered index unique on CrmCaseNote would probably be by making it a composite CI on (CRMCaseID,ID). As ID is a primary key it will satisfy the uniqueness but that's still a 4byte overhead, which is just the same as letting SQL add its own uniqueifier. Adding the composite CI has the added benefit that if any queries did use the ID column (in production they don't) then the CI could be used, whereas it wouldn't be if i go down the uniqueifier route. My only other logical choice for making it unique would be using the date field on CRMCaseNote (I didn't show that on the subset of the schema but it does exist) which adds an 8 byte overhead, rather than the 4 byte overhead in the previous 2 options.
Thanks
John
November 11, 2014 at 4:40 am
Thank you for the kind words.
I don't know that I would sweat adding the additional 8 bytes. It's not that much. Yes. it'll be a little wider, but if it's both unique and a good access path to the data, it's worth the added width. I wouldn't suggest adding in the PK, just because, to get it unique. That's a bad approach. You really do want to build the key based on what will be used, not on what simply makes it unique. And, as I said, unique is absolutely preferable, but it's not the be all, end all.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
November 12, 2014 at 3:35 pm
I agree with what Grant says, but there's an additional point that may apply here. Suppose you don't make the primary ke the identity column, but make it a two column key, with the table definition something like this:
Create table CrmCaseNote
(
ID int identity(1,1),
CRMCaseID int,
Note varchar(200),
constraint pk_CRMCaseNote primary key (CaseID,ID)
)
If you have access through ID, this may need an extra non-clustered index on ID, so maybe isn't useful. But if access through ID is sufficiently infrequent you can do without that extra index.
I've seen a few cases where something like this worked to get performance improvements well after unravelling some unneccessary and pointless use of the the ID column in other tables, as if it were a real key instead of just a visible uniquifier (one of the problems with automatic, unthinking generation of a identity primary key which isn't even a surrogate for any real key is that people will use it in other tables just because it's there).
Tom
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply