March 4, 2009 at 5:56 pm
Hello, I'm wondering what the general consensus is for when I should add an extra column to a table specifically to use as a primary key, rather than having a primary key using only the necessary "business function" columns. As an example, let's say I have a Customer table with CustomerID, and then I want to have a CustomerState table which tells how many years a customer lived in a particular state. Should I go with (CustomerID, StateID) as my primary key, or should I introduce a separate column CustomerStateID, which serves no business purpose, and have that be my primary key? And as a follow-up, does the answer change if there can be multiple records for a customer in a state (say, each row gave the start and end year when the customer lived in the state)? In that case, I would need to introduce a new non-business column in any case, so should I just go with CustomerStateID, or (CustomerID, StateID, CustomerStateID)?
I've worked on several projects which have used different approaches to this, but now that I'm designing my own DB I was wondering what factors would cause me to go with one approach vs. the other.
Thanks!
March 4, 2009 at 8:19 pm
There are probably quite a few opinions out there as to what is best and to be honest it really depends on your particular business circumstances.
The things I take into account when choosing Primary Keys are:
* Is there a suitable candidate key already in the table?
* Is it suitable for indexing? (numbers are more suitable than large text fields)
* Is the primary key going to be used as a Foreign key in another table? (If so it can - sometimes! - be easier to just use an ID column)
* Make sure you don't define a primary key on a column that may change (E.g. using someones last name as the PK in your customers table will cause problems if they change their name!)
* I'm sure there are more that I can't think of right now!
In your example above the candidate keys would be (CustomerID, StateID, startDate) OR (CustomerID, StateID, endDate). The first one would probably be a bit more user-friendly. 🙂
Bevan
March 4, 2009 at 8:24 pm
For SQL Server, the general rule is to have a small montonically increasing column as the (unique) clustering key for the table - most people choose from one of the integer data types and make it an identity column. This can be the primary key if you wish or a unique constraint. The important thing is that it is the clustering key.
The other fields (customerid, stateid, year etc) may be the primary key or a unique contraint.
Which one of the above is the primary key is (in my opinion) personal preference. There are definitely other opinions on this but for performance, the clustering key is the most important.
You may be about to ask why, so...
Each non-clustered index contains the value of the clustered index. If the clustered index is large, then the non-clustered index is also large (large is relative however, have 2 or 3 INT columns as the clustered key is bigger than a single INT columns). The larger the index, the more work SQL Server needs to do.
The monotonically increasing requirement comes about because when you are adding rows to a table, you do not want to be attempting to add them to database pages that are already pretty much full. So, when adding a new record using a key with increasing values for new records, you add the records to whatever is currently the last page (and at the end of the page). So, when the page is full, there is not need to split the page (which is an expensive operation) so that you can get the new record into it. Instead, if the page is full, you simply need to allocate another empty page to the table and start adding data to it.
March 4, 2009 at 8:50 pm
The monotonically increasing requirement comes about because when you are adding rows to a table, you do not want to be attempting to add them to database pages that are already pretty much full. So, when adding a new record using a key with increasing values for new records, you add the records to whatever is currently the last page (and at the end of the page). So, when the page is full, there is not need to split the page (which is an expensive operation) so that you can get the new record into it. Instead, if the page is full, you simply need to allocate another empty page to the table and start adding data to it.
This can be quite true for tables that will have a large number of inserts. However, if your table is relatively static this is not so important. You will just have an int column that adds no value.
The best implementation will depend on the business needs.
March 5, 2009 at 4:10 am
The integer key is but one implementation of how to get good performance. For tables that are not subject to update activity will still benefit from this strategy (or any other strategy that results in a clustering key that is not unnecessarilly large). Remember, the clustered key is what is at the leaf level of all non-clustered indexes. Hence, if you have a large key, your non-clustered indexes will require more space and more room in cache. Both of these factors will contribute to potential performance problems.
If appropriate, you could use character columns in the clustered index. The key thing to remember is that it, ideally, should be the smallest that you can reasonably get away with.
March 5, 2009 at 5:28 pm
OK, Just to make sure I understand...
Option A: Add an identity column to the table, so I have CustomerStateID, CustomerID, StateID, with CustomerStateID being the clustered index. I would then add a nonclustered index on CustomerID and StateID, which is what my application would use.
Option B: Do not add an identity column, and just have a clustered unique index & PK on CustomerID and StateID.
It seems like the rules of thumb are to choose option A if:
1) I need a lot of indexes on a table, or just a few indexes but option B results in a wide clustered index
2) Insert performance is important
...but I could go with option B if:
1) I have a simple table that doesn't need a lot of indexes, and if option B keeps a narrow index.
2) Insert performance is much less important than select performance.
I like the simplicity and maintainability of having my clustered unique index and PK in sync, and not having the overhead of another column and extra index if I don't need them.
Let me know if I'm headed in the right direction.
Thanks!
April 20, 2009 at 8:24 am
I can't add much to what was written, except to extend another consideration: those unique and arbitrary integer-based primary key identity columns can be very handy if your front-end can use them to ease processing. A simple example of why comes from an order-detail scenario, where the detail table has a foreign key link to the order header. If your front end has a display where the detail rows are listed, and you need to call them for any purpose (such as seeing more details of the row), having that ID can simplify the call from the front-end. This can dramatically reduce the size of data blocks you're pulling on the front-side if used wisely.
April 21, 2009 at 4:11 am
Hi Jay
yes your understanding is OK. If you want more information, there are various web sites that are really good resources including MSDN, sql-server-performance.com and, of course, SQLServerCentral.com
Cheers
Stephen
April 23, 2009 at 3:20 am
What nobody's mentioned so far is that non-natural keys (ie identity columns) create uniqueness, but they don't enforce it. So if your table is CustomerStateID IDENTITY PRIMARY KEY, CustomerID, StateID, what's to stop you from entering the same CustomerID and StateID twice? Yes, you can create a unique index on CustomerID, StateID, but why bother with an unnecessary identity column when there's a perfectly good natural natural key?
I usually try to avoid identity columns unless there is a good reason for their use, such as:
(1) Natural key is very wide and therefore will hamper performance, especially if used in a foreign key
(2) Natural key may change
(3) You may not wish to expose the natural key to users (although exposing an identity surrogate key would enable users to guess the identifying value of other rows)
John
April 23, 2009 at 6:40 am
In data warehousing solutions the recommendation (Ralph Kimball) is to have a single field surrogate key.
In OLTP solutions I think it depends on how static the data is.
In my case where we have data being received as a result of customer activity then our clustered keys are on the Date/Time of the customer activity and the primary key is on the natural key as much as is practical.
Our partitioning is on the Date/Time and the clustered index on this allows rapid purge/archive activity.
The worse case I've experienced is when the database had an identity column on every single table with the primary key on it but the application never used these columns. The first time a DBA tried to sync up the data between environments the app was broken because the DBA assumed that the primary key was the primary identifier of the record.
Partitioning also allows us to rebuild and index within the partition so fragmentation is a non-issue.
Of equal importance is the method of access. If you are looking to retrieve data by the primary key then fragmentation won't affect you. If you are looking to retrieve data across a range then fragmentation will affect you.
If you are doing a huge number of inserts then remember to look at the fill factor
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply