Surrogate clustered index to guarantee "sequentiality"

  • There are big advantages to having a clustered index which is naturally ordered and always growing. For example, if I had an identity based clustered index, every insert would be guaranteed to create a value (or set of values) that would logically place these new rows at the end of the existing values. This prevents the problems of page splits and index fragmentation (both internal and external), thus giving a performance benefit for both reads and writes.

    My question is whether this principle should be overriding. That is to say, should my clustered be designed this way even if there is a column, or combination of columns, which form the natural key and which are commonly used in query predicates, but where the key is not naturally sequential?

    As a contrived example, let's say I have a table containing entries made by users. A user can't make two entries at the same time, so I have a natural key, the createdDate and the userId. Now let's assume that, say, 80% of the queries are going to include these columns in their predicate.

    I now have two competing interests. Making the natural key also the clustered index will clearly result in a major benefit to 80% of the queries. But the natual key is not naturally ordered. Whether we put the date column first or the user column first, there is no guarantee that the next entry will be greater than all previous entries. In other words, we're going to get page splits and fragmentation.

    In such a case, what I could do instead would be to create a surrogate identity column and set that to be my unique clustered index. Now my splits and fragmentation will be minimal. But just about every query in the system will have to do a key lookup.

    Is the best answer only able to be determined on a case by case basis using profiling, or are there some general rules that can be applied to answer this question?

  • The other point to consider is that the clustered index key, as it is the row's 'identifier' will be in every single nonclustered index. So, by choosing a wide clustering key, all your nonclustered indexes are larger and hence more expensive to use. Your clustered index may also be larger and more expensive to use.

    In situations like this, I'll typically use the cluster to organise the table and nonclustered indexes for data retrieval, if the table is big and frequently inserted into. By using the identity as the clustering key, my cluster is narrow, unique, unchanging and ever-increasing, all of which are desirable traits for a clustered index.

    Also consider that the clustered index doesn't have to go on a natural or artificial key. There may be another column (or set of columns) that is a better choice for the cluster.

    This is a rule-of-thumb, not a guiding principal or general rule. Try both, test both options, see what works better in your specific situation and do not take what I've said as a 'gospel truth' never to be questioned.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • allmhuran (2/21/2009)


    As a contrived example, let's say I have a table containing entries made by users. A user can't make two entries at the same time, so I have a natural key, the createdDate and the userId. Now let's assume that, say, 80% of the queries are going to include these columns in their predicate.

    In this case, the created date is an ever-increasing column assuming users can't insert a row in the past, or in the future. In this case, I might make the cluster (createDate, UserID) providing userID is an int, not a char(50). 12 bytes for a clustering key is pretty good, it's unique (which means SQL doesn't have to do tricks to make the clustering key unique) and it's pretty unlikely to change.

    Now, if I had another table where the natural key was two uniqueidentifiers, a char(15) and a date far in the past (say date of birth), then I'd be very likely to put the cluster on an identity and use a noncluster to enforce the natural key.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Yup, I chose to use date (I should have said datetime) as an example specifically for contrivane... ie, one user can't put entries in at the same time, but two users could. This makes fragmentation unlikely, but not impossible.

  • You will always have fragmentation, it's unavoidable. Small amounts (like in your example) is not a major concern. It's when inserts are all over the table (like Uniqueidentifier DEFAULT NewID() or a surname) that there's a major problem.

    Fragmentation's not only caused by inserts. If there are any updates that increase the size of the row, that will cause fragmentation. Deletes leave gaps in pages which reduce the efficiency of reads. That's why indexes need to be periodically rebuilt/reorganised.

    In your example, there's only a chance for a page split when two users insert at exactly the same millisecond, and the last page in the index only has space for one of them.

    Also, bear in mind that fragmentation's only a major concern if you're reading ranges of the tables. It won't affect singleton seeks much, if at all.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Through our testing and long-term production maintenance on several systems using wider clustered indexes, I've found that fragmentation, even on systems with lots of inserts, is not that big a problem. You may need to make a small adjustment or two in the free space on the index, but overall, the performance benefits from having just the right access to the data has completely overwhelmed any hits we've taken on fragmentation. Now, most of these tables have a single path for querying, which means we haven't had to maintain lots of extra indexes. If we did, the performance might have changed. If your access patterns are very consistent, you may see a huge benefit from this approach.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • If you Google Surrogate Key you will find lots of opinions, often in opposition to other opinions and sometimes (excessively) strongly held. The main viewpoints are 'Never', 'It depends', 'Always'. Eventually you will agree with one of these viewpoints.

    FWIW my view is It Depends. If you can show that defining a non-identity column as a cluster key is good for your queries, then use it. However, you need to be aware of why many people think that an identity column is normally the best choice for a cluster index in SQL Server, and only do something different when it is an advantage to you.

    Although many other DBMSs also have the concept of a cluster index, not all of them implement it in the same way as SQL Server. Arguements about what is good or bad in Oracle or DB2, etc, can give useful background but may not directly apply to this feature of SQL Server.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • Hmm, but if a user bulk inserts several rows at once, many of them could get the same datetime value (at least if we're talking about std datetime accuracy to ~3ms).

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Yep, that's true. I'm just using this as a contrived example.

    (IE, that's why it's not *guaranteed* to be an ever increasing key)

  • It's not even guaranteed to be unique, which would be a bigger problem .

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Again, yeah, but that's not really the point of the example.

    I'd actually argue that it is guaranteed to be unique, depending on how you define "guaranteed". Is it possible to send duplicates for insert? Yeah, sure. But that's always true for anything other than an internally generated surrogate. I'll issue the a challenge to dispute that. Can you come up with any natural key where it is not possible to insert (well, try to insert) duplicates? ISBN? Social security number? Nope, I can still attempt duplicate inserts. They'll fail if that's the key, sure. So will the date+user combo.

    I think using a date + user is pretty safe. Yes, we can try to insert multiple rows at once, but where did those rows come from? How did they originally all get the same datetime? That suggests that some user actually physically created more than one entry within a time span of 3ms.

    You might say yes, but what if we are importing from excel or something like that? Well, that just pushes back my question. How did that data get into excel with the same date and time for all three entries? Remember, this is a createdDateTime, not the insertedDateTime. That is to say, it's not the datetime that the row was inserted into the SQL Server table, but the date and time that the data was actually created, wherever that was, to begin with.

  • People are too concerned with consistency and theoretical issues. Why all this worry about having the clustered index unique? 🙂

    Making a clustered surrogate key (that you won't really use in any ranged select statements) makes little sense.

    I'd say just use the created column as clustered index alone. It will by far be the most effective, and in reality you will probably never have two entries on the same exact datetime - and there's no practical issue if you do.

    Thorbjorn Kvam,
    Project manager and DBA (design) at Payex (http://www.payex.com)

  • one (3/3/2009)


    Why all this worry about having the clustered index unique? 🙂

    A clustered index is always unique. The only question is do you make it unique, or does SQL make it unique?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (3/3/2009)


    one (3/3/2009)


    Why all this worry about having the clustered index unique? 🙂

    A clustered index is always unique. The only question is do you make it unique, or does SQL make it unique?

    Yes of course, but his concern is that his natural key won't be in natural order. if he use the date, it will be - and the index will probably be smaller compared to a covering clustered index. And it's the simplest solution (and probably as efficient or more efficient than any other) - which is always good in my book.

    (If he never does any selects based on date anyway, I guess it might be worth reconsidering a small surrogate key)

    Thorbjorn Kvam,
    Project manager and DBA (design) at Payex (http://www.payex.com)

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply