Choosing the best column(s) for a clustered index?

  • I've always struggled trying to determine the best column(s) to create the clustered index on. I know generically just put it on the primary key identity column (int).

    Back in my college days i had an excellent professor who specialized in databases. He had a phd in databases and ended up moving (after his wife finished her degree) to work for IBM on DB2. When he got to the clustered index discussion i remember him talking that a foreign key column in a detail table, that would most likely be the main thing you would be selecting off of, would be an excellent candidate for a Clustered Index.

    I know i didn't word it in the best way so here is an example:

    Order(OrderId)

    OrderDetail(OrderDetailId,OrderId).

    OrderDetail ClusteredIndex(OrderId)

    Probably the most common select criteria will be something similar to:

    Select ....

    From OrderDetail

    Where OrderId = @OrderId

    Back in the day we were working with informix. He stated that in a clustered index there would be 1 row in a clustered index for each unique value in your clustered. Don't know if it would be any different with Sql Server. So in my example above the # of rows would be 'Distinct (OrderId)'. This of course would have the advantage of the index being slightly smaller.

    In the example above the value would generally be always increasing (course there would be exceptions) so i would think that would minimize having to reorder the index.

    The advice i've heard from the sql server community is to always put it on the primary key. The disadvantage to this of course would be the size of the index. Also if your select criteria is by OrderId rather than OrderDetailId.

    Is there any merit to what i learned back in the day in college? If not is it because of Sql Server or maybe would the advice be different in another db engine (oracle, postgres, ect...)? If you should always use the identity column is there many/any circumstances where choosing another clustered index key would be advantageous?

  • Just in terms of generic storage and retrieval, not talking about any particular query or structure, the best place to put a clustered index is on a monotonically increasing, as narrow as possible (meaning as small a data footprint as possible, smallint over int, int over bigint), unique data value that never changes. There has been tons and tons of testing published that bears this out. So, what's that mean? Put it on the identity column.

    Now, let's talk real world.

    In some cases, this is absolutely the right answer. In others it's not. As a general rule, my own recommendation is to put the clustered index onto the most common data access path. For many tables, the most frequent path to the data is through the primary key. Further in many instances, the natural key for the table is not going to index well, so creating an artificial primary key though an identity column and having that be the clustered index makes sense. In others, another path is the right way to go. In lots of cases, that could be the foreign key or another column or columns. Don't be scared to have a compound key, if that makes sense for the design. Here's an example [/url]of where a compound key made a huge positive performance impact. It's also an example of using foreign keys as both the clustered index and as part of the primary key (it's a design that can make people a little crazy when they first see it, but it worked extremely well).

    One additional note, since the clustered index identifies the storage of the row, there is an understandable bias towards making it a unique value (hence, so many primary keys). If the defined key is not a unique value, SQL Server will add a counter to the cluster in storage, called a uniqueifier, which will make the values unique in order to successfully recover the data after storage. This means that a non-unique key is actually going to be wider than the data type implies within the b+tree structure of the index.

    In short, your college prof was not nuts. It is a valid approach. Just one you should take with some judgement around what is best for the given situation.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Ncage (3/9/2016)


    Back in the day we were working with informix. He stated that in a clustered index there would be 1 row in a clustered index for each unique value in your clustered. Don't know if it would be any different with Sql Server.

    In SQL Server, there is one index entry for every row in the table, in every index (*). So in your example, if you have 1000 orders of 4 order lines each, you will have 4000 index entries in every index on the table.

    (*) For the pedantic readers out there, filtered nonclustered indexes are the exception to this rule

    In the example above the value would generally be always increasing (course there would be exceptions) so i would think that would minimize having to reorder the index.

    The advice i've heard from the sql server community is to always put it on the primary key. (...) Also if your select criteria is by OrderId rather than OrderDetailId.

    In the case of your example, the two advices can easily be combined. I see no reason whatsoever to add a surrogate key to the OrderDetail table. For that table, both the primary key and the clustered index should be on (OrderId, OrderLineNumber).

    If you should always use the identity column

    You already made the mistake of assuming that there should always be an identity column.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Ncage (3/9/2016)


    Order(OrderId)

    OrderDetail(OrderDetailId,OrderId).

    OrderDetail ClusteredIndex(OrderId)

    Probably the most common select criteria will be something similar to:

    Select ....

    From OrderDetail

    Where OrderId = @OrderId

    There is a missing piece in your example:

    Select ....

    From OrderDetail

    Where OrderId = @OrderId

    ORDER BY [????]

    This will tell you how to define your clustered index:

    OrderDetail ClusteredIndex(OrderId, [????])

    This would be perfect if you always select details for one order at the time, never for a bunch of orders.

    _____________
    Code for TallyGenerator

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply