Indexing - Should primary key be a natural key or an identity?

  • I have long been in the camp that says all rows should have a single unique identifier, usually an Int. It is compact, makes nonclustered indexes tighter, and makes life MUCH easier when a value of the natural key changes.

    But I'm sure there are people out there who disagree. I would love to hear any reasons pro and con. So if you feel strongly about this topic from either side, please help me out here.

    Thanks

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • My 2 cents worth on this subject: it depends!

    😎

    Few quick thoughts:

    There is no "one size fits all", it all depends on the subject.

    As a general rule a table should have a clustered index and if no unique natural key is present then it is appropriate to use an identity for the clustered index.

    If a unique natural key exist then that would normally be the first choice for a clustered index. There are exceptions to this where the key is too complex/wide to be practical.

    The clustered index does not have to be the primary key and often it is better to separate the two, especially when the primary key inserts would cause heavy fragmentations, i.e. GUID etc..

    Appropriate modelling of the database can and often will reduce the number of complex keys.

  • Here are some reasons to consider.

    A PK on a "popular" look-up table has to be mentioned in every other table referencing it.

    If that PK has 3 or more columns - the overhead will be quite significant.

    Not to mention complexity of JOIN statements.

    If there is a string column participation in the key then you need to consider the overhead of matching strings (do not forget variable length, collation rules, etc.) comparing to matching 4 bytes of integers.

    On some occasions you might need to change the definitions of the unique key.

    Say, you go international, and you need to add CountryCode to the key definitions.

    With natural key approach it would mean changes to every table referencing that key, causing major redevelopment.

    _____________
    Code for TallyGenerator

  • Eirikur Eiriksson (9/3/2016)


    if no unique natural key is present then it is appropriate to use an identity for the clustered index.

    Those are completely non-related things.

    Clustered index does not have to be unique.

    Clustrered index must represent logical sequence of the records in the table.

    IDENTITY values might happen to represent the actual logical sequence of data - then it's a good candidate for a clustered index.

    To see if it's the case - check which columns are used for range search and grouping.

    Those columns must be the right candidates for clustering.

    _____________
    Code for TallyGenerator

  • If there is a valid natural key present, use it. As an extreme example, tables that have an ID, state code, and state name, when state code and state name are sufficient are perplexing to me. The first allows a record to refer to Ohio as OH. This can stand on its own. The second would require some number, such as 24, to represent Ohio. This cannot stand on its own and would require a join to the lookup table. While the comment that changes to the key would be much easier is partly true (it is easier, but the change doesn't have to be particularly hard either), using the business key as the primary key also better prevents duplicates. I've seen tables with duplicates. A unique index would prevent this, but so would its use as the primary key.

    Sometimes there is no good business key. Then you must use an integer. Don't use a GUID unless there is a requirement to merge information with another system.

    Although the SSN would seem to make a good primary key, never use it as such.

  • I'm sure this topic will get a lot of "It Depends ..." answers. Here's mine.

    An identity as PK is easy. It will guarantee that any record inserted into the table will not violate the primary key.

    However that easiness creates another complexity (as I'm sure the venerable posters already in this thread know, but it hasn't yet been said, and some readers may not know it). If you use an identity, there is nothing inherently prohibiting you from entering two duplicate records except for the key. In the example of the state table, unless the state code is defined to be unique, you could enter another OH record and SQL would dutifully add the record with a new index value, say 56. Now you have 2 Ohio's in your table.

    The OP asks what happens if the value natural key changes. The answer to that again is it depends. Should this be treated as a change in data for the same record? Or is it a new instance? That depends on the application.

    Another consideration is foreign keys. Numeric ID's are easier here. But again you need constraints and coding to ensure the values you are joining tables on are correct.

    Aside: Some may notice I used the word "record" here. That's partially my old-school background (ISAM files in FORTRAN!). But it's partially intentional here. A "record" to me refers to the data properties that define why this should be a row - or a tuple, a word I never got used to using.

  • gvoshol 73146 (9/6/2016)


    If you use an identity, there is nothing inherently prohibiting you from entering two duplicate records except for the key. In the example of the state table, unless the state code is defined to be unique, you could enter another OH record and SQL would dutifully add the record with a new index value, say 56. Now you have 2 Ohio's in your table.

    Use of identity as a primary key does not mean that a corresponding natural key should not be implemented as a unique constraint.

    I always create unique constraints along with identity PK's.

    Identity is just a token, reference number, link to the actual entity.

    Like SSN for USA citizens - a single number identifying a person with all associated names and properties.

    _____________
    Code for TallyGenerator

  • Sergiy (9/3/2016)


    Eirikur Eiriksson (9/3/2016)


    if no unique natural key is present then it is appropriate to use an identity for the clustered index.

    Those are completely non-related things.

    Clustered index does not have to be unique.

    Clustrered index must represent logical sequence of the records in the table.

    IDENTITY values might happen to represent the actual logical sequence of data - then it's a good candidate for a clustered index.

    To see if it's the case - check which columns are used for range search and grouping.

    Those columns must be the right candidates for clustering.

    Except that MS SQL Server adds a uniqueifier to the index of a clustered index if the index is not unique.

  • Except that MS SQL Server adds a uniqueifier to the index of a clustered index if the index is not unique.

    This can't be a showstopper, can it? This seems to drive a lot of people to reflexively say that the clustered index should be unique. But in the case of an ID field for which ranges aren't common WHERE criteria, this doesn't seem like a good idea to me.

  • RonKyle (9/6/2016)


    Except that MS SQL Server adds a uniqueifier to the index of a clustered index if the index is not unique.

    This can't be a showstopper, can it? This seems to drive a lot of people to reflexively say that the clustered index should be unique. But in the case of an ID field for which ranges aren't common WHERE criteria, this doesn't seem like a good idea to me.

    It isn't a showstopper. SQL Server needs each value in the index to be unique as it represents the logical order of the data on disk. If two key values are the same, which row of data does it point to and how does it order them? That is why it adds a unique value to key.

  • I reject entirely the notion that there is such a thing as a "default" clustering key. That is the issue to me. While certainly there are times when an identity column really is the best clustering key, by my testing and tuning, it's less than half the time. Yet far too many people default (nearly) every table to cluster on identity. From my experience, that's a serious mistake that is the single biggest performance detriment overall to SQL tables.

    The default approach for many people seems to be to slap an identity on the table, regardless of any lookup considerations, and then start creating covering indexes for (nearly) every (major) query. But that ignores the total cost of such an approach. You have to figure in the additional time not only to maintain those indexes, but to read and buffer them as well.

    By definition, row buffers read by the clustered key can be shared by every query that needs those rows. If you have custom indexes, the same rows, albeit for different columns, must be read from a different covering index. Since typical table usage is that more recent rows are read vastly more often, often the single, shared buffers is better for performance than multiple, customized buffers.

    Again, there are certainly times when a shorter clustering key and more covering indexes are warranted. But not by default. In fact, that should never be by default. Rather, the actual index missing stats, index usage stats and index op stats should be reviewed as part of index setup, esp. immediately after table creation. Naturally you review "hard-hitter" queries too, but within the context of overall performance.

    Finally, we shouldn't get overly concerned about a few page splits, as long as they aren't excessive. Don't stick with an identity just because you might get a few out-of-clustering-key-order rows inserted. Keep in mind, too, that the row will be read 100x, 1000x, whatever times more than the 1 time it is inserted.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Eirikur Eiriksson (9/3/2016)


    My 2 cents worth on this subject: it depends!

    😎

    Few quick thoughts:

    There is no "one size fits all", it all depends on the subject.

    As a general rule a table should have a clustered index and if no unique natural key is present then it is appropriate to use an identity for the clustered index.

    If a unique natural key exist then that would normally be the first choice for a clustered index. There are exceptions to this where the key is too complex/wide to be practical.

    The clustered index does not have to be the primary key and often it is better to separate the two, especially when the primary key inserts would cause heavy fragmentations, i.e. GUID etc..

    Appropriate modelling of the database can and often will reduce the number of complex keys.

    This is pretty much my approach. On non-lookup tables I'll usually create an identity column (often as a non-clustered PK), even when a natural key is present. This is, as Dixie points out, in case the natural key changes; it can be used to change the natural key and/or replace it. In standard use it does not get referenced in any code or from any other table though.

    RonKyle


    tables that have an ID, state code, and state name, when state code and state name are sufficient

    The numbers of times I've advocated this kind of thing, and people just look at me like I've got two heads...

    Country codes and Currencies are good examples, backed by ISO standards.

    Often the problem is the application code has Enum classes with the id in strung throughout them. Which is a whole 'nother headache.

  • I reject entirely the notion that there is such a thing as a "default" clustering key.

    The numbers of times I've advocated this kind of thing, and people just look at me like I've got two heads...

    Always good to know there are others that agree.

  • The other really huge performance factor to consider is how the tables are joined. Often natural keys allow for merge joins, which are extremely efficient even on very large data sets.

    For example, probably the most common natural-key-over-just-identity is for something like an OrderItems table. Since you will (almost) always join by order#, it's always made sense to me to cluster the OrderItems table on ( OrderId, $IDENTITY ) rather than just $IDENTITY. [Yes, even with a longer key.] The OrderId is itself an identity, and thus already sequential, although of course a few rows may be inserted out of order into the items table.

    Another common case is when your initial select criteria is (almost) always by a date/datetime range. Then it's typically best to cluster the table on the date/datetime column first (or only).

    Edit: Corrected typo; edited for clarity.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Again, there are certainly times when a shorter clustering key and more covering indexes are warranted. But not by default. In fact, that should never be by default.

    I wish somebody told it to the guys in Microsoft.

    _____________
    Code for TallyGenerator

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply