Primary Key and Clustered Index

  • A topic I always get confused with is primary key vs. clustered index. I have an understanding of what each is and/or is not but I struggle with the application of the two. Suppose I have a table with an account number field in which all account numbers are unique. Is it worth having an ID field that increments when a new record is added? Any filtering or joining will be on the account number and the ID field will be completely irrelevant as far as a useful data goes. Suppose my table is similar to this

    create table TestTbl
    (
    ID int IDENTITY(1,1)
    ,AccountNumber bigint
    ,field1 int
    ,field2 varchar(10)
    ...
    ,field15 numeric(9,2)
    )

    Is the ID column necessary given the provided information? If so, what's the best way to assign the keys and indexes?

  • No, ID is absolutely not needed.  It's likely to hinder processing since people tend to inevitably cluster on it.

    Use the AccountNumber as the clustering key and the primary key.  You don't have to specify a primary key, but it's a good idea to specify one when you can.

    If the AccountNumber column is not inherently sequential, you may have to rebuild or partition the index, but that is generally preferable overall for performance to using the wrong clustering column.

    As to the difference between a PK and a clus index, consider the case where you (almost) always GROUPed by a set of columns that weren't unique, but whose values were generally assigned consecutively.  In that situation, it could make sense to cluster by the GROUPed set of columns but have a separate PK of the $IDENTITY column.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • I totally agree with Scott on this one... the IDENTITY column is not needed, occupies space on each page (which is also reflected in memory and disk), and is a temptation to people who can't resisting putting unique Clustered Indexes on IDENTITY columns everywhere...

    ...Ah... but hold on a minute... are you sure that the "AccountNumber" column will ALWAYS AND FOREVER BE ABSOLUTELY 100% IMMUTABLE?

    If you're working for a bank or other financial institution and having lived through the nightmare 7 or 8 times now, I can guarantee you that the account number is NOT immutable.  Banks merge and, when they do, account numbers change.  If you've used "AccountNumber" as a reference in hundreds of tables, you'll need to change the value of the "AccountNumber" in those hundreds of tables.  And, it doesn't take a merger.  It could be a bank simply changes name or ownership and management feels compelled to change the bloody account numbers.  I've even seen it where some idiot savant gets a bee in their bonnet and makes a stupid decision to do something like regionalizing account numbers or adding/removing site numbers, etc, etc, ad infinitum.

    Like I said, I've lived the nightmare many times now so as the Knight said to Indiana Jones when it came to selecting the actual Holy Grail, "choose wisely" and be aware of what will be needed to be done if the account numbers ever do change.

    If you do decide to go the IDENTITY route, then you'd have to train everyone to NEVER use the "AccountNumber" as a reference in other tables (treat it like you would properly treat an SSN).  That's probably more difficult than it's worth especially if you "get lucky" and the "AccountNumber" is never changed.

    I'll also state that financial institutions (especially banks) aren't the only ones that end up changing account numbers for some of the same ill-begotten reasons that banks do.

    To state it again, there are some pretty significant ramifications of either using the "AccountNumber" as the Primary Key or using and IDENTITY column as the PRIMARY KEY (and the "AccountNumber" would be reduced to simply a non-null unique constraint, which would also create an index on it).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for all of the information. It was very helpful because some things I read swear by adding an identity column for the primary key regardless of what type of data is in the rest of the table. This clears all of that up.

  • RonMexico wrote:

    Thanks for all of the information. It was very helpful because some things I read swear by adding an identity column for the primary key regardless of what type of data is in the rest of the table. This clears all of that up.

    I'll tell you, as I know Scott will, that automatically just adding an ID column to every table is a mistake.  They do have their uses, though, and not all of them are for the purposes of being a surrogate or real Primary Key.  Sometimes they can be used to greatly reduce the size of bridge-tables, control the fragmentation of high volume inserts on large tables, or act as a temporal tie-breaker for Clustered Indexes that are based on dates and times, just to name a few of the uses.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • If the AccountNumber must be changed, it will have to change whether it's the clustering key, and/or a PK, or not.

    In that situation, you'd temporarily disable FK checking as needed and make the changes you need to make.  For this table, you'd want to write out a new table rather than update-in-placee.  That's not really a disadvantage.  Once the data is fully converted, you simply rename the tables to change the new table name to the original table name.

    It's very likely changing the AccountNumber PK for this table would be the easy part of changing Account Numbers.  The business ramifications through the other code would likely be far more difficult to deal with.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Jeff's response about account numbers is the same as my reason for almost always using an IDENTITY as the primary (and clustered) key. As long as it's the IDENTITY column that gets used in foreign keys, etc. you don't have a major change to work through when someone decides to change the way the natural key gets allocated. There's nothing to stop you having a unique index on the natural key as well, if that's what will be used for lookups, etc.

  • Acount Numbers do sometimes change - for example in the case of identity theft a customer may want to change his/her account number. However it surely ought to be unique in your table so it's right to make it either PRIMARY KEY or NOT NULL UNIQUE (those two types of constraint achieve exactly the same thing). Whether Account Number makes the best cluster key depends on what other columns you have and how the data is used. It may well be the best candidate for clustering.

    I suggest you do not use IDENTITY for any business key. Use a SEQUENCE to generate the number because that way it's easy to update. You don't necessarily need a surrogate key for your table as well.

    I agree with Jeff and Scott in other words. I just thought I should mention sequence because no one else has.

  • A part of the reason why I use IDENTITY instead of sequence is because IDENTITY IS much more difficult to update and it's flat out not supposed to ever be updated.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Business data belongs to nominated data controllers and data subjects. Data does not belong to database designers and developers and as a technology professional myself I try to keep that fact at the forefront of software design decisions. Knowing that the business domain is complex and not always logical or predictable I like to try to accommodate change rather than put unneeded technical obstacles in the way. That's why I have never been a fan of the IDENTITY feature in SQL Server. It's a different matter for other DBMSs because the no-update limitation of auto-incrementing columns is peculiar to SQL Server.

  • nova wrote:

    Business data belongs to nominated data controllers and data subjects. Data does not belong to database designers and developers and as a technology professional myself I try to keep that fact at the forefront of software design decisions.

    I believe that's one of the primary reasons for data, design, security, and business failures.  While I absolutely DO agree that the business processes belong to such "nominated data controllers and data subjects", they simply do NOT know the best way to retrieve the data nor even store it, never mind protect it or the machines on which it lives.

    nova wrote:

    Knowing that the business domain is complex and not always logical or predictable I like to try to accommodate change rather than put unneeded technical obstacles in the way.

    Now that, I agree with.  Well, up to certain limits because one person's idea of what an "unneeded" technical obstacle is quite different than another's.  For example...

    nova wrote:

    That's why I have never been a fan of the IDENTITY feature in SQL Server. It's a different matter for other DBMSs because the no-update limitation of auto-incrementing columns is peculiar to SQL Server.

    I do NOT agree with forsaking the proprietary extensions of one RDBMS just because other RDBMSs do not have such functionality.  It's a bit like telling everyone in the company that they can only use 4 function calculators because, someday, they might hire someone that only knows how to use a 4 function calculator.  By the same token, it would be a really bad thing if everyone in the company were suddenly forced to give up their common "scientific notation" calculators and force all of them to all use HP Reverse Polish Notation calculators.

    A good example of this is to compare the intrinsic date/time functions in Spark SQL with T-SQL... DATEDIFF and DATEADD in Spark SQL are relatively crippled compared to T-SQL.  By the same token Spark SQL has other date/time functions that make some date/time calculations a whole lot more simple that T-SQL doesn't have.

    To boil it all down, don't ever program like you might someday need to migrate code from one RDBMS to another because, no matter how simple something may be, there will ALWAYS be a rework requirement to make the code work on the other RDBMS unless you stick with the most basic and pure functionality, which also would mean you're using the proverbial 4 function calculator.

    Pure migration is a myth and a false expectation.

    That being said, if you DO have to work with multiple RDBMSs, you'd better know how to accomplish the same functionality in all of them.  For example, you say...

    nova wrote:

    That's why I have never been a fan of the IDENTITY feature in SQL Server. It's a different matter for other DBMSs because the no-update limitation of auto-incrementing columns is peculiar to SQL Server.

    The normal use case for an IDENTITY column is to create an auto-incrementing value that DOES need to be rather immutable.  With that, IDENTITY columns ARE appropriately difficult to change the values of and that's a good thing.  Sequences don't have that functionality built in.  While sequences DO have their uses that simply cannot be accomplished by the use of an IDENTITY column, using sequences exclusively because some other RDBMS has no such equivalent to an IDENTITY column is a horrible pre-optimization that would make Knuth puke and is a bit like using only a 4 function calculator because some poor slob may not know how to use a scientific calculator.

    I'll also state that IDENTITY columns ARE the right thing to do for surrogate keys because they are difficult to mutate.  You shouldn't be mutating such columns to begin with.  If you're going to use sequences to replace such columns on the false hope of pure migration or practicing for someday working on another RDBMS, then you should also have code that makes people jump through the same hoops to mutate their values as IDENTITY does because the values in such designed columns SHOULD be difficult to mutate.  If you're not doing that, then you're setting up the database for eventual failure by mistake.

    And that takes us all the way back to who owns the data...

    nova wrote:

    Business data belongs to nominated data controllers and data subjects. Data does not belong to database designers and developers and as a technology professional myself I try to keep that fact at the forefront of software design decisions.

    A part of the job of designers, developers, and (especially) DBAs is to protect the data and that means taking ownership of the data and the design to protect the data (and the machines it lives on) at all cost.

    People that think otherwise are setting the company up for failure of one type or another and history is littered with examples of such failures.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • In my previous post I wasn't making any point at all about portability, which I agree is no necessary consideration in designing a database. My point was just about SQL Server's limitations on the use of an IDENTITY column. Designing out the need to update a column may well be a good thing, but denying the technical ability to do it at all just takes away the choice from the people who do have the competence to make a wise decision. It seems that we both agree that some keys do need to be updated but Microsoft limits our options for key generation in that case - Microsoft's fault, not the database designer's.

  • nova wrote:

    In my previous post I wasn't making any point at all about portability, which I agree is no necessary consideration in designing a database. My point was just about SQL Server's limitations on the use of an IDENTITY column. Designing out the need to update a column may well be a good thing, but denying the technical ability to do it at all just takes away the choice from the people who do have the competence to make a wise decision. It seems that we both agree that some keys do need to be updated but Microsoft limits our options for key generation in that case - Microsoft's fault, not the database designer's.

    My point is that the choice to update an IDENTITY column has NOT been made impossible and we have NOT been denied the opportunity to do so.  It has been made appropriately difficult and that's a very good thing because you should NOT be updating the values in an IDENTITY column.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply