PK Identity as a convention, always more effective when there is a single int natural key?

  • So every table in the database has an identity ID field as the PK.

    If I take a table like Person and everything related to a person is a FK PersonID, what benefit does a normal ID field give here?

    It makes sure that the tables clustered index is always inserted to in order... though it's an artificial order having nothing to do with the table data.

    Since every join with these tables will be Person.ID <- table.PersonID, every index and join will reference PersonID as well as the ID, which is the actual address.

    If these truly are 1:1 tables, just extensions of the Person record, isn't it more effective to just have the PersonID as the primary key as that's what everything is going to be joining on? The application inserting records can just insert with a PersonID instead of trying to get back the Identity field

    I know there is a large divide between people who prefer artificial and natural keys. I normally lean on the artificial because I feel it's just a bit easier and maybe efficient to key off an int than what could a multiple field natural key. But in this case it seems like the natural key has more benefit with very little drawback on this very limited example of a 1:1 relationship.

  • You're right, this can be practically a religious war.

    In general, I've found that identity works well, but can be abused. In the situation you're describing for example, I'd go with what you're suggesting and not give those tables an identity column.

    Just remember, don't mix the primary key and the clustered index. Yes, by default the PK is clustered, but it absolutely doesn't have to be.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • It is very much a debate.. I tend to prefer surrogate/artificial keys. Business keys tend to change over time and while they are definitely needed, but I rarely key tables off them. Grant is absolutely right that the PKey and Clustered index don't have to be the same, often they are.

    As far as identity vs. not, like many things, it depends. Sometimes I like to have more control over the value and want to prevent/control gaps, in those cases I handle the value myself. In other cases where speed is a major factor and/or control is not as pressing I go with identity.

    However many of these decisions come down to design philosophy and experience. And there is not necessarily a "right" answer for most cases.

    CEWII

  • I'm a surrogate key fan too. I do tend to treat them like natural keys on the child tables. I was taught a mechanism where the master tables are are identity but the child tables are collections of the parent tables, and clustered. It can work really well.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Please allow me to spice it a little 🙂

    For starters, if you come to my desk with a 1:1 relationship better for you to have very, really strong arguments about why those two tables are not just one. I'll ask how and why a single entity is being implemented that way and I'll want to see the rationale of it from A to Z.

    In regards to PKs, in general I prefer natural keys for OLTP systems while I favor the use of surrogate keys on some DSS tables, specifically on Dimesion tables.

    Just my two cents.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • Grant Fritchey (5/29/2011)


    You're right, this can be practically a religious war.

    In general, I've found that identity works well, but can be abused. In the situation you're describing for example, I'd go with what you're suggesting and not give those tables an identity column.

    Just remember, don't mix the primary key and the clustered index. Yes, by default the PK is clustered, but it absolutely doesn't have to be.

    Yes, you're correct. I'm being lazy by not differentiating them, but generalizing based on my own current schema.

    Thank you.

  • PaulB,

    I agree about 1:1 tables, there are not many times I go for those. As for the surrogate vs. natural discussion, it depends..

    CEWII

  • PaulB-TheOneAndOnly (5/30/2011)


    For starters, if you come to my desk with a 1:1 relationship better for you to have very, really strong arguments about why those two tables are not just one. I'll ask how and why a single entity is being implemented that way and I'll want to see the rationale of it from A to Z.

    If I was dealing with relatively normalized data I think I'd share your opinion a bit more.

    In fact if a Person Table always needed 100 fields in one shot, why not have a 100 field table? Why make it 5 tables? What if Person Table has 100 fields, only 3 will be provided up front, another 10 later on and another 15 at another stage. Many fields are in groupings that only apply to 1-60% of the rows, so there are just a lot of nulls being replicated... or a bunch of default values indicating that nothing has been selected.

    I'm working in a system trying to be a OLTP and OLAP and doing neither very well. Things are not denormalized properly enough to make for fast easy reporting, and things aren't normalized enough to prevented bloated records filling up pages for no reason. IMO, sometimes something can be said easier and more efficiently by just not even existing than existing and telling the application to ignore it.

    If I could just just download your brain Paul I'm sure I'd take that advice and run with it, but I don't have your experience and this seems to be the most linear path to a maintainable system. A more normalized database.

    Oh and I'm not sure if it would influence you one way or the other, but our director of development is using Entity Framework to link directly to tables for the future promise of being able to dynamically generate screens that can update tables. So I'm dealing with a lot of SELECT * from tables and direct update/inserts.

    I can see normalizing these tables backfiring on me if the director wants to dynamically create a data entry form on a set of normalized tables and he asks me to denormalize them into one record.

  • For starters, if you come to my desk with a 1:1 relationship better for you to have very, really strong arguments about why those two tables are not just one. I'll ask how and why a single entity is being implemented that way and I'll want to see the rationale of it from A to Z.

    Two reasons come quickly to mind: information that has special security requirements and data which applies to only a small set of the 1:1. In a company of 5K employees where 150 have a company car, I would put the company car information into another table. Otherwise there are a lot of blank fields in an employee table as the vast majority don't have the cars.

  • ShawnTherrien (5/28/2011)


    So every table in the database has an identity ID field as the PK.

    If I take a table like Person and everything related to a person is a FK PersonID, what benefit does a normal ID field give here?

    It makes sure that the tables clustered index is always inserted to in order... though it's an artificial order having nothing to do with the table data.

    Since every join with these tables will be Person.ID <- table.PersonID, every index and join will reference PersonID as well as the ID, which is the actual address.

    If these truly are 1:1 tables, just extensions of the Person record, isn't it more effective to just have the PersonID as the primary key as that's what everything is going to be joining on? The application inserting records can just insert with a PersonID instead of trying to get back the Identity field

    I know there is a large divide between people who prefer artificial and natural keys. I normally lean on the artificial because I feel it's just a bit easier and maybe efficient to key off an int than what could a multiple field natural key. But in this case it seems like the natural key has more benefit with very little drawback on this very limited example of a 1:1 relationship.

    Personally, although you don't provide much detail, what you are describing doesn't really seem to be a question of whether to use a natural key or an artificial key in the typical sense. You seem to be referring to PersonID as your "natural key" but people do not naturally have IDs, unlesss by ID you are referring to their name (which usually makes a lousy natural key for a nmber of reasons, e.g. it might not be unique). It would seem that someone already created an artificial key for Persons (i.e., PersonID) and you are debating whether to use it in your "Person" table or whether to use another artificial key built specifically for that table. While, strictly speaking, PersonID may be a natural key in that it is part of the business model and not simply created for the data model, it was probably created with the same goals which usually go into designing a surrogate key so unless there is something decidedly atypical about this PersonID, something which we couldn't begin to guess without knowing how it was constructed, I would guess that is pretty much moot which one you choose to use for your primary key -- if you are planning to put PersonID in the tables anyway you probably might as well use it and not waste the space and effort on a separate key.

    That being said, the typical arguments regarding the advantages of surrogate keys over natural keys tend to involve either the unfortunate traits natural keys often have which make them: poor identifiers (e.g. non-uniqueness); inefficient (e.g. a 1000 character description field); or unstable (e.g. a key based on a subscription number where the business rules have the subscription number change on every renewal). Choosing to always use a surrogate key which is compact in size, and guaranteed to be both unique and not to change makes a lot of things easier in the long run . . . but admittedly has its own drawbacks (like the fact that you might find one day that you still need a unique index on a 1000 character field to guarantee that your business logic is as sound as your primary key; or the fact that most of your queries might be based on the natural key and the surrogate key really only gets used for controlling the joins) which is why the debate often becomes rather heated. FWIW, in general, I prefer surrogate keys but will sometimes use natural keys if I am confident that the natural key has the right traits. Since, in that regard, I have on occasion been confident and wrong (e.g. creating a natural key based on some business object which seemed like the Rock of Gibralter only to have someone decide sometime later that we should retroactively change the naming convention for that object, rendering all of my "natural keys" either obsolete or turning them into rather obscure surrogate keys:crazy:), I tend to lean heavily on surrogate keys.

    - Les

  • Grant Fritchey (5/29/2011)


    You're right, this can be practically a religious war.

    Heh... "It Depends". In my shop, they do it my way or I take them out for a nice pork chop dinner. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I come down on the side of the surrogate key for a simple reason:

    I have had many occasions to regret when I or (usually) someone else used a natural key, but have never regretted using a surrogate key.

    They are not exactly mutually exclusive. There is no reason to not have a unique constraint on a natural key with a surrogate primary key. At least later when you find that the natural key is not unique or is changing for business reasons you don't have a major change, and you don't have to redesign the whole database to work around it.

  • ShawnTherrien (5/30/2011)


    PaulB-TheOneAndOnly (5/30/2011)


    For starters, if you come to my desk with a 1:1 relationship better for you to have very, really strong arguments about why those two tables are not just one. I'll ask how and why a single entity is being implemented that way and I'll want to see the rationale of it from A to Z.

    I'm working in a system trying to be a OLTP and OLAP and doing neither very well. Things are not denormalized properly enough to make for fast easy reporting, and things aren't normalized enough to prevented bloated records filling up pages for no reason.

    ...and you are correct, you can't cook and take care of the cat at the same time 🙂

    I would approach this as a two steps project.

    Step #1 - be sure there is a efficient OLTP system to take care of the operation of the company.

    Step #2 - design and build a data warehouse to take care of reporting and business intelligence needs.

    If I was in charge, I'll make sure OLTP system is based in a normalized design - 3NF would do the trick and, I'll rely in dimensional modeling for the Data Warehouse project.

    Just my two cents 🙂

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • My comments:

    1. If you use an artificial / surrogate key, it should be used in conjunction with the natural key. The column(s) of the natural key should have it's own unique constraint to enforce uniqueness.

    2. Another reason to use an artificial key comes down to whether this will be your clustered index (CI), and in understanding how the choice of the columns in your clustered index affect other things in your database. I'd suggest reading The Clustered Index Debate blogs by Kimberly L. Tripp. Lots of good reasons there to use an identity column.

    3. An artificial key can simplify joins between tables (one column vs. all that make up the natural key).

    4. An artificial key can reduce your storage requirements in child tables (one column to enforce FK RI, vs. all that make up the natural key).

    5. If the artificial key is the clustered index:

    5a. reduce storage requirements of clustered indexes, if chosen for the CI.

    5b. reduce storage requirements of non-clustered indexes (again, if chosen for the CI) (since the keys of the CI are stored in the non-clustered indexes).

    5c. If the artificial key is an identity column, and is the CI, it will reduce page splits of the CI. This leads to less disk fragmentation (so disk IO will be faster), and pages will be fuller. With less page splits, the system can load extents (8 8k pages) (if the next logical page is the next physical page); otherwise it will have to load pages one at a time (8k). Multiple trips down the IO stack will take longer, and cause increases waits in your system.

    5d. With both fuller pages, and less storage requirements, this will allow more pages to fit into memory, increasing the overall speed of your system. Yes, disk space is (relatively) cheap these days; getting all that data into memory isn't. Wasting gigs of storage space means that gigs of unnecessary data will be loaded into memory, pushing out other pages. It will also take longer to load into memory.

    5e. With less storage requirements, backups and restores will be faster. (Consider how long it would take to recover your db in a disaster...)

    So, like has been said, lots of reasons on both sides of the fence. Evaluate each case, and make the decision that makes the most sense for your situation.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Well thank you all for your input.

    I can see that I really opened up a can of worms in the way I asked this question... maybe it was too general. I'm sure just the title drew down attention totally aside from what my actual question is 😀

    I'm talking about really specific cases and maybe that in itself is an indication that I shouldn't have special rules for some tables vs other tables unless it's just needed for performance.

    Really in this case, as pointed out, I can still set up the PK to the FK value if it's unique and order the ClusteredIndex off of this other value. That would include all of the benefits I have for using an identity, with whatever pro's/con's I was looking at for not Clustering or Keying on the Identity.

    I'm going to test things with the clustering on these tables. I'm really not sure if it's worth the time from deviating from the identity for PK/Cluster. It may be a negligable performance increase and just cause confusion for developers because it's not following the normal rules we have for tables.

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply