Overusing Identities

  • I think every table design has to stand on its own merits.

    I use whetever design techniques are at my disposal.

    Like any feature you use, you use it more if it solves more problems in your requirements environment. You use it less if the problems it solves best (biggest advantages, fewest drawbacks) aren't very common (like dynamic sql, for example)

    There are a lot of questions to ask about any table during its design and lifecycle.

    Is there a natural key? That's a good candidate for primary key. If you cluster on the natural key, will there be page issues caused because of the volume of data?

    Is is quite likely that the candidate natural keys are not truly immutable? Will changes to the natural keys be difficult/impossible to handle? Will changes to natural keys make it difficult to maintain audit trails if the natural keys are used to identify rows?

    Can an identity PK resolve these issues? Is it a good choice for clustering or not? Are there other unique constraints/indexes we should enforce but not necessarily use them as foreign keys in other tables.

    Does the ID need to be generated in application prior to the database being involved?

    If an identity PK can't resolve the issue, do we need a distributed and globally unique identity? - COMBID/NEWSEQUENTIALID

    I found a performance problem where I had a natural multi-column key (with an extremely wide key for the account and a supplemental statistic identifier necessary to make it unique - normalized design of multiple stats per account), and I changed the design to add a surrogate foreign key for the widest part representing the account in a pseudo-dimensional design, so that my resulting design was simply ACCOUNT_SURROGATE, STAT_CD, STAT_VALUE. This narrowed the table immensely, and allowed processing of this high-cardinality table to proceed much more quickly.

  • Les Cardwell (3/1/2010)


    Steve,

    I'm shocked! :w00t: The only purist rationalization for a surrogate PK is when no 'natural key' exists. Otherwise, it's just a throwback to mainframe days when files used data keys to reference (point) records, and at the least, a corruption/minimalization of a relational model. So much for unordered relations :crazy:

    Shocked? What's a natural key? A PO number? Ever talked to someone about those? they're made up, usually by someone that keeps a spreadsheet, or even a piece of paper on a desk. Employee numbers? made up completely. SSNs, some logic, but essentially made up, and not guaranteed to be unique. SKUs? sequential numbers applied for, same for ISBNs.

    So many business numbers we use are made up. Granted there are items like an ISBN that are regulated, and should provide a strong natural key, but even those change over time. ISBNs have changed (Grown) in the time I've been in publishing.

    I am not arguing that natural keys don't exist or that they shouldn't be used, but there are many time in business when you find something that appears to be a natural key, but isn't. Especially across time.

    An identity is different than a record number or locator. It has nothing to do with physical location, or even insertion order, and you must be aware of that.

  • Steve Jones - Editor (3/1/2010)


    Shocked? What's a natural key? A PO number? Ever talked to someone about those? they're made up, usually by someone that keeps a spreadsheet, or even a piece of paper on a desk. Employee numbers? made up completely. SSNs, some logic, but essentially made up, and not guaranteed to be unique. SKUs? sequential numbers applied for, same for ISBNs.

    So many business numbers we use are made up. Granted there are items like an ISBN that are regulated, and should provide a strong natural key, but even those change over time. ISBNs have changed (Grown) in the time I've been in publishing.

    LOL - well, Invoice#'s, PO#'s, and the like should be the PK (the PK describes the attributes within a given tuple - or rather, the attributes describe the PK, and only the PK - e.g. all attributes in an invoice record describe that Invoice#). Employee numbers - same, whether system generated or manually derived. SSN's suffer from data integrity on a larger scale (and now have PC issues as well), so are relegated to attributes of a person, especially since all persons do not have SSN's. Rather, persons usually exist in a class context (Customers, Vendors, etc) and therefore are most appropriately relegated to surrogate PK's, especially given the data-correctness and uniqueness aspects of their name attributes. Then there are those PK's that might change, and if it's decided that those that change can be mutable (e.g. ISBN), it's the reason we have Cascade RI options. And 'Attribute tables' (UDC's, etc) pose signficant tertiary join problems when saddled with surrogate keys, adding significant amounts of transactional code to what is otherwise a simple 'lookup', especially when multiple dependencies exist in creating an assembly item record with attribute dependencies (e.g. -'Pears', 'Anjou', 'Box', '55ct', 'Size4', 'GradeA' - where each subsequent attribute is dependent on the one prior)

    As to identity keys being ordered and clustered... kinda nutz when one realizes that unless clustered on another key, the tuples alread exist as clustered unless the identity key is renumbered somewhere along the line.

    AAR, JM2c 🙂

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • So many business numbers we use are made up. Granted there are items like an ISBN that are regulated, and should provide a strong natural key, but even those change over time. ISBNs have changed (Grown) in the time I've been in publishing.

    I am not arguing that natural keys don't exist or that they shouldn't be used, but there are many time in business when you find something that appears to be a natural key, but isn't. Especially across time.

    What do you mean by "natural key" in this case? I much prefer the term business key but natural key means the same. It means an identifier used by the business / end user that has some external meaning outside the database. Whether it is made up or not is pretty irrelevant and arbitrary it seems to me. Ultimately all names and numbers are symbols invented by humans. I don't see why it is useful to make any distinction about where keys originate from. If it is intended to be used by the business to identify something then de facto it is a business key (= "natural" if you prefer).

    By contrast a surrogate is a key which has no external meaning at all. It is never used outside the database.

  • It's not a question of the origin of the number, but the fact that it's arbitrary-generated, and as such isn't necessarily a PK. In the paper world, when someone generates a duplicate PO number it may or may not get corrected, and may or may not cause issues. In a database, it's a bigger issue as we try to build rules for how the computer manages data.

    If the key is only used in the db, and used to manage data and maintain links and integrity, is it an issue?

  • the basic rule is that if bad data is possible to exist by database constraints (even if it shouldn't by business rules) somehow it will get into your database

    if you have a natural key that should be unique, then make it unique

    but for a primary key I tend to always use a surogate key, as this will always be a single column Int for smaller indexes and fast joining, and to allow for changes to business logic that might affect the uniqueness of the natural keys

  • Steve Jones - Editor (3/1/2010)


    If the key is only used in the db, and used to manage data and maintain links and integrity, is it an issue?

    It is a most important issue for the business user, yes. The business user needs to be able to identify in the real world the things which the database is supposed to record for her. A surrogate key does not do that. A business key does. Without a business key the data in the database is probably useless - or at least ambiguous.

  • Steve Jones - Editor (3/1/2010)


    It's not a question of the origin of the number, but the fact that it's arbitrary-generated, and as such isn't necessarily a PK. In the paper world, when someone generates a duplicate PO number it may or may not get corrected, and may or may not cause issues. In a database, it's a bigger issue as we try to build rules for how the computer manages data.

    If the key is only used in the db, and used to manage data and maintain links and integrity, is it an issue?

    Of course it creates issues... I'm dealing with a wagon load of such right now due to the fact that the legacy system did not use a WO# as a PK, and in converting to a new system (JDE1 - ugg), we're left with resolving. Not to mention the times that an engineer was using one WO#, logging costs, only to find that it was a dup... who's transactions had then posted to all other related systems. Can one say 'job security'?

    A PK has 'meaning', and creates business dependencies throughout the domain...

    Codd is sputtering... :pinch:

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • Les Cardwell (3/1/2010)


    Of course it creates issues... I'm dealing with a wagon load of such right now due to the fact that the legacy system did not use a WO# as a PK, and in converting to a new system (JDE1 - ugg), we're left with resolving. Not to mention the times that an engineer was using one WO#, logging costs, only to find that it was a dup... who's transactions had then posted to all other related systems. Can one say 'job security'?

    A PK has 'meaning', and creates business dependencies throughout the domain...

    Codd is sputtering... :pinch:

    I think people are getting caught up in semantics. For me the PK is *purely* about linking records between tables and has no "meaning" beyond that. For performance reasons and others you want that PK as small as possible, guaranteed unique, and never repeated. An identity meets those needs admirably. It also prevents page fragmentation because records are never inserted, only appended (assuming the PK is clustered, which it probably should be).

    A unique natural key should be *marked* unique. That way you can't get duplicate work orders or whatever. That, after all, is the *point* of a unique constraint. But it does NOT have to be the PK to do that.

    As for the arbitrary quality of natural keys, yes, and what's your point? 🙂 That's a *good* reason to make them a non-clustered index, that way if you have to change them (like an ISBN) you won't be as constrained. Set up properly and your code can handle the change without modification.

    Always assuming the key doesn't change from integer to string or some such...

  • I think Roger has put things well. I'm not saying you don't have a PK. I'm saying identities work well in that role.

  • {edit}... never mind. Bad post.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Codd is sputtering...

    Who is this Codd character?? Probably someone related to that Celko character! :w00t:

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (3/1/2010)

    ...that Celko character...

    Oh, no! His GOOGLE search bot is sure to find this thread now!

  • Question?

    I followed an advice when formatting hard disks for SQL servers, I should choose the allocation unit size to match the SQL extent size is 64KB (another reference: Disk Partition Alignment Best Practices for SQL Server, "...whether the cluster size is the NTFS default of 4,096 bytes or has been explicitly defined at 64 KB, which is a best practice for SQL Server.", http://msdn.microsoft.com/en-us/library/dd758814.aspx).

    Am I right in thinking that identity, GUID; primary and indexed columns, they all would be better in performance if their data type size matches multiples of 8 or so? Not that the size of identity and GUID could be changed.

    In general, unless the business requirements say otherwise, data type size should be in multiples of 8? Or this is another thread for a day?

    TIA.

  • I am amazed how many times that users have clearly identified unique identifiers which, during their day to day operation, do not always get set, change or perhaps do not get created from the outset in all conditions. Often this is a rare circumstance but it only needs to occur once to stuff both the DB and the application.

    A great example of this is when systems use email addresses as the ID. Unless you maintain your own domain, all it takes is a change in company (even a rebranding) or ISP and you might no longer have access to that email address.

    Sure we can blame the users, Business Analysts, application developers, database developers, user acceptance testers and such like but that doesn't resolve this type of issue. Change is very expensive in systems.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

Viewing 15 posts - 31 through 45 (of 70 total)

You must be logged in to reply to this topic. Login to reply