Give Up on Natural Primary Keys

  • nova wrote:

    roger.plowman wrote:

    As a developer I vehemently disagree. You need a primary key to be mechanically useful. That means it must be immutable and have other desired characteristics, like uniqueness and small size. Thus surrogate keys comprised of int identities (or bigint if necessary) are both convenient and essential. Natural keys are by their nature seldom immutable or unique, for all they're supposed to be.

    But are you saying you would allow duplicate login names for your web site? I suspect your customers might have some problems with that. For obvious practical (and legal) reasons identifiers that work in the real world are essential for effective use of a system. The way we ensure our identifiers are reliable in a database is by putting uniqueness constraints on stable, but potentially mutable values such as login names and account numbers. I can't imagine responsible developers, say at a bank or anywhere else for that matter, allowing account numbers to be duplicated just for their own convenience. In the present regulatory environment, GDPR and its ilk, that kind of laxity could probably end with company execs taking the rap in court.

    That's what alternate keys are for. 🙂 You can have multiple unique keys on a database that are NOT the primary key. You can even have a clustered index (the table's physical order) not be a primary key although I think PK clustered is almost always the way to go.

    Primary keys need to be small. This reduces the size of any index containing them and any FK (which should generally also be indexed) as well--which always increases performance (small = less data to read). Primary keys should be immutable, this is a nice bit of stability that keeps A) the database from doing excess work and B) removes one potential failure point as they're never updated.

    An ever increasing physical key (eg an identity or similar functionality) helps reduce table fragmentation (admittedly it does nothing for alternate key index fragmentation but eh, better reduced fragmentation, right?)

    Natural keys, from a mechanical standpoint, have none of these advantages. Yes, you need to keep them unique, but that doesn't mean using them as primary keys.

  • nova wrote:

    You already said, and I agree, that login names can change. Similarly customer account numbers, so those are two relatively common cases of keys that may need to be updated. There are plenty of other examples around. As evidenced by some of the comments in this thread, the expectation that other keys might need to change is one of the major reasons why people use surrogate keys at all. If keys only rarely needed updating then there would be fewer reasons to add a surrogate key to a table.

    never a PK

    If I update a surrogate primary key does it stop being a PK? How do I make it a PK again? I'm not convinced that it matters in any practical sense.

    The point of using a surrogate key is so that it's never updated. 🙂

     

  • Roger, I agree with nearly all you have said but when I said before that natural keys could be unique but mutable you said you "vehemently" disagreed with me! Perhaps we are now in vehement agreement.

    The one part of your reply I would disagree with is your suggestion that a primary key should be treated differently because of the need to keep foreign keys small. Not every primary key is referenced by a foreign key and not every foreign key references a primary key, so I believe the best answer is "it depends". As for keeping indexes small, you apparently agree that the natural key should be indexed anyway so creating another index on another column is not going to make it smaller. Perhaps you are thinking of clustered indexes but the choice of clustered index is another thing that needs to be determined on a case by case basis. It's unwise to assume that a primary key should always be clustered.

    The point of using a surrogate key is so that it's never updated

    For sure surrogate keys are usually intended to be stable but it's not the end of the world if you do update them. I've done that a few times and I know I'm not the only one.

    CORRECTION: I first typed immutable in the opening sentence when I meant mutable! Sorry about that.

    • This reply was modified 5 years, 4 months ago by  nova.
    • This reply was modified 5 years, 4 months ago by  nova.
  • nova wrote:

    Roger, I agree with nearly all you have said but when I said before that natural keys could be unique but mutable you said you "vehemently" disagreed with me! Perhaps we are now in vehement agreement.

    The one part of your reply I would disagree with is your suggestion that a primary key should be treated differently because of the need to keep foreign keys small. Not every primary key is referenced by a foreign key and not every foreign key references a primary key, so I believe the best answer is "it depends". As for keeping indexes small, you apparently agree that the natural key should be indexed anyway so creating another index on another column is not going to make it smaller. Perhaps you are thinking of clustered indexes but the choice of clustered index is another thing that needs to be determined on a case by case basis. It's unwise to assume that a primary key should always be clustered.

    The point of using a surrogate key is so that it's never updated

    For sure surrogate keys are usually intended to be stable but it's not the end of the world if you do update them. I've done that a few times and I know I'm not the only one.

    CORRECTION: I first typed immutable in the opening sentence when I meant mutable! Sorry about that.

    Ok, I think I see three points where we do, in fact, vehemently disagree. The first is in treating primary keys as special. They are indeed special, they are indeed (supposed to be) the only keys used as foreign keys in other tables by design. Thus, mechanically, they need to be as small as possible for space and performance reasons.

    Second, the point of surrogate keys is they are immutable. That is the point of them. There is no point in updating them because by design they never change. In fact if you use identities for PKs it's pretty hard to update them--and pretty pointless. If they do change they are not surrogate keys, they are data. And data makes bad primary (and thus foreign) keys. In fact, data makes  poor keys in general, the only use for them is in fact to insure their uniqueness.

    And third, no, I'm not thinking of clustered keys. 🙂 Primary keys are used in every foreign key index pointing to it. Alternate keys are only generated in the primary table, not the ones who have a foreign key to the primary table.

    That means you can burn a lot more room on all the foreign key indexes than you would a single alternate key index.

  • Using integer identity primary keys with a consistent naming convention (e.g. TableNameId) has a lot of advantages in developing and supporting a database. You will know the column name and data type of the primary key without even having to look at the table. You need to know very little about what the database stores to be able to contribute constructively to the development of it. If you have another column that could have been a natural primary key, a unique index or constraint can be created on it. I see very little advantage in using natural primary keys.

  • roger.plowman wrote:

    If they do change they are not surrogate keys

    Suppose I have to merge data from two systems which happen to have conflicting surrogate primary key values even though they represent different subsets of data. I decide the easiest way to achieve this is to do a one-time update of the surrogate keys in the target system before I do the merge. You want me to stop calling those keys "surrogate" from that point forward? I could live with that I suppose. But given that the keys are performing exactly the same function they did before that just seems like a fairly inconsequential point of terminology rather than any real database design issue. I'm more interested in practicalities than in discussions about the meaning of words.

    Microsoft has this to say about foreign keys:

    "A foreign key constraint does not have to be linked only to a primary key constraint in another table; it can also be defined to reference the columns of a UNIQUE constraint in another table."

    https://docs.microsoft.com/en-us/sql/relational-databases/tables/create-foreign-key-relationships?view=sql-server-2017

    It's prudent to check which key is actually being referenced rather than just assume it's the one someone defined as primary.

    Obviously we've strayed some way from the topic of GDPR but some interesting points nevertheless.

  • nova wrote:

    roger.plowman wrote:

    If they do change they are not surrogate keys

    Suppose I have to merge data from two systems which happen to have conflicting surrogate primary key values even though they represent different subsets of data. I decide the easiest way to achieve this is to do a one-time update of the surrogate keys in the target system before I do the merge. You want me to stop calling those keys "surrogate" from that point forward? I could live with that I suppose. But given that the keys are performing exactly the same function they did before that just seems like a fairly inconsequential point of terminology rather than any real database design issue. I'm more interested in practicalities than in discussions about the meaning of words.

    Microsoft has this to say about foreign keys:

    "A foreign key constraint does not have to be linked only to a primary key constraint in another table; it can also be defined to reference the columns of a UNIQUE constraint in another table."

    https://docs.microsoft.com/en-us/sql/relational-databases/tables/create-foreign-key-relationships?view=sql-server-2017

    It's prudent to check which key is actually being referenced rather than just assume it's the one someone defined as primary.

    Obviously we've strayed some way from the topic of GDPR but some interesting points nevertheless.

    In the merging scenario these would have to be independent systems that didn't know of each others existence. In which case your master system needs its own independent PK so the link to the old systems could be maintained--meaning you change neither system's surrogate keys. 🙂

    The only scenario I can envision where surrogates would be changed is when you're replacing an old system which uses identities (or equivalent) where deletions have occurred and you want to "condense" the keys to recover the deleted ones.

    But in that case you aren't changing the originals, you are translating the old keys into the new ones rather than changing the old ones. As it happens I'm in the process of doing exactly that, but it involves an intermediate staging tables to translate the old keys into new ones. At no point are the old keys touched. Thus the old keys are still immutable.

    As to your second point, ok, while it's mechanically possible according to MS it violates relational design I should think. As such it's bad design and definitely needs to be avoided.

     

  • nova wrote:

    roger.plowman wrote:

    If they do change they are not surrogate keys

    Suppose I have to merge data from two systems which happen to have conflicting surrogate primary key values even though they represent different subsets of data. I decide the easiest way to achieve this is to do a one-time update of the surrogate keys in the target system before I do the merge. You want me to stop calling those keys "surrogate" from that point forward? I could live with that I suppose. But given that the keys are performing exactly the same function they did before that just seems like a fairly inconsequential point of terminology rather than any real database design issue. I'm more interested in practicalities than in discussions about the meaning of words.

    Microsoft has this to say about foreign keys:

    "A foreign key constraint does not have to be linked only to a primary key constraint in another table; it can also be defined to reference the columns of a UNIQUE constraint in another table."

    https://docs.microsoft.com/en-us/sql/relational-databases/tables/create-foreign-key-relationships?view=sql-server-2017

    It's prudent to check which key is actually being referenced rather than just assume it's the one someone defined as primary.

    Obviously we've strayed some way from the topic of GDPR but some interesting points nevertheless.

    I've had to merge partial data from one database to another, the PK's of the source rows have to change on the target as they are being inserted into the target. But there are effective methods of doing this, you just need to store the old pk and the new pk in temporary tables (using the OUTPUT clause). Then when you insert the child rows (from the tables with the FK's) you just join to the temporary table you've created and insert the new PK (in the FK column). It's not difficult to do.

  • My example of merging data was a scenario where the old system was being discontinued and the target system had to do the job of both. That was why it made sense to update the keys. Sure, I could have re-inserted all the data but the end result would have been the same and doing the update was simpler and more efficient.

    A foreign key referencing any candidate key is not a violation of "relational design" and there are reasons to do it sometimes. In SQL Server for example you can take advantage of a non-clustered index with included columns - which isn't possible in the case of a PK even if the PK is non-clustered. In other DBMSs there are other benefits.

     

    • This reply was modified 5 years, 4 months ago by  nova.
  • Interesting discussion.

    My main point that started this is that natural keys become problematic for the GDPR and other regulation because of the hassle of decoupling the data from a human at some point. Whether this is in masking/obfuscating for dev/test or because of requests to remove info. As I age, I find less reasons where natural keys are superior to surrogate keys, and many reasons they are inferior. While the use of PatientID might link to a patient being treated, if I remove that linkage (be xxxing out the name/etc.), the id no longer identifies a human, but potentially, I can still use the data and keep it around for analysis.

  • Hi Steve

    Natural keys serve a very different purpose to surrogates so it's pretty meaningless to say one is "superior" to the other. You will need both. If my patients are identified by care providers or insurers with a hospital or health-service issued key (in many health systems it's a contractual and/or regulatory requirement to do so) then a database-assigned surrogate is no substitute for that. It would be remiss not to enforce uniqueness on the natural key or to fail to meet your GDPR obligations in relation to it. Natural keys are essential to good data management and data protection. Effective use of natural keys is far more critical to satisfying GDPR than effective use of surrogate keys.

    • This reply was modified 5 years, 4 months ago by  nova.
  • nova wrote:

    Hi Steve

    Natural keys serve a very different purpose to surrogates so it's pretty meaningless to say one is "superior" to the other. You will need both. If my patients are identified by care providers or insurers with a hospital or health-service issued key (in many health systems it's a contractual and/or regulatory requirement to do so) then a database-assigned surrogate is no substitute for that. It would be remiss not to enforce uniqueness on the natural key or to fail to meet your GDPR obligations in relation to it. Natural keys are essential to good data management and data protection. Effective use of natural keys is far more critical to satisfying GDPR than effective use of surrogate keys.

    I think perhaps the disconnect in this discussion hinges on your belief that natural keys are necessary as keys. I suspect you are viewing things from a logical model rather than the physical model I prefer. From a mechanical standpoint surrogate keys are superior to natural ones in critical ways.

    When viewed from the requirement that third parties require natural keys and that those natural keys must be unique (globally across disparate systems) then natural keys are required in a global sense.

    But in the global sense natural keys are more akin to data than they are mechanically useful for relational linking of tables within a database. Therefore absolutely place a unique constraint, use them for alternate keys but not as primary keys.

     

  • A key is a key whether it is implemented using a PRIMARY KEY constraint, UNIQUE constraint or other means. Referential integrity is something else because not all keys are referenced by a FOREIGN KEY constraint - a table can have several keys but typically only one of them will be referenced by foreign key(s).

    GDPR is not concerned with how keys or referential integrity are implemented but it is very much concerned with identifiers, and in a database the keys (or more accurately the superkeys) of a table are usually the most important identifiers.

    Again, the idea that surrogate keys are "superior" to natural keys suggests a very partial understanding of those concepts. They are not comparable because they serve different purposes and one is not a replacement for the other. You might as well say a hammer is superior to a screwdriver - but would you use a hammer to turn a screw?

    • This reply was modified 5 years, 4 months ago by  nova.
  • nova wrote:

    A key is a key whether it is implemented using a PRIMARY KEY constraint, UNIQUE constraint or other means. Referential integrity is something else because not all keys are referenced by a FOREIGN KEY constraint - a table can have several keys but typically only one of them will be referenced by foreign key(s).

    GDPR is not concerned with how keys or referential integrity are implemented but it is very much concerned with identifiers, and in a database the keys (or more accurately the superkeys) of a table are usually the most important identifiers.

    Again, the idea that surrogate keys are "superior" to natural keys suggests a very partial understanding of those concepts. They are not comparable because they serve different purposes and one is not a replacement for the other. You might as well say a hammer is superior to a screwdriver - but would you use a hammer to turn a screw?

    No, but...

    The purposes of a primary key have nothing to do with data, they have to do with the mechanical linkage of referential integrity, and for that purpose surrogate keys are absolutely superior. In terms of GDPR they are also somewhat superior in that they can be used to shield PII by isolating it in one highly protected table as opposed to scattering PII across myriad indexes.

    It is always easier to defend a single well defined point than many poorly understood ones. Increasing the attack surface benefits only the attacker, after all.

  • This reminds me of discussions on "NULLs".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 76 through 90 (of 108 total)

You must be logged in to reply to this topic. Login to reply