Overusing Identities

  • Gary Varga (3/1/2010)


    A great example of this is when systems use email addresses as the ID. Unless you maintain your own domain, all it takes is a change in company (even a rebranding) or ISP and you might no longer have access to that email address.

    Sure we can blame the users, Business Analysts, application developers, database developers, user acceptance testers and such like but that doesn't resolve this type of issue. Change is very expensive in systems.

    That kind of change isn't necessarily expensive if it is foreseen by the designers. An email address is a perfectly reasonable unique identifier provided you have the ability to change it. Changing an email address key is ultimately a tiny change if the system is designed for it.

  • David Portas (3/2/2010)


    Gary Varga (3/1/2010)


    A great example of this is when systems use email addresses as the ID. Unless you maintain your own domain, all it takes is a change in company (even a rebranding) or ISP and you might no longer have access to that email address.

    Sure we can blame the users, Business Analysts, application developers, database developers, user acceptance testers and such like but that doesn't resolve this type of issue. Change is very expensive in systems.

    That kind of change isn't necessarily expensive if it is foreseen by the designers. An email address is a perfectly reasonable unique identifier provided you have the ability to change it. Changing an email address key is ultimately a tiny change if the system is designed for it.

    I don't disagree with David's point, however, I guess part of what I was trying to say, and didn't explicitly, is that sometimes requirements are not fully captured and information is used as keys in the technical implementation when the consequences of that implementation on the operational use of the data is not fully understood.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Steve Jones - Editor (3/1/2010)


    I think Roger has put things well. I'm not saying you don't have a PK. I'm saying identities work well in that role.

    I agree that they can, just that it's not optimal, especially when one starts abstracting relations. The real atrocity is when one uses an identity, but fails to make the PK unique. One here is that they used identities throughout, and for example, a 'Poles' table has a Pole_Nbr attribute, but it's not unique. We now have duplicate Poles throughout the district 🙁 Yet another bit of magic to be levied in its resolution... 😎

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • An email address is a perfectly reasonable unique identifier provided you have the ability to change it.

    David,

    You have got to be kindding. Not everyone has an email address and some people share an email address . . . either of which breaks the "unique identifier" aspect of an email address. You might as well say that a First, Middle, & Last Name or a telephone number (even including the country code) form a "reasonable unique identifier"!

    Ralph D. Wilson II
    Development DBA

    "Give me 6 hours to chop down a tree and I will spend the first 4 sharpening the ax."
    A. Lincoln

  • Les,

    The real atrocity is when one uses an identity, but fails to make the PK unique.

    Excuse me but what DBMS allows you to create a P that is NO unique and why are you using it?

    In most DBS's that I have used, a PK, by definition, has to be unique although it may not be used as a clustered index.

    Also, by definition, an IDentity column is unique, whether it is used as a PK, a Unique Key, a Unique Index, or just as a column.

    Now, if you set up an index on your PoleID and create an index on it but do not enforce the uniqueness with the index, that is jsut really sloppy design and implementation. Even if there are identical PoleID's out there in the field, there has got to be some other way (such as a county designator along with the PoleID) to get a unique result. If so, then the index is simply not complete enough to be made unique.

    Ralph D. Wilson II
    Development DBA

    "Give me 6 hours to chop down a tree and I will spend the first 4 sharpening the ax."
    A. Lincoln

  • RalphWilson (3/2/2010)


    An email address is a perfectly reasonable unique identifier provided you have the ability to change it.

    David,

    You have got to be kindding. Not everyone has an email address and some people share an email address . . . either of which breaks the "unique identifier" aspect of an email address.

    Amazon successfully uses my email address as a unique identifier. So does MS Passport and so do other web sites I use. The email address uniquely identifies my account because the site won't allow two accounts with the same email address. Whether the address is shared or not is irrelevant. As far as Amazon's business process is concerned the email address still identifies exactly ONE account, which is all that the business process requires.

    You are correct that an email address does not uniquely identify a person but that isn't the issue because the business process does not require a person to be identified - it only needs to identify an account. If it did need to identify people then other attributes would be used as well - such as a name for example.

    Also, by definition, an IDentity column is unique, whether it is used as a PK, a Unique Key, a Unique Index, or just as a column.

    Not if you reseed the column or use IDENTITY_INSERT to insert values. If an IDENTITY column doesn't have a unique constraint or index then it may not be unique.

  • David Portas (3/2/2010)


    RalphWilson (3/2/2010)


    An email address is a perfectly reasonable unique identifier provided you have the ability to change it.

    David,

    You have got to be kindding. Not everyone has an email address and some people share an email address . . . either of which breaks the "unique identifier" aspect of an email address.

    Amazon successfully uses my email address as a unique identifier. So does MS Passport and so do other web sites I use. The email address uniquely identifies my account because the site won't allow two accounts with the same email address. Whether the address is shared or not is irrelevant. As far as Amazon's business process is concerned the email address still identifies exactly ONE account, which is all that the business process requires.

    You are correct that an email address does not uniquely identify a person but that isn't the issue because the business process does not require a person to be identified - it only needs to identify an account. If it did need to identify people then other attributes would be used as well - such as a name for example.

    Also, by definition, an IDentity column is unique, whether it is used as a PK, a Unique Key, a Unique Index, or just as a column.

    Not if you reseed the column or use IDENTITY_INSERT to insert values. If an IDENTITY column doesn't have a unique constraint or index then it may not be unique.

    Just an aside, but Amazon has mucked up my account and I actually have two accounts with the same email address, just different passwords. Evidently, they don't have a unique constraint on their database.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Les Cardwell (3/1/2010)


    Steve Jones - Editor (3/1/2010)


    ...

    A PK has 'meaning', and creates business dependencies throughout the domain...

    Codd is sputtering... :pinch:

    Actually, A PK (or Primary Key) is the primary means within the database for identifying the row in question. Its main use is in PK-FK linkages or in the distinction of a given row from another given row which may be especially important in situations where there is a need to delete or update a given row.

    PK's do not "create business depndencies throughout the domain" but, rather, are used to enforce data dependencies and database intergrity throughout the database.

    The persistent concept of the PK having to have meaning to the business side of the house is, IMHO, misplaced. The PK needs to have meanng to the database and, if it has meaning to the business side of the house, then that's "gravy".

    Now, don't get me wrong, there is a need for some sort of means for the business side of the house to identify specific rows. However, becuase the business side of the house is composed of humans, what the busines side considers to be "unique" (e.g. an Invoice Number or a Purchase Order Number or even a VIN) may or may not be unique but is very possible not immutable. In other words, even if it does happen to be unique it may wind up changing in a given row because of a correction due to a data entry error.

    That is why I tend to assign Identity columns as the PK and then make other columns either a composite index or an index and, if I need them to be unique, I may append the Identity column if I have any reason to suspect that a column may not remain unique.

    As for clustering by PK, which is often the default in SQL databases, using an Identity column will ensure that you do not have new rows being inserted between existing rows (unless you are forced to use GUIDs which are, essentially, random numbers). That means that you won't be having your table periodically resorting or else remaining fragmented.

    The definition of a clustered index, i.e. it is a physical sort of the table, means that anything other than a uniformly ascending Key is going to result in either the fragmentation or the restorting of the table.

    Ralph D. Wilson II
    Development DBA

    "Give me 6 hours to chop down a tree and I will spend the first 4 sharpening the ax."
    A. Lincoln

  • RalphWilson (3/2/2010)


    Excuse me but what DBMS allows you to create a P that is NO unique and why are you using it?

    In most DBS's that I have used, a PK, by definition, has to be unique although it may not be used as a clustered index.

    Also, by definition, an IDentity column is unique, whether it is used as a PK, a Unique Key, a Unique Index, or just as a column.

    Now, if you set up an index on your PoleID and create an index on it but do not enforce the uniqueness with the index, that is jsut really sloppy design and implementation. Even if there are identical PoleID's out there in the field, there has got to be some other way (such as a county designator along with the PoleID) to get a unique result. If so, then the index is simply not complete enough to be made unique.

    LOL... ya think? Most of the db's I come across are... well... crap 🙂

    In that case, they defined the PoleID as a Surrogate/PK/UID, and PoleNumber as just a varchar... not a PK, even though it is the 'natural' PK...and no constraints to make it so.

    Part of the point is that just because every table has a Surrogate/UID/PK, and joins are instantiated within the dbms (vs ad-hoc), does not make the db 'normal', and in fact, creates far more work than using 'natural' PK's from the db architecture through to the UI...adding significant amounts of complexity that are otherwise not needed to achieve 'normal'.

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • An email address is a mutable attribute. As such it would make a terrible primary key for a database.

    Just because you use a email address to login to Amazon or anywhere else, doesn't mean it's a primary key.

    Converting oxygen into carbon dioxide, since 1955.
  • I'm not sure I like email as a PK either. I might use it as a unique column, and display it, but since it is something that can easily change, it's not a change I want to propagate through my db when it changes. Or worse, when it's reused by the company.

    Email is a great attribute that you have no control over in your business process. Someone else assigns it, so it's not something I'd pick as the PK.

    Besides, in which circumstance does having email as a unique column and an identity as a PK cause a problem in your db?

  • RalphWilson (3/2/2010)


    Actually, A PK (or Primary Key) is the primary means within the database for identifying the row in question. Its main use is in PK-FK linkages or in the distinction of a given row from another given row which may be especially important in situations where there is a need to delete or update a given row.

    PK's do not "create business depndencies throughout the domain" but, rather, are used to enforce data dependencies and database intergrity throughout the database.

    The persistent concept of the PK having to have meaning to the business side of the house is, IMHO, misplaced. The PK needs to have meanng to the database and, if it has meaning to the business side of the house, then that's "gravy".

    In context, it was in reponse to Steve's posit...

    ...In the paper world, when someone generates a duplicate PO number it may or may not get corrected, and may or may not cause issues. In a database, it's a bigger issue as we try to build rules for how the computer manages data..."

    Tersely speaking, the relational model does not distinguish between primary keys and other candidate keys as expressed through relational calculus/relational algebra, except that one candidate key is chosen as the Primary Key.

    However, from a design perspective, using natural keys, superkeys, and abstract keys derived from natural keys can produce powerful design benefits. If so inclined, read the 'Case Study'... http://sites.google.com/a/whiteboxinc.com/www/anf.pdf ...which was an actual implementation of an ERP solution which allowed for stripping thousands of lines of transactional code through simplification and abstraction of PK/FK constructs. At least a good cure for insomnia :doze:

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • Steve Cullen (3/2/2010)


    An email address is a mutable attribute. As such it would make a terrible primary key for a database.

    Just because you use a email address to login to Amazon or anywhere else, doesn't mean it's a primary key.

    Les said it before I could. There is no difference between a primary key and a candidate key. They mean the same thing. The concept of designating a primary key at all is an entirely arbitrary distinction of no logical or practical significance.

    As I said in my first post, what really matters is that you have whatever candidate keys are required for the integrity of your data. On that basis an email address makes a fine candidate key if your business requires it to be unique (even though Amazon apparently doesn't!).

  • Les Cardwell (3/2/2010)


    In context, it was in reponse to Steve's posit...

    ...In the paper world, when someone generates a duplicate PO number it may or may not get corrected, and may or may not cause issues. In a database, it's a bigger issue as we try to build rules for how the computer manages data..."

    Tersely speaking, the relational model does not distinguish between primary keys and other candidate keys as expressed through relational calculus/relational algebra, except that one candidate key is chosen as the Primary Key.

    However, from a design perspective, using natural keys, superkeys, and abstract keys derived from natural keys can produce powerful design benefits. If so inclined, read the 'Case Study'... http://sites.google.com/a/whiteboxinc.com/www/anf.pdf ...which was an actual implementation of an ERP solution which allowed for stripping thousands of lines of transactional code through simplification and abstraction of PK/FK constructs. At least a good cure for insomnia :doze:

    Again, as long as you can guarantee that whatever you choose to enforce relationships doesn't keep changing at the drop of a hat, sure - you can get some benefit from natural keys. That said -continually having to maintain foreign keys because your natural key keeps changing faster that a ceiling fan can spin means you have a design problem.

    Depending on how much data you have, having your primary keys change could become a HUGE nightmare (at a previous gig we spent most of a year ripping out a government-issued physician identifier touted to "never change", spread over 17Million patient records, 20 or so years of care, etc...). If you're a fan of Murphy - the data guaranteed to NEVER change...will.

    Just be careful with the ideal data model, lest reality come crashing your house down.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • However, from a design perspective, using natural keys, superkeys, and abstract keys derived from natural keys can produce powerful design benefits.

    This is true, from a design perspective. However, the question arises, is this still true from an implementation perspective.

    For instance, I encountered a very well designed database that had made extensive use of PK-FK relationships to maintain data integrity. It had also made very extensive use of GUIDs for the PK's in manyy instances and, in others, the designer had chosen to use "Natural Keys" such as the 3 line address, city, state/province (fully spelled out), 9 characters for the Zip/Postal code, and the GUID for the country . . . resulting in a total of something like 275 characters for the PK. In the data model and database diagrams it looked great and, in the test environment, it worked slicker than goose grease on a plastic floor. However, there were some, uh, _issues_ when it went into production.

    Years ago, I learned the hard way that "smart keys" (as we used to call "natural keys" 😉 tend to look great in theory but become less "smart" in practice. However, I also learned that those who favor Natural Keys will neither sway to their way of thinking nor be swayed away from their way of thinking by those who do not . . . and vice versa. 😉

    However, based on the fact that I have at least 9 different email addresses and at least 4 different accounts at each of EBay, Amazon, and a couple of other sites based on those email addresses. I would contend that your theory that an email address is a good Natural Key candidate for a PK on enay of those sites may have a flaw in it. If you want to find out my current address, which of my accounts would you bring up? If you want to track my purchases, which of my email addresses would you use? (Whic, by the way, is part of why those accounts exist, although another part is that at various times one or another of the accounts got locked out because of site issues. 😉

    I end to favor the Identity columns for PK's and other columns or composites for Unique keys/indexes. This has proven useful in my life. Some favor Natural Keys as PKs and abhor Identity columns unless absolutely necessary and that has usually proven useful in their lives. As a UK friend of mine puts it, "Horses for courses." 😉

    Anyone want discuss the allowance of NULLs? 😉

    or, to quote from "War Games":

    "How about a nice game of chess?"

    Ralph D. Wilson II
    Development DBA

    "Give me 6 hours to chop down a tree and I will spend the first 4 sharpening the ax."
    A. Lincoln

Viewing 15 posts - 46 through 60 (of 70 total)

You must be logged in to reply to this topic. Login to reply