April 29, 2008 at 7:11 pm
Matt Miller (4/29/2008)
The primary characteristic for a primary key (in addition to uniqueness) is immutability, specifically because of its "deisgnated use" in table relations. And that's where I think the natural keys tend to fall down rather quickly, since they are "naturally" prone to changing. Rarely do natural keys tend to be as "thin" or as immutable as arbitrary system generated keys.
I will agree that a key value should not be prone to being changed often, not quite as often as "ordinary" fields. However, I think immutability is a far cry from being "the primary characteristic" of any key, primary or otherwise. If the updates are cascaded, your relationships will still work. You've just lost a little efficiency is all.
Unless you have hard-coded key values in scripts and other code. Then you have a problem but it's mostly of your own making.
Tomm Carr
--
Version Normal Form -- http://groups.google.com/group/vrdbms
April 30, 2008 at 4:37 am
Tomm Carr (4/29/2008)
Matt Miller (4/29/2008)
The primary characteristic for a primary key (in addition to uniqueness) is immutability, specifically because of its "deisgnated use" in table relations. And that's where I think the natural keys tend to fall down rather quickly, since they are "naturally" prone to changing. Rarely do natural keys tend to be as "thin" or as immutable as arbitrary system generated keys.
I will agree that a key value should not be prone to being changed often, not quite as often as "ordinary" fields. However, I think immutability is a far cry from being "the primary characteristic" of any key, primary or otherwise. If the updates are cascaded, your relationships will still work. You've just lost a little efficiency is all.
Unless you have hard-coded key values in scripts and other code. Then you have a problem but it's mostly of your own making.
Surely if you're Primark key is not stable then it's not a Primary Key in the strictest sence of the term? In database design Stability, along with being Unique, applying to all rows and not being null is the requirements of a Primary Key?
Sorry for being pedantic but just thought I'd point it out.
April 30, 2008 at 7:10 am
Because of privacy issues regarding the SSN, I would never recommend using it as a PK. When used as a PK, it will exist as in FK in multiple child tables. When not so used, it can be confined to a single table.
April 30, 2008 at 7:14 am
RonKyle (4/30/2008)
Because of privacy issues regarding the SSN, I would never recommend using it as a PK. When used as a PK, it will exist as in FK in multiple child tables. When not so used, it can be confined to a single table.
This is a really good point, i am not sure what the privacy laws are in the US, but in the country I used to work it was law that we could not use our equivalent of SSN as an identifer on any databases.
April 30, 2008 at 7:23 am
Tomm Carr (4/29/2008)
Matt Miller (4/29/2008)
The primary characteristic for a primary key (in addition to uniqueness) is immutability, specifically because of its "deisgnated use" in table relations. And that's where I think the natural keys tend to fall down rather quickly, since they are "naturally" prone to changing. Rarely do natural keys tend to be as "thin" or as immutable as arbitrary system generated keys.
I will agree that a key value should not be prone to being changed often, not quite as often as "ordinary" fields. However, I think immutability is a far cry from being "the primary characteristic" of any key, primary or otherwise. If the updates are cascaded, your relationships will still work. You've just lost a little efficiency is all.
Cascading updates (whether using the built-in functionality or something home-grown) of a Primary key to all of the foreign key relations in my experience is not at all a "little efficiency loss". It's a huge drain on resources, so I avoid them like the plague.
The history "table" (actually - tables, manually partitioned) we would have to update looking for natural keys that changed, involves sifting through 400M records and redoing certain "cold storage backups", so even a handful isn't a happy thing in my book.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 30, 2008 at 12:53 pm
Tomm Carr (4/29/2008)
GSquared (4/29/2008)
The only reason I mention SSN as a problem is because SO many people think "oh, that's perfect!", and use it as a primary or unique key, and then cause the problems I outlined.The only reason I used it in the first place was to illustrate the point that there could be non-technical reasons for choosing one key over another to make the PK. The fact that there may also be technical reasons is beside the point really. Come up with a better example.
But your argument has brought up two points worth commenting on.
1) Identifying keys can be a lot of trouble. Some designers think that the use of surrogate keys gets you around that trouble. It does not.
2) Someone may have, at one time or another, chosen an inappropriate field for use as the PK. This can cause any number of problems which may not be easily fixed. But this means that the decision to use a natural key as the PK must be made with care; it does not mean that it can't or shouldn't be done at all. Bridges collapse, but we haven't given up building bridges.
Another point specific to the use of SSNs. I agree that there are many reasons it should not be used as the PK. However, it is still a key and should be defined as such. (As the SSN is a US-only construct, let's assume that the context is US employees only.) Although not a PK, the payroll department, for one, may prefer to use it as the FK of choice for the relationships in its database. After all, the SSN is central to most of the information it processes -- including data sent to and received from the IRS. The IRS doesn't know EmployeeID from squat -- it wants SSNs. If, as you maintain, the SSA should give out the same SSN to two different people, and if your company should be so lucky as to hire both of those people, then the fact that Payroll is going to be working with two employees with the same SSN is something they are going to want to know about -- right away! If this situation went unnoticed for a while, there is no end of trouble they could be getting themselves into. So, while the SSN should never be defined as the Primary Key, it should always be defined as a key (not null, unique). The accuracy of your paycheck may depend on it. 🙂
I think you may have missed the point of what I wrote.
Unless the database is specifically about Social Security, or the table in the database is specifically about Social Security, I would not define it as a key. I would keep it, hopefully encrypted, in a separate table, with an employee ID as the FK to the employees table.
Why would HR/Payroll/Accounting need to define their own FKs? Even if, under some odd circumstance, they do, I can't see that as being a good idea. For all the reasons I've already written.
Plus, to add to it, with regard to "cascading updates", does that mean you're going to lose all the SS payments history of an employee and end up with incomplete data on their W-2, if the employee has to get a new SSN? It sure sounds like you would be happy with that result, from what you've written so far.
Do SSNs ever change? Yep. Happens to some women when they get married (I've seen this). Happens to some people who are victims of identity theft. Happens to some people for various other legal reasons. The Social Security Administration deals with this very regularly.
Data that should be kept encrypted, which can change, which can be duplicate, which doesn't actually describe the entity it is assigned to (it's a characteristic, not a defining feature, unless we're specifically talking about data regarding taxes and Social Security), which can be null, sounds to me like a really bad candidate for a key. Almost, but not quite as bad as using first names as a key.
(Actually, there was a problem in some Canadian database a while back, whereby it couldn't deal with birth certificates for two people with the same name born on the same date, in two different cities. I remember reading about something like that a couple of years ago. Again, keys should really be well-thought-out.)
Does any of this mean I think keys need to be perfect? Of course not. But they should still be good enough to avoid known, obvious, problems, which can have a direct and significant legal and financial impact.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 30, 2008 at 1:55 pm
I don't who RonKyle is arguing with, but it is not me. I never proposed using SSN (or any country's equivalent) as a Primary Key. In fact, I first mentioned it to show why it should not be used as the PK, which is the same reason Ron gave.
However, it is a key. Sorry if you don't like that. It just is.
The fact that the SSN (or equivalent) is sensitive information doesn't matter in the slightest. Data sensitivity matters only in deciding to make it the Primary key.
I then mentioned the Payroll department to show that data sensitivity is not always the determining factor. Payroll may well use SSN as a foreign key despite its sensitive nature and possibly even the PK in some of its tables. Pretty much everything the Payroll department works with is sensitive and secure so the sensitive nature of the SSN doesn't make it stand out in this kind of environment. There may even be legal requirements that all payroll transactions be tied to an SSN.
When I owned a business, I had to give my EIN (an SSN assigned to a company) to all my vendors. When I sold product to another company wholesale, I had to get that company's EIN. That info went into QuickBooks, because if the State ever audited my sales tax reciepts, I had better be able to product an EIN for every wholesale (no sales tax charged) exchange I made. I didn't design the database used by QuickBooks, but there was a relationship there if I needed it.
What you as the data modeler/database developer do about the fact that an SSN is a key is up to you. You can ignore it -- generally not a good idea but you could get away with it for years. Or you can properly define it (not null, unique). You then restrict access to it to only those who need it -- Payroll being one.
There also seems to be this idea floating about that if one finds an exception to a rule, then the rule is no good.
Bloody hell!
(I threw that in to relate to the Brits in the crowd. How did I do?)
Primary keys should be stable. I have agreed with that. But it is a fact of life that any datum is subject to change. If you have two candidate keys that are in all other ways equally qualified, if you can determine that one is more stable than the other, make that one the PK. But stable does not mean "fixed, frozen, unchanging till the end of time."
This is one reason Identity fields are used so much as a surrogate key field. The value, once entered into the table, can never change, the system simply will not let you. So is this a solution to the "cascading update of 400M records is too resource intensive" problem? Well, yeah. But it has problems of its own (replication immediately comes to mind). So adopting the habit of making all PKs an Identity field can create problems of its own.
So all rules concerning keys are general rules -- all rules concerning anything are general rules. There are exceptions and there may even be other rules that come into play for specific kinds of exceptions. There is no rule that doesn't have exceptions -- except possibly for the rule that all rules have exceptions, which has no known exceptions. 😀
So, is there a list of rules for choosing the Primary Key of a table?
Here's one:
Properly identify all keys of the table. (This shouldn't really be too difficult. If you've managed to get your database into 3nf, you've already done this.)
Prefer single keys over composite keys.
Prefer smaller keys over larger keys.
Prefer more stable keys over less stable keys.
Prefer keys containing less sensitive information.
Prefer natural keys over surrogate keys.
I expect some argument over the last rule. I certainly get it here where I work. Don't get the idea I don't use surrogate keys. Probably about 25% of the tables I design have surrogate PKs. But I use them only after having shown that they are the best for that particular table.
I invite submission of other rules you use that I have missed. However, there is one rule that is not listed above:
Prefer numeric keys over text keys.
I have listened to many arguments in support of that rule. I remain unconvinced.
Tomm Carr
--
Version Normal Form -- http://groups.google.com/group/vrdbms
April 30, 2008 at 2:23 pm
There also seems to be this idea floating about that if one finds an exception to a rule, then the rule is no good.
Bloody hell!
The idea that rules with exceptions aren't rules is the basic, primary, founding principle of all valid science and mathematics.
Since I operate as a mathematician (with an emphasis on set-based logic), in an engineering role (engineering being the application of science), I follow that philosophy.
Politicians and lawyers generally consider rules with exceptions to be a good thing, because that's what keeps them employed. Since I am neither of those things, I don't follow that philosophy.
I'm sorry that I'm unable to clearly convey my point about the weakness of using SSNs as a key in any capacity except for the tracking of very, very specific data, primarily that intended for W-2 forms. I guess I just can't write it effectively.
We're just going to have to agree to disagree on this one.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 30, 2008 at 2:35 pm
I don't who RonKyle is arguing with, but it is not me. I never proposed using SSN (or any country's equivalent) as a Primary Key. In fact, I first mentioned it to show why it should not be used as the PK, which is the same reason Ron gave.
I'm not arguing with anyone, and don't, at least not on this site. We just exchange ideas and experiences.
That said, I reread the thread, and you are right, you mentioned it first with the reason. I don't know how I missed that on my first read, but had I picked that up, I would not have contributed a point that you already made.
April 30, 2008 at 2:37 pm
Prefer numeric keys over text keys.
I have listened to many arguments in support of that rule. I remain unconvinced.
The only reason I can think of for this is that numeric keys generally require less storage space and memory. I can store over 18-pentillion options (bigint) in 8 bytes per row, but the same number of options in text would be 20 bytes per row. (Int instead of bigint gives you a little over 4-billion options for 4 bytes per row, compared to 11 bytes per row for character data of the same magnitude.)
If that's a PK, with FK references, that's an advantage of 12 bytes per row per table. If it goes into non-clustered indexes, that's an advantage of 12 bytes per row per index (minimum).
On a 1-million row row primary table with 10-million rows of various sub-tables, and various indexes (including those on sub-tables), that's over a Gig of data that you don't have to store on disk or load into RAM, or feed through IO pathways.
In a big database, it could easily be a significant performance difference.
On the other hand, if you have a 20 byte text column which is a valid primary key, and you add an int column to the table as an unnecessary surrogate key, you just added half a Gig of junk to that same database. It very definitely goes both ways.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 30, 2008 at 5:44 pm
GSquared (4/30/2008)
In a big database, it could easily be a significant performance difference.
In a very bit database... 😉
However, such databases do exist and, truly, a numeric key would generally be best for the tables that grow to really large sizes. Generally, such multi-million row tables don't have good natural keys to begin with. But even in large databases, most of the tables are not such behemoths. Most tables are much smaller and/or much more stable and have perfectly good natural keys.
Let me show you a random row selected from one of the tables in one of the databases I work with.
268852 1 4 28 9 3 2008-04-22
I'm not worried about exposing any sensitive information here. I chose a table that didn't contain sensitive information in the first place but even if I had, so what? Without column names, this row of data is just a row of random numbers. Sure, the first value can be guessed to be the primary key of the row, the last value is probably a CreateDate or LastModifiedDate or something similar (I've removed the time portion just to save space). If you guessed the other five fields are foreign key values, give yourself a lollipop.
Suppose the row looked something like this instead:
268852 'USA' 'LOCK' 28 'RFQ' 'U' 2008-04-22
Now we have something that could actually make a lot of sense to a user. We could almost display this row "as is" and the user would know what he's looking at. So we've gone from a 6-table join to a 2-table join in order to show meaningful information to the user.
In the example above, the four tables I showed with a natural key have 3, 17, 12 and 9 rows respectively. When I got here a year ago, the table with 17 rows only had 15. I made up the values I show above, but the actual natural key values are comparable and looks very close to what it looks like after all the joining. Someone please explain why Identity keys are preferable in this situation to natural keys. What have we gained? And you will find this situation even in Very Large Databases.
Tomm Carr
--
Version Normal Form -- http://groups.google.com/group/vrdbms
May 1, 2008 at 6:51 am
In the example above, the four tables I showed with a natural key have 3, 17, 12 and 9 rows respectively. When I got here a year ago, the table with 17 rows only had 15. I made up the values I show above, but the actual natural key values are comparable and looks very close to what it looks like after all the joining. Someone please explain why Identity keys are preferable in this situation to natural keys. What have we gained? And you will find this situation even in Very Large Databases.
I agree with this completely. To give an even more basic example, I have seen many instances of tables containing the US states to have a number as their PK, so Ohio is say, 33, an unrecognizable number, instead of OH. In this case an int is used in place of a char(2), and that int will always require a join to make the information meaningful, whereas the char(2) won't. I have never understood the reason for this.
May 1, 2008 at 6:58 am
RonKyle (5/1/2008)
In the example above, the four tables I showed with a natural key have 3, 17, 12 and 9 rows respectively. When I got here a year ago, the table with 17 rows only had 15. I made up the values I show above, but the actual natural key values are comparable and looks very close to what it looks like after all the joining. Someone please explain why Identity keys are preferable in this situation to natural keys. What have we gained? And you will find this situation even in Very Large Databases.
I agree with this completely. To give an even more basic example, I have seen many instances of tables containing the US states to have a number as their PK, so Ohio is say, 33, an unrecognizable number, instead of OH. In this case an int is used in place of a char(2), and that int will always require a join to make the information meaningful, whereas the char(2) won't. I have never understood the reason for this.
Gotta agree with you on the state abbreviations. Char(2) is the way I go on that. Use that as the PK of the table of US States, too. Have a table of Zip codes, store them as a char(5) PK (with a bunch of other columns of various data about them).
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 1, 2008 at 7:09 am
RonKyle (5/1/2008)
In the example above, the four tables I showed with a natural key have 3, 17, 12 and 9 rows respectively. When I got here a year ago, the table with 17 rows only had 15. I made up the values I show above, but the actual natural key values are comparable and looks very close to what it looks like after all the joining. Someone please explain why Identity keys are preferable in this situation to natural keys. What have we gained? And you will find this situation even in Very Large Databases.
I agree with this completely. To give an even more basic example, I have seen many instances of tables containing the US states to have a number as their PK, so Ohio is say, 33, an unrecognizable number, instead of OH. In this case an int is used in place of a char(2), and that int will always require a join to make the information meaningful, whereas the char(2) won't. I have never understood the reason for this.
The only reason i can think of , is if they did not want to limit their database to the US, and in the future include states from other countries where a char(2) code will not be enough
May 1, 2008 at 7:56 am
you might want use a tinyint for the PK on a states table if it was being referred to from other large tables.
one of the dbs I have to work with has a few tables over 2 billion rows, each with a couple different FKs that are tinyints. if these FKs were character types instead of tinyint it would have a significant impact on the size on disk, among other things. total rows in the entire db is over 14 billion.
and what do these large tables use for a pk? bigint. 🙂
---------------------------------------
elsasoft.org
Viewing 15 posts - 16 through 30 (of 32 total)
You must be logged in to reply to this topic. Login to reply