Intelligent vs. Surrogate keys

  • In database design, how do most people handle the updating of primary keys?

    I have read articles that say to use a surrogate key. If you do that what do

    you use? Identity fields? GUIDs? Both have there drawbacks. I have normally

    created all my databases using an intelligent key, which has sometimes

    caused problems when a user wants to change the key and it is related to a

    child table. I understand there are cascading updates but that doesn't seem

    real clean either. I am looking for some advice on what other people think

    is the best way or pros and cons to determine primary keys.

  • The two MAIN reasons I like identity columns is that you can put a clustered index on them and your inserts will always fall to the bottom level of the data pages and (like you pointed out) you don't have to worry about application changes where the users/app requires a new PK. As has been pointed out to me by Celko though, this is not in line with the relational model as the model says to always use some type of natural key. Interested to hear everyone else's thoughts on this.

    Ray Higdon MCSE, MCDBA, CCNA

    ------------

    Ray Higdon MCSE, MCDBA, CCNA

  • There 's al mass of info regarding this subject on the net.

    I prefer surrogate keys because they make it mandatory for the db-user to understand the datamodel, so e.g. symantic errors can be prevented because someone has to think about what he/she wants to query.

    As a dba, I'm happy is the project-da is doing his analysis to at least 3NF (normalization). Remember a key can be composed with multiple columns. On our older systems, we have tables with up to 8 columns composing the primary key. Surrogate keys then very much ease joining.

    Surrogate keys can also raise performance. take a look at

    - http://www.sql-server-performance.com/tuning_joins.asp

    - http://r937.com/20020620.html

    - http://dbforums.com/archives/t281873.html

    and last but not least :

    - http://www.sqlservercentral.com/columnists/bkelley/normalization.asp

    - http://www.sqlservercentral.com/columnists/chedgate/clusterthatindex.asp

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Hi Ray,

    quote:


    The two MAIN reasons I like identity columns is that you can put a clustered index on them and your inserts will always fall to the bottom level of the data pages and (like you pointed out) you don't have to worry about application changes where the users/app requires a new PK. As has been pointed out to me by Celko though, this is not in line with the relational model as the model says to always use some type of natural key. Interested to hear everyone else's thoughts on this.


    uhoh, hopefully Celko isn't on this forum

    There are no identity columns!

    Identity isn't a datatype, but a column property you can assign to integer datatypes (and/or decimals with a scale of 0).

    Frank

    http://www.insidesql.de

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • Deciding on a surrogate key for performance reasons can really be a trap. This is especially true if the datamodel is of any complexity and includes subtyped entities. This can require very complex joins when you need data from the top level supertype and a subtype 4 levels down. I have seen many instances where non-composite surrogate keys were used, and then to reduce the number of joins required to retrieve related data, the keys of grandparent tables were brought down into the grandchild table. This totally destroys the relational integrity of the design by allowing an invalid grandparent to be associated to the grandchild. Of course this can be checked using a trigger, but enforced relational integrity is more efficient through declared RI and composite keys than triggers.

    I have also found that the developers find it much harder to understand the table structures when all the tables are "cluttered" by the introduction of surrogate keys and the relationship heirarchy is not readily apparent from the PK.

    As for clustering, this can be the greatest reason for wanting to use the natural key. Suppose you have a three level table structure and you need to retrieve all the data related to a top level row. If all the tables are using and clustered by natural, composite keys, the disk I/O's would be sequential, and possibly aided by prefetch for large numbers of rows. On the other hand, with non-composite surrogate keys at each level, you would not be able to cluster in this manner and most likely all the disk I/O would be random and synchronous. This can be a major deciding factor if the system you are building has a significant batch component. With composite natural keys, you can allows retrieve data from two tables in the relationship heirarchy with a single two table join. A four level heirarchy using surrogate keys would require joining four tables to retrieve data from the top and bottom tables. A two table join using two slightly larger indexes is going to much more efficient than having to join four tables and use four smaller ones. And the number four assumes that a composite index has been built on the surrogate PK of each table and the FK of its parent. If not, the data pages of each intermediate table would have to be read as well.

    On the other hand, if you are building an small OLTP only system, you would not see a significant performance difference by choosing surrogate over non-surrogate, so why not use the natural keys and keep the data model clean?

    The biggest issue SQL2K has when handling multiple columns in the clustered index is the work-around MS implemented for the performance problem it had when rebuilding the clustered index. Having the non-clustered indexes store the key values of the clustered index instead of a pointer to the page and row is unacceptable. This causes the non-clustered indexes to be much larger than they should be and almost doubles the number of pages that must be read before getting to the data. But that's another topic.

    Ian Dundas
    Senior IT Analyst - Database
    Manitoba Public Insurance Corp.

  • For transaction systems, natural keys are very easy for referentials such as customer, vendor, account etc.

    Surrogate keys may be easy when you need to join two tables with composed primary keys. Just create a surrogate key and then create a unique index on all the columns that compose the natural primary key.

    Two pros:

    1. Joining the tables is with a single table1.Surrogate_ID = table2.SurrogateID. Easier to do than an avalanche of AND AND AND AND AND AND's in the WHERE clause.

    2. Natural keys are available as foreign keys to the main referential tables. The unique index will act as a "nephew" primary key. You can cluster on this one instead of the surrogate ID primary key.

    For Datawarehouse and Datastores, always use surrogate keys. This will make your reporting solutions less vulnerable to quirks in the OLTP systems.

    Regards,

    Geert

  • I use surrogate keys almost exclusively when I have design control over a system.

    I find that composite keys make it more difficult to get good performance out of a join. Maybe I just didn't try hard enough when it became an issue (working on someone else's database), but by reducing the key relationship to a single integer based pkey -> fkey, a lot of joins got a lot faster. This is despite trying to optimize indexes on the tables specifically for the join in question.

    I have also found that when I need to revise what defines a row as unique in a table, it's a lot easier to change procedure and application code to match, sometimes not requiring any major changes, since the definition of a parent row is hidden, to an extent, by the surrogate key. One case in point... I had to take a table that originally was to define data on a quarterly basis and change it so that it could represent data on a monthly basis. Since I was using a composite unique key (the natural key for the table) tied to a surrogate key, it wasn't that difficult. I added a field for the month in the quarter and expanded the unique index to include it. I then copied the data rows in the parent table, modified the non-key fields as appropriate, and assigned child rows to the parent rows based on the user specification.

    I know, a better design specification could have avoided the entire incident, but that's a fight that, while I haven't given up on entirely, just can't seem to be won. In an environment where your (the DBA) client is a department that doesn't know what it needs, but they need it right now because their client (the 'real' client) doesn't know what they need either, but needs it right now, flexibility in design, that allows for potentially dramatic changes in design to be accomplished with the least amount of effort, is desirable.

    Note that while I use surrogate keys for my pk, I almost never cluster it, unless it's the simplest sort of lookup table. And even then, I'll still cluster the natural key if it’s a single field and will frequently be used to filter a query, vs. just being looked up.

    You may have inferred that I almost always create a clustered unique index over the natural key in a table... Identity columns generally make poor clustered keys since they don't logically group data in most circumstances. You want your clustered index to be reasonably selective and 'chunky' so when you're filtering on it, SQL will quickly grab that chunk of the table and use it for lookups into your other tables.

    Another big reason I use surrogate keys is exactly what ccundy refers to: updating primary keys. I'm rather uncomfortable with the concept of cascading updates; I don't want the engine doing too much of what I'm supposed to be controlling. Also, I prefer the concept of what I was taught as ID/UD... ID = the internal identifier for a record (the surrogate key, generally an int); UD = user identifier, the nice, friendly user-land handle for a row. Users have this nasty tendency to want to change the looks of their handles, either for convenience, or just to make the developer's life hard, and using a surrogate key trivializes enabling this without endangering the integrity of the data.

    ian_dundas raises a very valid point in regards to 'flattening' joins by including grandparent ids in grandchild tables. If that relationship were to no longer be valid, then you would get essentially inconsistent results depending on how you queried the tables. Especially in the case where you queried the top couple of tables vs. the bottom couple of tables... the aggregates wouldn't match up compared to going straight from the top table to the bottom table in the scenario illustrated. Something like this should only ever even be considered, IMHO, in a warehousing environment, and even then, you probably ought to just flatten the whole thing out anyway, and not leave it pseudo normalized.

    The main question I wrestle with is when to use an int identity, and when to use a guid as my surrogate. I prefer to go lean and mean with the int, and sometimes the smaller integer types when appropriate, because I have a feel that they're coded more efficiently in the engine. I also like that they take up less space, and, generally, are going to be smaller when transmitting them via text as in XML or as a post command from a web form. The downside being... have you ever tried to merge two DBs that have the same types of data in the same schema that used 100% int columns? It’s not real pretty. GUIDs make this a breeze, however.

    Another case for GUIDs is that its much easier to relate across databases with them, such as if you create a centralized user database in an environment where you need to use the same set, or various, overlapping subsets of those users and it doesn't make sense, or is infeasible due to legacy applications, to house them all in the same physical database. You may not get DRI, but at least if you try to look up a user's GUID in, for instance, your company's product database, you won't find a match!

    Ok, I think that’s quite enough from me for now... I hope that somewhere in all that was something useful to you.

    Matthew Galbraith

  • Good discussion. I tend to use surrogate keys. I share the view about int's vs guid's. I use both, but so far inconsistently. My current plan is that if I have a reasonable idea that the table either won't be replicated or will be replicated as a read only copy I typically use ident, if I think it will be writeable on a remote box I go with a guid. Ident key ranges are a pain!

    It's not all about performance. Close, but not all.

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • Sometimes the decision is made for you. If you need to track Type 2 slowly changing dimensions, then surrogate keys are required. To get "current" data, create and index on the natural key and have some sort of "Is_Current" column.

    Don

    http://www.biadvantage.com

    Donhttp://www.biadvantage.com

  • I think both sides of this coin are valid. I hate to "denormalize" the database with surrogate keys but they are pretty darned useful when integrating to other systems or for synchronizing data to remote platforms when you cannot be networked into the database all the time. Oracle has the concept of "rowid" that seems to me would be useful in SQL. These "identifiers" are hidden from the schema and are not recommended for general use but are created automatically. They come in handy when transferring rows to another database and you want to ensure they made it without coding the multi-column keys.

  • quote:


    The two MAIN reasons I like identity columns is that you can put a clustered index on them and your inserts will always fall to the bottom level of the data pages and (like you pointed out) you don't have to worry about application changes where the users/app requires a new PK. As has been pointed out to me by Celko though, this is not in line with the relational model as the model says to always use some type of natural key. Interested to hear everyone else's thoughts on this.

    Ray Higdon MCSE, MCDBA, CCNA


    A clustered index on an identity column can be a problem if page level locking is used. Every insert will likely be into the same page.

    I would not waste the clustered index on the identity. I look for the most used query with an "order by" that returns several records. The order by columns is a good candidate for a clustered index.

    As to the question, I like to have both. Having an identity column PK that the apps don't get to touch makes writing triggers a lot easier. I know how to identify the before and after row pairs with this "invisible" id. If the data "keys" (perhaps a regular index) are updated, this can be impossible unless only one record is updated. The order in the deleted and inserted tables is not reliable.

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • I am inclined to place surrogate keys (by which I mean an abstract unique ID, usually an int column with the IDENTITY property) on almost every table I create.

    As others have noted, I have never been entirely comfortable with using user-controllable data as a primary key. Users, in my experience, are notoriously fickle as to their commitments to not change unique values used as keys.

    My formal training in normalization (if I had any!) is some 16 years in the past. I would have a great deal of difficulty being convinced that using a user-alterable data value as a primary and foreign key could be considering as aspect of normalizing a database - personally, I would consider that as storing critical data in two places, thus denormalizing the db.

    As an example, in the databases I use, it is not uncommmon for a user's email address to be the major unique identifier of the user. I would never use that to link, for example, rows in another table identifying products that user had bought back to the user table itself. This may change rarely, but it does change - and, to me, one of the major points of normalization is to avoid a need to update a single chunk of data in five, ten, fifteen different tables!

    I would concur with the advice others have given - whenever possible, have a secondary unique key the the primary key can be considered an "alias" for, and almost never include the int/guid primary key in a clustered index.

    The one type of table I don't add an IDENTITY-generated primary key to is a simple relationship table - the kind of table that, at its core, simply ties two primary keys together. An example might be a groupmember table, which simply uses user IDs and group IDs to indicate which users belong to which groups. The combination of the two keys ina nd of itself forms an adequate primary key, and those keys should never change - if a user isn't in a group any more, the key fields shouldn't be changed, because the record should simply be removed.

    R David Francis


    R David Francis

  • Hi there,

    I thought I'd throw in some oil in the fire and post a reply from the great Joe Celko on the MS newsgroups that (might) say it all. Anyway, get yourself a cup of coffee, have fun and enjoy

    quote:


    >> I'm trying to change an existing field [sic] to an

    IDENTITY field [sic], but I'm getting syntax errors. It's got to be

    simple, but somewhere I'm missing something. <<

    What you are missing is the basic concepts of the relational model.

    Columns are not fields; rows are not records; tables are not files. An

    IDENTITY property cannot be a key by definition. A key is a subset of

    attributes that uniquely define an entity in your data model.

    The IDENTITY column is a holdover from the early programming language

    which were very close to the hardware. For example, the fields in a

    COBOL or FORTRAN program were assumed to be physically located in main

    storage in the order they were declared in the program.

    The early SQLs were based on existing file systems. The data was kept

    in physically contiguous disk pages, in physically contiguous rows, made

    up of physically contiguous columns. In short, just like a deck of

    punch cards or a magnetic tape.

    But physically contiguous storage is only one way of building a

    relational database and it is not always the best one. But aside from

    that, the whole idea of a relational database is that user is not

    supposed to know how things are stored at all, much less write code that

    depends on the particular physical representation in a particular

    release of a particular product.

    One of the biggest errors is the IDENTITY column (actually property) in

    the Sybase family (SQL Server and Sybase). People actually program with

    this "feature" and even use it as the primary key for the table! Now,

    let's go into painful details as to why this thing is bad.

    The practical considerations are that IDENTITY is proprietary and

    non-portable, so you know that you will have maintenance problems when

    you change releases or port your system to other products.

    But let's look at the logical problems. First try to create a table

    with two columns and try to make them both IDENTITY. If you cannot

    declare more than one column to be of a certain datatype, then that

    thing is not a datatype at all, by definition. It is a property which

    belongs to the PHYSICAL table, not the data in the table.

    Next, create a table with one column and make it an IDENTITY. Now try

    to insert, update and delete different numbers from it. If you cannot

    insert, update and delete rows from a table, then it is not a table by

    definition.

    Finally create a simple table with one IDENTITY and a few other columns.

    Use a few statements like

    INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');

    INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');

    INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

    to put a few rows into the table and notice that the IDENTITY

    sequentially numbered them in the order they were presented. If you

    delete a row, the gap in the sequence is not filled in and the sequence

    continues from the highest number that has ever been used in that column

    in that particular table.

    But now use a statement with a query expression in it, like this:

    INSERT INTO Foobar (a, b, c)

    SELECT x, y, z

    FROM Floob;

    Since a query result is a table, and a table is a set which has no

    ordering, what should the IDENTITY numbers be? The entire, whole,

    completed set is presented to Foobar all at once, not a row at a time.

    There are (n!) ways to number (n) rows, so which one do you pick? The

    answer has been to use whatever the physical order of the result set

    happened to be. That non-relational phrase "physical order" again.

    But it is actually worse than that. If the same query is executed

    again, but with new statistics or after an index has been dropped or

    added, the new execution plan could bring the result set back in a

    different physical order. Can you explain from a logical model why the

    same rows in the second query get different IDENTITY numbers? In the

    relational model, they should be treated the same if all the values of

    all the attributes are identical.

    Using IDENTITY as a primary key is a sign that there is no data model,

    only an imitation of a sequential file system. Since this number exists

    only as a result of the state of particular piece of hardware at a

    particular time, how do you verify that an entity has such a number in

    the reality you are modeling?

    To quote from Dr. Codd: "..Database users may cause the system to

    generate or delete a surrogate, but they have no control over its value,

    nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp

    409-410) and Codd, E. (1979), Extending the database relational model to

    capture more meaning. ACM Transactions on Database Systems, 4(4). pp.

    397-434. This means that a surogate ought ot act like an index; created

    by the user, managed by the system and NEVER seen by a user. That means

    never used in queries.

    Codd also wrote the following:

    "There are three difficulties in employing user-controlled keys as

    permanent surrogates for entities.

    (1) The actual values of user-controlled keys are determined by users

    and must therefore be subject to change by them (e.g. if two companies

    merge, the two employee databases might be combined with the result that

    some or all of the serial numbers might be changed.).

    (2) Two relations may have user-controlled keys defined on distinct

    domains (e.g. one uses social security, while the other uses employee

    serial numbers) and yet the entities denoted are the same.

    (3) It may be necessary to carry information about an entity either

    before it has been assigned a user-controlled key value or after it has

    ceased tohave one (e.g. and applicant for a job and a retiree).

    These difficulties have the important consequence that an equi-join on

    common key values may not yield the same result as a join on common

    entities. A solution - proposed in part [4] and more fully in [14] - is

    to introduce entity domains which contain system-assigned surrogates.

    Database users may cause the system to generate or delete a surrogate,

    but they have no control over its value, nor is its value ever displayed

    to them....." (Codd in ACM TODS, pp 409-410).

    References

    Codd, E. (1979), Extending the database relational model to capture more

    meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

    The most common use that a Newbie makes of IDENTITY is to use it as a

    record number (under the error that a record nubmer is a key!), so that

    he does not have to think about keys, DRI, check digits, proper data

    types, international standards and all that hard stuff.

    While this was meant as an abstract example, I also fear that you have

    not read ISO-11179 because of the silly, redundant, dangerous prefixes

    on your code.

    --CELKO--


    Frank

    http://www.insidesql.de

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • I'd enjoy the post more if it didnt include "silly".

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • quote:


    I'd enjoy the post more if it didnt include "silly".


    Yes, it seems that Joe Celko must have had a good day. Do you feel better when I remove this?

    Frank

    http://www.insidesql.de

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply