What is the argument against IDENTITY columns

Question

Post reply

What is the argument against IDENTITY columns

David.Poole

SSC Guru

Points: 76035
More actions
December 30, 2005 at 9:56 am

#90274

I have seen vague inferrences that IDENTITY columns are a bad idea as a primary key.
I can see that argument that they create uniqueness rather than enforce it but are there any other links or articles that give more information on this issue?
LinkedIn Profile

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

noeld SSC Guru Points: 96590 More actions · Answer 1

Oh man are we going to get back to the religious battle again

The following is CELKO's opinion:

>> I'm trying to change an existing field [sic] to an

IDENTITY field [sic], but I'm getting syntax errors. It's got to be

simple, but somewhere I'm missing something. <<

What you are missing is the basic concepts of the relational model.

Columns are not fields; rows are not records; tables are not files. An

IDENTITY property cannot be a key by definition. A key is a subset of

attributes that uniquely define an entity in your data model.

The IDENTITY column is a holdover from the early programming language

which were very close to the hardware. For example, the fields in a

COBOL or FORTRAN program were assumed to be physically located in main

storage in the order they were declared in the program.

The early SQLs were based on existing file systems. The data was kept

in physically contiguous disk pages, in physically contiguous rows, made

up of physically contiguous columns. In short, just like a deck of

punch cards or a magnetic tape.

But physically contiguous storage is only one way of building a

relational database and it is not always the best one. But aside from

that, the whole idea of a relational database is that user is not

supposed to know how things are stored at all, much less write code that

depends on the particular physical representation in a particular

release of a particular product.

One of the biggest errors is the IDENTITY column (actually property) in

the Sybase family (SQL Server and Sybase). People actually program with

this "feature" and even use it as the primary key for the table! Now,

let's go into painful details as to why this thing is bad.

The practical considerations are that IDENTITY is proprietary and

non-portable, so you know that you will have maintenance problems when

you change releases or port your system to other products.

But let's look at the logical problems. First try to create a table

with two columns and try to make them both IDENTITY. If you cannot

declare more than one column to be of a certain datatype, then that

thing is not a datatype at all, by definition. It is a property which

belongs to the PHYSICAL table, not the data in the table.

Next, create a table with one column and make it an IDENTITY. Now try

to insert, update and delete different numbers from it. If you cannot

insert, update and delete rows from a table, then it is not a table by

definition.

Finally create a simple table with one IDENTITY and a few other columns.

Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');

INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');

INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

to put a few rows into the table and notice that the IDENTITY

sequentially numbered them in the order they were presented. If you

delete a row, the gap in the sequence is not filled in and the sequence

continues from the highest number that has ever been used in that column

in that particular table.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)

SELECT x, y, z

FROM Floob;

Since a query result is a table, and a table is a set which has no

ordering, what should the IDENTITY numbers be? The entire, whole,

completed set is presented to Foobar all at once, not a row at a time.

There are (n!) ways to number (n) rows, so which one do you pick? The

answer has been to use whatever the physical order of the result set

happened to be. That non-relational phrase "physical order" again.

But it is actually worse than that. If the same query is executed

again, but with new statistics or after an index has been dropped or

added, the new execution plan could bring the result set back in a

different physical order. Can you explain from a logical model why the

same rows in the second query get different IDENTITY numbers? In the

relational model, they should be treated the same if all the values of

all the attributes are identical.

Using IDENTITY as a primary key is a sign that there is no data model,

only an imitation of a sequential file system. Since this number exists

only as a result of the state of particular piece of hardware at a

particular time, how do you verify that an entity has such a number in

the reality you are modeling?

To quote from Dr. Codd: "..Database users may cause the system to

generate or delete a surrogate, but they have no control over its value,

nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp

409-410) and Codd, E. (1979), Extending the database relational model to

capture more meaning. ACM Transactions on Database Systems, 4(4). pp.

397-434. This means that a surogate ought ot act like an index; created

by the user, managed by the system and NEVER seen by a user. That means

never used in queries.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as

permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users

and must therefore be subject to change by them (e.g. if two companies

merge, the two employee databases might be combined with the result that

some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct

domains (e.g. one uses social security, while the other uses employee

serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either

before it has been assigned a user-controlled key value or after it has

ceased tohave one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on

common key values may not yield the same result as a join on common

entities. A solution - proposed in part [4] and more fully in [14] - is

to introduce entity domains which contain system-assigned surrogates.

Database users may cause the system to generate or delete a surrogate,

but they have no control over its value, nor is its value ever displayed

to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture more

meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

The most common use that a Newbie makes of IDENTITY is to use it as a

record number (under the error that a record nubmer is a key!), so that

he does not have to think about keys, DRI, check digits, proper data

types, international standards and all that hard stuff.

While this was meant as an abstract example, I also fear that you have

not read ISO-11179 because of the silly, redundant, dangerous prefixes

on your code.

--CELKO--

Although there are some valid points in it nothing in MY opinion is absolute some problems are better handled with the correct use of identities specially when performance is at stake

Cheers,

* Noel

Gabor Nyul SSCrazy Eights Points: 9459 More actions · Answer 2

I have no problems with Identity. I'm using it fairly often.

The Only thing you have to know is, that you cannot remove it once set (no alter table command is ther to remove an identity option, you can only remove the column itself.) The only way is to use EM or to go through a quite complicated create, rename process.

You have to design your identity colums if you want to integrate data from several databases (like replication)

Otherwise you can use it if you need a fast counter like column as a primary key

Bye
Gabor

k_vasanthkumar Old Hand Points: 349 More actions · Answer 3

Identity columns are good for Primary keys ...

But Identity columns have following constraint

1) You can create only one identity column per table

DrawBacks:

*********

1) Suppose you have created a identity column with INT as datatype and if the table has lot of inserts then int would be overflowen,the only option would be to set it as BIGINT and the same error can occur if number of rows cross the BIGINT size.

This problem would not occur if you have a varchar datatype

2) The other problem is that suppose if you have inserted some 100000 rows into the table and if you delete all the rows and if you try to insert a new row the row number would start from 100001 ..

3) There might be issues if you delete certain rows ..There would be gaps.

suppose if you have

1

2

3

4

5

6

if you delete row 2 & 4 there would be gaps ...You have to create a procedure which would populate the gaps to avoid inconsistency..

These are some of the problems that can occur with a identity column..

If the table is a small table then identity is a good choice ..

Cheers ...

Vasanth

Steve Jones - SSC Editor SSC Guru Points: 736479 More actions · Answer 4

They work well as a PK, but that doesn't mean that you shouldn't be looking for a natural key in the data. Dont' make them a clustered index and be sure they are set as the PK with a unique index.

If you are looking for a "row number" counter, or expect them to remain in sequence with no gaps, then you are incorrectly using them.

BobAtDBS SSCarpal Tunnel Points: 4590 More actions · Answer 5

The Primary Key has to be a unique, unchanging value. In a perfect world, there would be a "natural key" in any table and you wouldn't need to add a "surrogate key" like an identity column or Guid. But that perfect world just doesn't work out all that often. For a person, you'd think that SSN would be a great natural key, but you'd be wrong. You'll eventually find a person with more than one SSN, or a baby that doesn't have one yet, or someone who gets a new SSN when they go into the witness protection plan, or whatever.

In my world, I design with the idea that the data might need to use replication when done. I use replication to keep a "near hot" standby server, I have customers with disconnected copies of sales data, etc. If you're designing for replication, you know that there is eventually going to be a GUID assigned to every row of every table anyway, you might as well use it as the primary key as well. You can preach theory all you want, but this design has never failed me. Yes, sometimes your table looks like this:

ID GUID

Type varchar(10)

SortOrder int

where the "surrogate key" is longer than the data it's helping connect, but in a world of ever increasing hard disk space I'll deal with it. Besides, if you want to replicate the data you'll get the GUID anyway!

I'll waste a lot of hard drive space over and over, before defining relationships between tables with keys that I "hope" are unique enough and unchanging enough.

Student of SQL and Golf, Master of Neither

David.Poole SSC Guru Points: 76035 More actions · Answer 6

So basically the argument against them is that they are entirely artificial values that are independent of the data?

I'm not sure what the argument is against using them in clustered indexes. Let us suppose I have a lookup table, with fields ID and Description.

Unless I know that the description column is always going to be in order I use the ID column for the clustered index. Surely it is better to have a clustered index on the id column rather than have no clustered index at all?

If I delete records from my lookup table I don't want to be reusing any "gap" records because if I ever have to retrieve historical data I don't want the nightmare of finding out that ID 5 used to mean "Great Britain" but having reused an ID value it now means "European Union - superfluous region"

LinkedIn Profile

BobAtDBS SSCarpal Tunnel Points: 4590 More actions · Answer 7

As far as I can tell, yes, it is the fact that they are entirely artificial values that most bothers the folks that it bothers at all.

Also, the identity property IS platform dependant, and programatically, there are problems knowing which key is which when inserting several records at once. Once again, you don't have that problem with GUIDs, because you can generate them in your application and update related recordsets with foreign keys already in place.

One argument against using Identity columns as a clustered index is that since they are generated more or less in sequence, all inserts come at one page in the database causing a "hot spot" of activity, and in a busy system, locks or blocking. If you use GUIDs as the primary key, and don't have a better candidate for clustered index, I see no problem with using it because each new key will be scattred throughout the table.

I am not experienced in any performance bottlenecks of using Identity columns as clustered indexes, because I rarely use them at all. The previous paragraph is just what I've seen written, and it makes some sense.

Student of SQL and Golf, Master of Neither

ksh SSC Veteran Points: 298 More actions · Answer 8

IDENTITY columns as PK makes no sense as they don't really tell you anything. SGID (system generated IDs), which is what an IDENTITY COLUMN is, usually are essential for properly ordering work in a transaction oriented environment. This has nothing to do with physical location on a disk(ROWID) but has everything to do with the space/time continuum. e.g. If 10 rows got processed the last time a job ran, you want to be able to easily identify the rows via a numerically sequenced value with a logmaster table keeping track. This to me is the primary usage of the IDENTITY column.

If we worry about code being transportable between db engines(IDENTITY columns for instance), no one would use stored procedures. It is best to make full use of the tool you have instead of worrying about features it has that another tool you might use in the future does not have.

David.Poole SSC Guru Points: 76035 More actions · Answer 9

So going back to my lookup table example with an ID and Description column I should be better off putting my clustered index on my description column rather than on the ID column?

I thought that clustered index seeks were faster than index seeks and therefore any join on a lookup table would benefit from the clustered index?

LinkedIn Profile

BobAtDBS SSCarpal Tunnel Points: 4590 More actions · Answer 10

For this exact set of conditions - a LOOKUP table that is probably added to/updated rather infrequently; you've ruled out ever using replication; you've decided that an Identity column is a sufficient surrogate key; there is no obvious natural key - Here's what I'd do:

Make the Identity column the clustered index. Hot Spots are only a problem in heavy inserts. For the key column, you're obviously going to have an index, why waste the space and force an additional read by using a separate index?

Student of SQL and Golf, Master of Neither

Alan Palmer Old Hand Points: 303 More actions · Answer 11

Simple rule - use IDENTITY columns where they are required. If you need one then you will know.

If you are just talking about adding an identity column to a table where the column will not be used in a table join then adding a PK or clustered index to this column is wasteful. It is not clever to add a clustered index to a monotonically increasing number field if it is never used in a join or where clause.

If you wish to save space then a group of repeating key values (natural key) can be replaced with an identity column with a lookup table(s) defined based on the identity key value. Main table has PK of identity and the lookup table(s) have this value as a FK.

This will save space but could possibly hurt performance due to having to join between a minimum of 2 tables. A composite key replacement could mean having to join even more tables.

As for adding your clustered index to your description column - this depends on what your data looks like and whether you do specific searches on description values. If you are doing range searches on your descriptions and don't care about your id's then add your cluster to the description - again there is no definitive rule; it depends on your data and model.

BobAtDBS SSCarpal Tunnel Points: 4590 More actions · Answer 12

By definition - a lookup table is going to be the object of joins and/or where clauses.

Select P.Name, D.Description from Product P Left Join Descriptions D on P.DescriptionID = D.ID

would be expected if, as in the very specific example we were talking about, the Description Table was a lookup table for other tables.

Student of SQL and Golf, Master of Neither

Alan Palmer Old Hand Points: 303 More actions · Answer 13

Thanks BobAtDBS for stating the obvious Was merely trying to add some general information to this topic without going into the specific example of cluster on id or description - the thread did start off very general.

Honing in on the specific example, without knowing how the id/description are going to be used it wouldn't be wise to state what the best way of assigning the Primary Key or clustered index (as they could be different) is.

How many rows are going to be in this lookup table ? What are the volumes and the frequency of updates ?

How many distinct descriptions are there ?

What types of searches will be done against the descriptions ?

Is there a case for a composite index ?

Is it just a straight lookup between identity values in one table and the lookup ?

A covered non-clustered join can be very fast so is definitely worth considering over a clustered index.

Once all details are known then it will be easy enough to state what the best practice is. This is why I tried to keep the answer as general as possible !!

ksh SSC Veteran Points: 298 More actions · Answer 14

I personally try to avoid clustered indexes except where the predominant search method on the table will be on the clustered index. so in your example creating a clustered index on the IDENTITY column used for lookup purposes serves the major purpose of keeping all the data in order, a handy feature for querying. And as it will most likely be the only way the tale is accessed, a slight to miniscule speed gain(unless this table has millions of rows in it).