December 28, 2005 at 8:09 pm
We have an application that has no primary keys. It has unique keys and indices and was written this way intentionally - not something I (or most DBAs) normally recommend; but, alas, it works very well. Now that we are replicating data into a reporting store we (the replicating team) need to add primary keys. The argument from the original developers is that adding the primary key will negatively affect performance.
My question is "will it?" and if it won't, what argument do we give? I mean, it has to be done or the data cannot be replicated; but, still it would be good to provide some level of comfort to all concerned.
December 28, 2005 at 8:18 pm
Tell the original developers that they are full of it! A primary key is physically implemented by a unique index but with the addition that NULLs are not allowed.
If there is already a unique index, there will be no difference in performance either way.
/*****************
If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek
*****************/
December 28, 2005 at 8:37 pm
"The argument from the original developers is that adding the primary key will negatively affect performance."
Acually, declaring primary keys, foreign key constraints and insuring that all columns are not null can improve performance because the SQL Server optimizer will use this information in building the query execution plan.
Here is a simple example using the Northwind database.
First, look at the execution plan for this SQL where the join is executed as a nested loop:
select orders.orderid
from customers
joinorders
on customers.customerid = orders.customerid
where EmployeeID = 5
go
Now, change column customerid in table Orders to "not null" by running:
if exists (select * from dbo.sysindexes where name = 'CustomerID' and id = object_id('dbo.Orders'))
drop index dbo.Orders.CustomerID
if exists (select * from dbo.sysindexes where name = 'CustomersOrders' and id = object_id('dbo.Orders'))
drop index dbo.Orders.CustomersOrders
alter table orders
alter column CustomerID nchar(5) not null
CREATE INDEX CustomerID ON dbo.Orders(CustomerID) WITH FILLFACTOR = 90
GO
Now look at the execution plan again - The customer table will no longer be accessed !
Now remove the foreign key constraint between Customers and Orders:
ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Customers
The join is executed as a nested loop again !!
re-add the constraint:
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerID) REFERENCES Customers
SQL = Scarcely Qualifies as a Language
December 29, 2005 at 2:25 am
If you don't specify otherwise a primary key will be clustered. If the columns that you are proposing to become primary keys are not naturally ordered then having them as a clustered index would eventually degrade performance due to index fragmentation.
I would have a regular defragging job set up in any case regardless of whether your primary key column is naturally ordered.
I would say that every table should have a primary key, clustered or otherwise. Unless your unique column needs to be NULLable I would swap your unique indexes/constraints for primary key constraints.
December 29, 2005 at 10:22 am
What David says is true, but it has nothing to do with primary keys versus unique indexes per se. His is an argument to make sure your indexing strategy makes sense. Indexing is obviously realted to key definition, but it is a separate issue...
Also, fragmentation isn't always a bad thing, contrary to popular belief. Fragmentation tends not to hurt OLTP systems much, if at all, because their data access tends to be very random. Fragmentation absolutely kills data warehousing or reporting systems that tend to read large chunks of data.
/*****************
If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek
*****************/
December 29, 2005 at 11:49 am
This is great stuff! I want to thank you all; but I don't want to curtail your comments either. This is why we hang out in places like this For the record, the change that is being proposed by "the replication team" is the addition of a GUID, if not just an auto incrementing token. There is no plan to incorporate it into the existing application. It is being added ONLY to aid in replication. And it is presumed that we are able to do so without causing any major retest (regression testing) of that application. Now, with that known, are there any other obvious implications to be aware of? For instance, if we were at the maximum number of indices on a table, and the addition of a primary key caused an existing index to become unusable... I doubt this to be the case; but what other issues do you see?
December 29, 2005 at 12:20 pm
Hang on, if you put a clustered primary key on a table where the primary key isn't naturally ordered doesn't it increase the page splits?
So although fragmentation per se does not affect cause a performance drop the side effect being increased page splits does.
Or am I talking rubbish?
December 29, 2005 at 12:51 pm
Yes it can, but that depends on a number of factors such as the type of activity being performed, and the amount of freespace in the index.
I wasn't disagreeing with what you said so much as I was just trying to point out that it may or may not apply in any given case.
Page splits are an unavoidable fact of life when dealing with indexes that can be inserted or updated. Nonclustered indexes are equally subject to page splitting as clustered indexes. The primary difference is that when a clustered index page splits, all the nonclustered indexes must be updated as well. This is the primary reason that index maintenance is still very important in OLTP systems. It's more a factor of maintaining an appropriate amount of free space to accomodate inserts and updates rather than eliminating fragmentation. In a data warehousing situation the fragmentation becomes the enemy and you generally want very little free space. Of course many systems combine some of both so the challenge becomes finding that happy compromise...
There is a common misconception that clustered indexes maintain a physical ordering of the data. This in turn leads some people to think that page splits for clustered indexes are somehow different (worse) than page splits in a non clustered index. Clustered indexes maintain a LOGICAL ordering of the data via a double linked list, so in either case a page split involves copying part of the data from the page to the new page and updating the pointers.
/*****************
If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek
*****************/
December 29, 2005 at 12:57 pm
As a rule I don't like GUIDs as primary keys, but like most rules, there are exceptions...
I won't claim to be a replication expert (I have only used Snapshot replication, and that very rarely), but I don't think that any type of replication requires a GUID; or would be fundamentally benefitted by its presence.
You should be able to add columns (GUIDs or otherwise) to a table without affecting the current application as long as the applications queries avoid the "SELECT *" approach.
I have never even come close to the maximum number of nonclustered indexes on a table (249), so I can't say for sure what whould happen when you try to create number 250. My guess is that you'd simply get an error rather than invalidating an existing index.
/*****************
If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek
*****************/
December 29, 2005 at 1:30 pm
Maybe I can shed some light on this (I'm on the replication team) and give you some more details.
Lets say there's a table that we want to replicate that does not have a primary key. For replication to function, the table must have a PK. We want to add this needed bit of information without impacting the application, so the intention is to add an autoincrementing field and make that a PK. This way, the order the records are in the table is the same order that the PK will be numbered.
The application that uses this database won't use this field and it will never show up in a query.
My position is that the addition of this field have have zero impact on performance (good or bad) because of what we are doing and how we are intending to do it.
Validation (or letting me know I'm completely full of it, either way) will be greatly appreciated.
December 29, 2005 at 1:54 pm
So I'm going to guess you are wanting to use transactional replication.
David's initial post indicated that there were unique indexes but not PKs. If this is the case, and the column(s) over which the unique indexes are defined are in fact the key, then it makes more sense from my perspective to define them as primary keys and be done. Defining them as primary keys will not change anything about the performance (assuming you use the same type of index to define the primary key as was used to enforce uniqueness).
Adding another column and defining it as a primary key will add some additional overhead as it is another index to maintain, and additional data to store etc... Whether this additional overhead is an issue in this case, I have no idea. Also, anyone contemplating the use of IDENTITY or GUID columns as a primary key needs to realize that surrogate keys do not enforce uniqueness, they create it.
HTH
/*****************
If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek
*****************/
December 30, 2005 at 6:06 am
"You should be able to add columns (GUIDs or otherwise) to a table without affecting the current application as long as the applications queries avoid the "SELECT *" approach."
dcpeterson makes an excellent point here. I would add one more warning. A 'gotcha' that eventually comes around to bite those untrained in SQL is an "INSERT tablename VALUES(1,2,3...)" (where the column names are not explicitly listed).
Hopefully this is addressed in your coding standards but you should be aware that adding columns to tables involved in such statements will cause problems.
From a frustrated DBA's point of view, I kinda like hearing the complaints from a developer who spent hours/days tracking down the new 'bug' in his code only to find that his INSERT doesn't work like it always has. This is a chance to remind the developer that coding standards exist for a reason and that the time spent debugging is the price he pays for the 2 minutes he saved writing a sloppy command.
That sounded much harsher than intended. I just wanted to point out that this may happen in your case and if you can prevent it, great. If prevention isn't possible, this may give you a quick insight into insertion problems.
Good luck with the project.
--When life hands you lemons, fire the DBA! 🙂
December 30, 2005 at 2:47 pm
Good point! I neglected to mention the INSERT problem...
"Don't distract me with standards, they just slow me down!" Of course almost everbody who has anything to do with development efforts will feel this way to some extent, those who are smart enough to learn from history know that standards actually save time in the long run. That is, if the standards aren't overly rigid... But always specifying each column in your projection list is pretty much a no-brainer.
/*****************
If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek
*****************/
January 4, 2006 at 4:31 am
dcpeterson's excellent responses (as always) have pretty much said everything I would like to say about this, specifically that the easiest and in my view best solution would be to simply let (one of) the unique indexes that already exist become the primary key. And do not forget to slap the original developers for not including a primary key in the first place.
However, I just want to add one little point that is probably a little bit off-topic here. But I could not let the following comment pass without responding to it:
This way, the order the records are in the table is the same order that the PK will be numbered.
The "records" of a table are not ordered in any specific way at all, at least not logically. Sure, the rows are physically stored in some order, but when you query them (or manipulate them) you have no guarantees as to which order you will receive the rows in. Unless a SQL statement includes an ORDER BY clause there are no guarantees about the order of the rows it produces.
January 4, 2006 at 10:54 am
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply