Clustered index on identity column in large tables

Question

Clustered index on identity column in large tables

Pieter Huiberts

Old Hand

Points: 396
More actions
December 18, 2007 at 9:14 am

#180007

Hello,
I am having large tables that are typically 300-500 columns by 150-200mln records (~300-600GB). These are data extracts that I receive periodically and there are no updates to the data. I am typically adding a identity column to link derived data back to its source record.
MS SQL2005 recommends a clustered index followed by regular indexes approach. Adding a clustered index to such large table is difficult b/c of the disk space required to create the clustered index. I am wondering what the benefits and penalties would between the scenarios below.
Any insight would be appreciated.
Thanks!
Scenario 1
Create table
load data
create clustered index
create non-clustered indexes
Scenario2:
Create table
Create clustered index
Load data
Create non-clustered indexes

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Jack Corbett SSC Guru Points: 184394 More actions · Answer 1

How are you loading the data, bcp, bulk insert, sql? How long do you have to complete the load? How is the data used after the load? How does the identity column allow you to link back to the source record?

I would do something close to scenario 2, create table, clustered index, and non-clustered indexes and then load the data. Since you are using an Identity Field as the clustered index having the clustered index on the table at insert should not affect load performance because Identity is sequential. Having the non-clustered indexes before the load may slow the performance as it created the index pages.

I would also do my insert in batches.

As usual you should test both and see which one is faster.

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

einman33 SSCrazy Points: 2813 More actions · Answer 2

Does a Clustered indenitiy really do anything for you? Have you experimented with the Cluster on a real SARG and just having the ident nonclust?

Grant Fritchey SSC Guru Points: 398911 More actions · Answer 3

You really should have a clustered index, but since you get only one, you need to make sure you put it in the right place. The PK is clustered by default, but it doesn't have to be. How is your table accessed? What is the most common select statement run against the table? If it is a clean select from the PK, then leave the cluster there, but if some other column, a company name, a foreign key from another table, or some combination of columns is the more likely avenue for select statements, that's probably a better place for the clustered index. You need to evaluate the design as well as the query structure. The time to load the table and the disk space should be secondary considerations to getting that cluster right. Remember, if you put it in the wrong place, it not only wastes space & time, but it will hurt your performance since you'll see a lot more scans and lookups.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

colin.Leversuch-Roberts SSC Guru Points: 52643 More actions · Answer 4

you might like to scan through this analysis I did on adding clustered indexes to tables and the size/choice of clustered key

http://grumpyolddba.co.uk/sql2005/working%20with%20indexes%204.mht

the work I did was based upon tables having between 100 and 400 million rows.

[font="Comic Sans MS"]The GrumpyOldDBA[/font]
www.grumpyolddba.co.uk
http://sqlblogcasts.com/blogs/grumpyolddba/

Shawn.Shi SSC Journeyman Points: 77 More actions · Answer 5

Partitioning on the identity field (partitioning key) is a good choice for such a big table. You can put each partition on different file group and disk (or disk control) to increase I/O throughput. The primary key on this identity field will be with each partition since it is clustered index. To further improve I/O for data loading and indexing, you can put any non-clustered indexes into different file groups (I would suggest aligned partitioning rather than storage aligned).

Since this table has so many columns, you can also consider vertical partitioning.

Shawn Shi