Clustered index on identity column in large tables

  • Hello,

    I am having large tables that are typically 300-500 columns by 150-200mln records (~300-600GB). These are data extracts that I receive periodically and there are no updates to the data. I am typically adding a identity column to link derived data back to its source record.

    MS SQL2005 recommends a clustered index followed by regular indexes approach. Adding a clustered index to such large table is difficult b/c of the disk space required to create the clustered index. I am wondering what the benefits and penalties would between the scenarios below.

    Any insight would be appreciated.

    Thanks!

    Scenario 1

    Create table

    load data

    create clustered index

    create non-clustered indexes

    Scenario2:

    Create table

    Create clustered index

    Load data

    Create non-clustered indexes

  • How are you loading the data, bcp, bulk insert, sql? How long do you have to complete the load? How is the data used after the load? How does the identity column allow you to link back to the source record?

    I would do something close to scenario 2, create table, clustered index, and non-clustered indexes and then load the data. Since you are using an Identity Field as the clustered index having the clustered index on the table at insert should not affect load performance because Identity is sequential. Having the non-clustered indexes before the load may slow the performance as it created the index pages.

    I would also do my insert in batches.

    As usual you should test both and see which one is faster.

  • Does a Clustered indenitiy really do anything for you? Have you experimented with the Cluster on a real SARG and just having the ident nonclust?

  • You really should have a clustered index, but since you get only one, you need to make sure you put it in the right place. The PK is clustered by default, but it doesn't have to be. How is your table accessed? What is the most common select statement run against the table? If it is a clean select from the PK, then leave the cluster there, but if some other column, a company name, a foreign key from another table, or some combination of columns is the more likely avenue for select statements, that's probably a better place for the clustered index. You need to evaluate the design as well as the query structure. The time to load the table and the disk space should be secondary considerations to getting that cluster right. Remember, if you put it in the wrong place, it not only wastes space & time, but it will hurt your performance since you'll see a lot more scans and lookups.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • you might like to scan through this analysis I did on adding clustered indexes to tables and the size/choice of clustered key

    http://grumpyolddba.co.uk/sql2005/working%20with%20indexes%204.mht

    the work I did was based upon tables having between 100 and 400 million rows.

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Partitioning on the identity field (partitioning key) is a good choice for such a big table. You can put each partition on different file group and disk (or disk control) to increase I/O throughput. The primary key on this identity field will be with each partition since it is clustered index. To further improve I/O for data loading and indexing, you can put any non-clustered indexes into different file groups (I would suggest aligned partitioning rather than storage aligned).

    Since this table has so many columns, you can also consider vertical partitioning.

    Shawn Shi

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply