Non-Unique Clustered Indexes and invisible Unique Identifiers

  • To whom it may concern:

    I was reading the Microsoft Press's SQL Server 2000 Performance Tuning and came across a nugget of information (these nuggets are why we read dry book like this over and over) and I need some help understand it.

    Talking about clustered indexes, specifically non-unique clustered indexes, page 342 at the end of paragraph 4 states that:

    "If your clustered index is not unique, SQL Server generates a unique identifier for you.  This unique identifier is not visible to the user."

    First, relative to the "unique identifier," is this a 16-byte UNIQUEIDENTIFIER data type, a Disk::Sector:age reference as in a heap table or is it some other data type.  I can only see it being a UNIQUEIDENTIFIER, Disk::Sector:age reference or a BIGINT because SQL Server has no idea how many rows you will be putting in the table and I don't think columns, invisible or not will modify themselves on the fly.

    Regardless of its type, when you create a non-clustered index, that index must have a reference back to the data.  Is the reference used the clustered index value or the invisible unique identifier created by SQL Server.  It seems to me that SQL Server would use a unique value over a non-unique value as a pointer.

    Secondly, what is the size in bytes of a Heap Disk::Sector:age reference.  Is this 16-bytes like the UNIQUEIDENTIFIER?

    Either way, it is evident that when using a non-unique clustered index there are hidden space considerations that must be taken into account.

    Thanks for you assistance,

    Scott

  • The stupid face was not intended.  Disk::Sector:age should have been Disk-Sector-Page where the - is a double colon.  The ColonP is interpreted as a stupid face.

  • Hi Scott,

    I've read something similar for ORACLE. It said that, basically, there is only one kind of indexes in ORACLE - unique indexes. The same trick is used there, too: the index key is concatenated with the ROWID (a physical location of the row, which includes file id, block id, row id, and maybe something else), ensuring the uniqueness of the index. On ORACLE, there is a "row migration" action: when the row doesn't fit in the block, it is migrated to another block, thus changing the physical position. This should affect the non-unique indexes in some way (the index key should be modified or a pointer should be placed at the old location). I am not sure that similar something exists in MS SQL Server.

    Now, ORACLE is older and more mature database (although there are still issues that are solved better in SQL Server), and we can only guess which vendor invented which concept first :-). Personally, I think that ORACLE is the one that invented most of concepts in DB field, and other follow it.

    Reagards,

    Goce Smilevski.

  • To answer your first question (This is taken from the only book worth reading by MS Press "Inside SQL Server 2000"):

    If your clustered index was not created with the UNIQUE property, SQL Server adds a 4-byte field when necessary to make each key unique. Since clustered index keys are used as bookmarks to identify the base rows being referenced by nonclustered indexes, there needs to be a unique way to refer to each row in a clustered index. SQL Server adds the uniquifier only when necessary—that is, when duplicate keys are added to the table.

    Next is a excerpt from BOL

    Nonclustered indexes can be defined on a table with a clustered index, a heap, or an indexed view. In Microsoft® SQL Server™ 2000, the row locators in nonclustered index rows have two forms:

    • If the table is a heap (does not have a clustered index), the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The entire pointer is known as a Row ID.

    • If the table does have a clustered index, or the index is on an indexed view, the row locator is the clustered index key for the row. If the clustered index is not a unique index, SQL Server 2000 makes duplicate keys unique by adding an internally generated value. This value is not visible to users; it is used to make the key unique for use in nonclustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the nonclustered index.

    Because nonclustered indexes store clustered index keys as their row locators, it is important to keep clustered index keys as small as possible

    To answer your second question, here's another quotation from "Inside SQL Server":

    Since this index is built on a heap, the bookmark is an 8-byte RID. As you can see in Figure 8-7, this RID is fixed-length and is located in the index row immediately after the index key value of '1240 '. The first 4 bytes are the page address, the next 2 bytes are the file ID, and the last 2 bytes are the slot number.

    HTH

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • Or in short form (because I usually am the one with the long answers) it uses the value in the Clustered Index but any duplicate values has the 4 extra bytes to ensure you know which of the two your are talking about when you are in a non-clustered index. (Purposeful stupid face).

  • Hey, my post doesn't even come close in length to the ones from you I remember.

    Apart from this, couldn't have said it better 

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • Thank you for your analysis.  I found the information in Inside SQL Server 2000 yesterday but your analysis added to my knowledge.

     I am going to make a couple or recommendation that will affect our multi-terabyte database and I want to be absolutely sure I have it right.

    In this light I'd like to ask another question.

    I have a table with a compound non-unique clustered key.  The compound key is made of two int, NOT NULL columns.  Each compound key has a duplicate value in the table therefore a "uniqueifier" is created and the sysindexes reflects this by showing 3 columns in the index.  The table has 100M rows

    I also have a non-unique, non-clustered index on the table.  This index is on an int, NOT NULL column.

    These three columns constitute the row. There are no other columns in the table.

    Will the leaf blocks of the non-clustered index b-tree contain an 8-bytes (the clustered index value’s data length) or 12-bytes (the clustered index value’s data length + the uniqueifier’s data length) reference back to the clustered index?

    If the first is correct, then there is just a 4-byte cost for each row.  If the later is correct, then there is a 4-byte cost for each row + a 4-byte cost for each subsequent non-clustered index.

    When you have tables that are hundreds of millions of rows deep, simple int column in the wrong place (in this case the uniqueifier) can have a big impact on storage as well as responsiveness of the table.  When you add this 4-byte cost per index, this requires an additional 381.47Mbs per non-clustered index.

    What I am going to suggest is that an integer data type IDENTITY column be added to each identified table, a unique clustered index created on the IDENTITY column and all non-clustered indexes be rebuilt.  This should have no effect on the storage requirements of the rows of the table (I’ve swapped a uniqueifier for an int IDENTITY column) but should have a large effect on the non-clustered indexes storage requirements.  Further, since the storage requirements of the non-clustered indexes are smaller, more data can be read per block and index scans/seeks should progress faster the before.

    What are your thoughts?

    Thanks,

  • If you create a UNIQUEID and cluster-index it purely to make all useful indexes nonclustered, then this would behave very much like having no clustered index, a 'heap'. I actually support using 'heaps' in some situations but I would probably be howled down by other ex-spurts. Anyway..

    The advantage of a clustered index (real one, not a uniqueid that isnt part of the logical model) is that it helps* arrange the data in an order where read-aheads are more productive (*helps, its not absolute). With an generated ID, or a heap, data is mostly in the order it was written, which may mean more data pages to be read to find the same amount of related data. Considering your primary key currently is non-unique, it seems you should be getting the benefit of the data with the same real key staying physically close together on disk - either in the same data page or in adjacent pages that will be picked up by read-ahead.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply