Introduction to Indexes: Part 2 – The clustered index

  • Brian.Eriksen (11/11/2009)


    Anyways, I just thought that since summary-detail table relationships are very common that it's probably worth calling out this exception to the uniqueness guideline specifically. What do you think?

    Those are guidelines, not rules. If you feel that ignoring them is for the better, then go ahead. Of those guidelines, the uniqueness one is the one I most often ignore.

    As I stated, I hold to the school that says the cluster is used to organise the table and the nonclustered indexes to retrieve data. I've done the summary/detail design before, often I use nonclustered indexes on the foreign key column. Other times I've used the cluster.

    ...

    ...

    ...

    It depends. 😀

    If you've tested and that is the best way to organise things and the fragmentation (if the cluster is not ascending) and increased width of the cluster is not an issue, go right ahead. Just be aware of the tradeoffs

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Gail,

    (I promise I'll give up soon) - Am I missing something (as some colleagues have already suggested).

    If an identity column is used for the clustered index (CI) and NC indexes use the identity column (key) rather than a RID as the retrieval pointer then doesn't that mean that when the NC index is searched and the key (Identity column) is retrieved, it then needs to traverse the CI tree down to the leaf level to retrieve the actual row required as the 'identity column' alone wouldn't physically locate the row, it's just part of the data (leaf level of the CI)?

    Jim

    Trainee Novice:w00t:

  • Esalter (11/11/2009)


    one thing I've frequently come across that would be extremely useful is the ability to define Foreign keys that are a compound key (concatenated columns). Do you know if this is available in current versions of SQL (or ever likely to be)?

    Multi-column Foreign keys? That's been possible for many versions.

    Create table Parent (

    ParentID int identity primary key,

    ChildID1 int,

    ChildID2 int,

    SomeArbString varchar(10)

    )

    Create table Child (

    ChildID1 int not null,

    ChildID2 int not null,

    SomeDate datetime

    )

    ALTER TABLE Child ADD CONSTRAINT pk_Child PRIMARY KEY (ChildID1, ChildID2)

    ALTER TABLE Parent ADD CONSTRAINT fk_Parent_Child FOREIGN KEY (ChildID1, ChildID2) REFERENCES Child (ChildID1, ChildID2)

    Got nothing to do with indexes though

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (11/11/2009)


    Esalter (11/11/2009)


    I'm surprised that non clustered indexes don't also use the RID to provide a more efficient use of space.

    A RID is 8 bytes. An integer column defined as identity (a very good and popular choice for clustered index) is 4 bytes.

    This is of course countered by the need to update all non clustered indexes when a row is moved to a different page as the result of a split operation

    And the need to rebuild every single nonclustered index when the cluster is rebuilt. Painful in terms of logging.

    The need to update all the nonclustered indexes is, I believe, one of the reasons the RID is not used as the row 'pointer' when there's a clustered index. Even when the underlying table is a heap, when a row is moved the RID in the NC indexes is not updated, rather a forwarding pointer is left behind which says where the row has moved to.

    i don't think i ever found a straight answer to this question

    if i have a table with a clustered index and say 5 NC indexes

    i run alter index <clustered_index> on <some_table> rebuild

    will this also rebuild all the non-clustered indexes as well? i've noticed that the clustered index rebuilds take a lot longer to complete

  • Esalter (11/11/2009)


    If an identity column is used for the clustered index (CI) and NC indexes use the identity column (key) rather than a RID as the retrieval pointer then doesn't that mean that when the NC index is searched and the key (Identity column) is retrieved, it then needs to traverse the CI tree down to the leaf level to retrieve the actual row required as the 'identity column' alone wouldn't physically locate the row, it's just part of the data (leaf level of the CI)?

    The identity column would identify the row (it's the clustering key). To get the rest of the row, SQL would do a lookup to the cluster. That's called a bookmark lookup in SQL 2000 and a key lookup in SQL 2005 and above. The physical location of the row is totally irrelevance.

    That's discussed in this article. The section titled 'lookups'

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks for that - I know it's not index related but it was prompted by reading another contribution (summary/detail tables).

    Jim

    Trainee Novice:w00t:

  • SQL Noob (11/11/2009)


    if i have a table with a clustered index and say 5 NC indexes

    i run alter index <clustered_index> on <some_table> rebuild

    will this also rebuild all the non-clustered indexes as well?

    It depends.

    SQL 2000:

    Unique clustered index: No (but buggy in earlier service packs)

    Non-unique clustered index: Yes

    SQL 2005+:

    Unique clustered index: No

    Non-unique clustered index: No

    http://www.sqlskills.com/BLOGS/PAUL/post/Indexes-From-Every-Angle-What-happens-to-non-clustered-indexes-when-the-table-structure-is-changed.aspx

    i've noticed that the clustered index rebuilds take a lot longer to complete

    Well it is the biggest index on the table, the one with the most leaf pages. Rebuilding it will take longer than the smaller nonclusters.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • most of our clustered indexes are from the 2nd school. unique, ascending, etc

    but i do break the rules sometimes and created clustered indexes on date columns and we had a big project last year where we went from a unique column to a non-unique column. in a table that ranges from 180,000,000 rows to almost 300,000,000 rows during the month the clustered index key sometimes has 8 million rows in the table.

    the reason is that an application needs to use only one column in a where clause and in testing we saw that after 1.5 million rows or so returned it would go to an index scan. We tried an index hint, but didn't work very well. the application was also changed to run more threads at once and it resulted in a lot of blocking as the select statement would lock a lot of rows.

    changed to a clustered index and it has worked out great. a lot less blocking and we now have select statements that return 8 million rows are a clustered index seek and run in minutes.

    edit: originally we had a unique bigint identity column as the PK/clustered index. the problem select statements had a where clause on a different column that was an int or a bigint but not unique. changed the table structure so that the PK was a non-clustered index and the problem column got a clustered index on it

  • Sadly we're still on SQL2000

    Most every thing Gail said applies to 2000. I'm setting on a bunch of 2000 deployments and a load of 2005 deployments. These things also apply to the Compact Edition. That is the version of SQL Server that can run on a cell phone (mobile).

    Gail I know that this is an overview but it's one hell of a good overview (5 stars). I know it's going to come up in the discussion, and already has to an extent, so let me take a shot at the the whole "traversing the tree" thing.

    Given that SQL server uses 8K pages, and taking a heap table as the example you can take a SWAG at index sizing. You can fit a lot of nodes in a page. At the root each node in the page points to yet another index page. If there is another level then all of the nodes on this level point to yet more pages. On and on.

    I did some studies years (well decades) ago and found that it was rare to have indexes that were tremendously deep. Even so, in almost all the cases we looked at the tree index was better than doing a binary lookup on an ordered list.

    ATBCharles Kincaid

  • Charles Kincaid (11/11/2009)


    Sadly we're still on SQL2000

    Most every thing Gail said applies to 2000. I'm setting on a bunch of 2000 deployments and a load of 2005 deployments.

    Indeed. The main thing that doesn't is the include columns, but that's mainly the next article. (and filtered indexes, but that's also the next article)

    Gail I know that this is an overview but it's one hell of a good overview (5 stars).

    Thanks. Glad they're of use.

    I did some studies years (well decades) ago and found that it was rare to have indexes that were tremendously deep. Even so, in almost all the cases we looked at the tree index was better than doing a binary lookup on an ordered list.

    Also the non-leaf levels will very likely be in cache most of the time. They're small and frequently used (in comparison to the leaf pages)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail,

    Complete nubee here. What does "narrow" mean in the "considerations for selecting the clustering key". What would be useful is to show in the graphic what happens as you add fields to the index. Is the data added to the Clustered Inkex Key Value nodes? For example, if I create a key by Zip5, then add Zip4, then Add Address.

    Another question, if I have an "Identity" integer PK, and I am going to include that in the clustered key, I assume (now, after reading your article) that I need to specify Unique, AND that is the only field that should be included in the clustered key?

    Given both an Identity PK and a set of fields like Zip5 / Zip4 / Addr, why would I select one or the other as the Clustered key? The PK would be often used for Joins, the address data for sorts.

    I'm conflicted. 😉

    And finally, what does all of this have to do with physical storage on disk and getting at the data. All these articles mention "the heap" vs ... (unknown something). Does this imply that if stored on "the heap" then fields of a table are scattered around, i.e. the entire row is placed piecemeal on the disk but if there is a clustered index then the fields are stored sequentially on the disk? Does the disk not even enter into this discussion? I cannot imagine SQL Server having any influence on where something is stored on the disk itself. Are we talking about in the structure of the databse file? I am never finding the answers to these kinds of questions, it is just assumed that I already understand this part, though where I would learn this part is a mystery to me.

    Thanks for the article. I have read at least a dozen such articles and they all assume that I know more than I do.

    jwcolby54

  • John Colby (11/11/2009)


    What does "narrow" mean in the "considerations for selecting the clustering key".

    It refers to the width of the column, ie the number of bytes it takes. 4 byte integer is narrow, a 500 byte char is quite a bit wider.

    What would be useful is to show in the graphic what happens as you add fields to the index. Is the data added to the Clustered Inkex Key Value nodes? For example, if I create a key by Zip5, then add Zip4, then Add Address.

    If you do that, you have a composite (multi-column) key consisting of 3 columns. Where I show the single clustered index key column, you would have 3 columns.

    Another question, if I have an "Identity" integer PK, and I am going to include that in the clustered key, I assume (now, after reading your article) that I need to specify Unique

    If the clustered index is unique, specify that it is unique. Don't make it wider than it needs to be

    Given both an Identity PK and a set of fields like Zip5 / Zip4 / Addr, why would I select one or the other as the Clustered key? The PK would be often used for Joins, the address data for sorts.

    3 zip codes and an address is definitely not narrow. You can create the cluster on that, just not a very good choice. As I mentioned in the article and stated in the discussion, my take on the cluster is that it is used to organise the table. Use the nonclustered indexes to access the data. Most of the time.

    And finally, what does all of this have to do with physical storage on disk and getting at the data. All these articles mention "the heap" vs ... (unknown something).

    Heap vs table with a clustered index. I thought I'd covered that in this article.

    Does this imply that if stored on "the heap" then fields of a table are scattered around, i.e. the entire row is placed piecemeal on the disk but if there is a clustered index then the fields are stored sequentially on the disk?

    No, absolutely not. Rows are always stored in pages (8kb chunks of the disk), not haphazardly all over the data file. Did you read the first part of this series?

    Are we talking about in the structure of the database file?

    Structure within the data file.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • With comments like this I run for the exit:

    … good clustering. One school says to put the clustered index on the column or set of columns that would be most useful for queries, either frequently run queries or ones doing queries of large ranges of data. The other school says to use the clustered index primarily to organise the table and leave data access to the nonclustered indexes.I hold to the second school…

  • hlam-1032421 (11/11/2009)


    With comments like this I run for the exit:

    … good clustering. One school says to put the clustered index on the column or set of columns that would be most useful for queries, either frequently run queries or ones doing queries of large ranges of data. The other school says to use the clustered index primarily to organise the table and leave data access to the nonclustered indexes.I hold to the second school…

    ...and why would that be? Care to expand upon your statement?

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • The difference between hierarchy database and relational database should tell you why no one should be in that school #2.

    The real 2 or 3 schools in question should have been most useful, and/or most important, and/or less io/cpu/footprint.

Viewing 15 posts - 16 through 30 (of 122 total)

You must be logged in to reply to this topic. Login to reply