GUID clustered indexes vs. RID

  • (Just posted a similar comment on Glen Berry's blog at http://www.sqlservercentral.com/blogs/glennberry/archive/2010/3/22/why-uniqueidentifier-is-a-bad-choice-for-a-clustered-index-in-sql-server.aspx. This is not meant as a double post, just not sure if responses to older blog articles ever get read....)

    I've inherited a system where GUID values are regularly (almost exclusively) used as IDs (primary keys) on tables. We generally don't create a clustered index on the GUID -- and I've seen numerous articles that confirms that decision. However, the GUIDs work great since we regularly merge data from multiple SQL instances into a single, master instance, and there are a lot of relations between tables that need to be maintained during that merge -- and we don't want to have to write all sorts of FK conversion scripts during the import.

    My question: Is there any disadvantage to leaving the table with NO clustered index and letting SQL assign the RID, which I believe will dictate the physical order of the data and become the actual lookup value used by other indexes? Is that any more or less efficient than adding an additional int/bigint identity field and creating a clustered index on that?

    Rob Schripsema
    Propack, Inc.

  • Rob Schripsema (10/4/2010)


    (Just posted a similar comment on Glen Berry's blog at http://www.sqlservercentral.com/blogs/glennberry/archive/2010/3/22/why-uniqueidentifier-is-a-bad-choice-for-a-clustered-index-in-sql-server.aspx. This is not meant as a double post, just not sure if responses to older blog articles ever get read....)

    I've inherited a system where GUID values are regularly (almost exclusively) used as IDs (primary keys) on tables. We generally don't create a clustered index on the GUID -- and I've seen numerous articles that confirms that decision. However, the GUIDs work great since we regularly merge data from multiple SQL instances into a single, master instance, and there are a lot of relations between tables that need to be maintained during that merge -- and we don't want to have to write all sorts of FK conversion scripts during the import.

    My question: Is there any disadvantage to leaving the table with NO clustered index and letting SQL assign the RID, which I believe will dictate the physical order of the data and become the actual lookup value used by other indexes? Is that any more or less efficient than adding an additional int/bigint identity field and creating a clustered index on that?

    In general, you're better off with a clustered index than not. SQL Server is optimized around the concept of the clustered index, so, while you can get by without one, you're paying a cost. As to what to put it on, I don't know that I'd say that you should add a column, just because. Instead, I'd look at putting the cluster on the most frequently used access path (maybe, even if it is the GUID). Clustering a GUID can create a serious issues around page splits. But, according to Adam Machanic, he doesn't see serious negative impact, most of the time. Personally, I saw a system with HUGE negative impact from clustering a GUID, so I'm a bit gun shy.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant,

    Thanks for the reply. Sounds like there isn't complete consensus on whether GUIDs are valid for clustered indexes or not. So far, we've not seen significant performance impact where we've used them -- but we're not dealing with huge volumes of data, either. I'll keep monitoring and experimenting.

    BTW, appreciated your book on SQL Server Execution Plans. Clear and concise, unlike many tech books out there. And the knowledge gained has been extremely useful! Thanks again!

    Rob Schripsema
    Propack, Inc.

  • Rob Schripsema (10/14/2010)


    Grant,

    Thanks for the reply. Sounds like there isn't complete consensus on whether GUIDs are valid for clustered indexes or not. So far, we've not seen significant performance impact where we've used them -- but we're not dealing with huge volumes of data, either. I'll keep monitoring and experimenting.

    BTW, appreciated your book on SQL Server Execution Plans. Clear and concise, unlike many tech books out there. And the knowledge gained has been extremely useful! Thanks again!

    Wow! Thank you. I love hearing that the book is useful. I'm going to start rewriting it in a few weeks. This can act as motivation.

    The main two problems with GUIDs are the width, which is not that big a problem, but it's worth noting, and the fragmentation. The fragmentation is not automatic, but is very possible and can be, not will be, extremely problematic. So you're right. Test, test, test and then decide.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • If you search on a covered non-clustered index on a table with a clustered index, you get a bookmark lookup where the index goes and looks up the row in the table.

    If you are search on a covered non clustered index on a heap, the bookmark lookup does not occur. So thats one less operation = in some cases better performance.

    This is a very isolated (but valid) situation where a table can benefit from being a heap.

  • nick.mcdermaid (10/14/2010)


    If you search on a covered non-clustered index on a table with a clustered index, you get a bookmark lookup where the index goes and looks up the row in the table.

    If you are search on a covered non clustered index on a heap, the bookmark lookup does not occur. So thats one less operation = in some cases better performance.

    This is a very isolated (but valid) situation where a table can benefit from being a heap.

    Two points, first, it's worse than you said. There are actually, if I remember Adam Machanic's lecture correctly, three additional reads for a lookup operation, so they can be extremely expensive, depending.

    Second, a heap does not rescue you from lookup operations. Same situation occurs in a heap. If you have a non-covering non-clustered index, it must do the additional reads against the heap. It's called a RID lookup, for Row ID lookup, the additional, hidden, column added to heaps to uniquely identify the rows in order to provide a key value for indexes. There might be situations where having a heap is a benefit, but this is not one of them.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • nick.mcdermaid (10/14/2010)


    If you search on a covered non-clustered index on a table with a clustered index, you get a bookmark lookup where the index goes and looks up the row in the table.

    If you are search on a covered non clustered index on a heap, the bookmark lookup does not occur. So thats one less operation = in some cases better performance.

    Please educate me. I understood that if you have a covering index (one that contains all of the outputs called for by the query) then SQL wouldn't have to do ANY lookup. Why would that change when there's a clustered index on the table? Wouldn't SQL just grab the query outputs directly from the index and bypass any heap/table lookup altogether?

    Rob Schripsema
    Propack, Inc.

  • Rob Schripsema (10/15/2010)


    nick.mcdermaid (10/14/2010)


    If you search on a covered non-clustered index on a table with a clustered index, you get a bookmark lookup where the index goes and looks up the row in the table.

    If you are search on a covered non clustered index on a heap, the bookmark lookup does not occur. So thats one less operation = in some cases better performance.

    Please educate me. I understood that if you have a covering index (one that contains all of the outputs called for by the query) then SQL wouldn't have to do ANY lookup. Why would that change when there's a clustered index on the table? Wouldn't SQL just grab the query outputs directly from the index and bypass any heap/table lookup altogether?

    Oops. I read that and inserted the "non" automatically with my eyeball. Yeah, you're right. A covering index is a covering index. Heap or not, cluster or not.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • That's what I thought. Thanks for the clarification, Grant.

    Rob Schripsema
    Propack, Inc.

  • Predictability can be an issue when using a uniqueidentifier for clustering key.

    They (ms) provided the new "new sequential id" to help with that, so when you start inserting a bunch of new rows, you would end up physically kind of sequential, so pages might be allocated more optimal.

    However, the "starting" point of this sequential operation still is "random".

    My guideline: avoid uniqueidentifiers for clustering index and choose a column you actually know how it behaves in time. ( cfr identity, datetime (if you have control over its sequentialness))

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Let's say (hypothetically :cool:) that I have a table with 100 million rows, averaging about 800 bytes each. It has an ID field which is a GUID -- and we use that as our clustered index. The nature of the table is that it is "fed" with about 2 million new rows each night between midnight and 3am (feeds from remote locations -- ergo, we get the benefit of GUIDs being unique though generated across multiple systems). During the workday (5am to 5pm) the primary activity on the file is reads -- lots of queries. A few inserts here and there (a few hundred per day). Between 5pm and midnight, very little access.

    Would we overcome the downside of using the GUID as a clustered index if we set a fillfactor and pad_index of, say 85% on that table/index, and then every couple of days, during the down time, ran an index rebuild? That would 'spread out the data' a bit, ideally enough that the insertion of non sequential GUID keys would not cause a significant number of page splits.

    Is that a reasonable line of thinking?

    Rob Schripsema
    Propack, Inc.

  • That's a reasonable strategy, and it's how people tend to deal with lots of inserts that might cause page splits.

    Can you do sequential GUIDs in the remote locations? That might minimize the page splitting/fragmentation as well.

  • The GUIDs are generated by client side apps (.NET/C#) in an application that has been running smoothly for a couple of years now -- I don't think the developers would be willing to re-architect that to let SQL gen the IDs. There is actually a 'set' of records for multiple tables that is generated as the components of an object in the client app, and they all get written to their respective tables with FK's pre-assigned as part of a single operation. I believe that's the reason they went with the GUID-as-key concept in the first place -- the multi-location benefit only became apparent as the app spread to other locations.

    Rob Schripsema
    Propack, Inc.

  • I was just wondering if each location could generate sequentials from .NET. They wouldn't be all sequential, but you'd get "bunches" from each location, and that might limit, not eliminate, some of the page splitting.

  • By default, sqlserver will fill pages to 100%.

    You can change this default behaviour using sp_configure 'fill factor', ...

    This is an advanced setting.

    ref: http://msdn.microsoft.com/en-us/library/aa196711%28SQL.80%29.aspx

    If you have the space available, and 85% doesn't accomplish what you aimed for, you may even want to consider putting it at 50%.

    Keep in mind to document your modifications, so their not altered by someone who didn't remember this investigation.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply