Order of Clustered Index

  • Let's say, hypothetically, I have a table with three columns--LastName, FirstName, MiddleName and I wanted to create a clustered index on LastName. The table has three people with the last name "Smith."

    Smith, John, Benjamin
    Smith, Sally, Elizabeth
    Smith, Abigail, Colleen

    SQL Server will create an internal id (row id) to distinguish each of the records as part of creating the index. My question is, how does it determine which order to assign the row id? For simplicity, let's just say that the records need to be assigned a row_id of 1,2, and 3. Which of the Smiths get which number? Does it just happen to look at the next field defined (FirstName) even though it's not part of the index and assign it alphabetically, meaning Abigail would get stored first, followed by John, then Sally?

    I know many times the clustered index would also be the primary key so this situation wouldn't occur--but with duplicates, I'm just not sure how SQL Server handles the order of assignment of the internal row id.

    Thanks,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Thursday, August 23, 2018 7:08 PM

    Let's say, hypothetically, I have a table with three columns--LastName, FirstName, MiddleName and I wanted to create a clustered index on LastName. The table has three people with the last name "Smith."

    Smith, John, Benjamin
    Smith, Sally, Elizabeth
    Smith, Abigail, Colleen

    SQL Server will create an internal id (row id) to distinguish each of the records as part of creating the index. My question is, how does it determine which order to assign the row id? For simplicity, let's just say that the records need to be assigned a row_id of 1,2, and 3. Which of the Smiths get which number? Does it just happen to look at the next field defined (FirstName) even though it's not part of the index and assign it alphabetically, meaning Abigail would get stored first, followed by John, then Sally?

    I know many times the clustered index would also be the primary key so this situation wouldn't occur--but with duplicates, I'm just not sure how SQL Server handles the order of assignment of the internal row id.

    Thanks,

    Mike

    I think row id is an Oracle thing.
    You might find something of interest here

  • Mike Scalise - Thursday, August 23, 2018 7:08 PM

    Let's say, hypothetically, I have a table with three columns--LastName, FirstName, MiddleName and I wanted to create a clustered index on LastName. The table has three people with the last name "Smith."

    Smith, John, Benjamin
    Smith, Sally, Elizabeth
    Smith, Abigail, Colleen

    SQL Server will create an internal id (row id) to distinguish each of the records as part of creating the index. My question is, how does it determine which order to assign the row id? For simplicity, let's just say that the records need to be assigned a row_id of 1,2, and 3. Which of the Smiths get which number? Does it just happen to look at the next field defined (FirstName) even though it's not part of the index and assign it alphabetically, meaning Abigail would get stored first, followed by John, then Sally?

    I know many times the clustered index would also be the primary key so this situation wouldn't occur--but with duplicates, I'm just not sure how SQL Server handles the order of assignment of the internal row id.

    Thanks,

    Mike

    How they are assigned is irrelevant, because they have no meaning.  You are trying to impose a meaning (order) on something that has no meaning.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Jonathan AC Roberts - Friday, August 24, 2018 5:20 AM

    Mike Scalise - Thursday, August 23, 2018 7:08 PM

    Let's say, hypothetically, I have a table with three columns--LastName, FirstName, MiddleName and I wanted to create a clustered index on LastName. The table has three people with the last name "Smith."

    Smith, John, Benjamin
    Smith, Sally, Elizabeth
    Smith, Abigail, Colleen

    SQL Server will create an internal id (row id) to distinguish each of the records as part of creating the index. My question is, how does it determine which order to assign the row id? For simplicity, let's just say that the records need to be assigned a row_id of 1,2, and 3. Which of the Smiths get which number? Does it just happen to look at the next field defined (FirstName) even though it's not part of the index and assign it alphabetically, meaning Abigail would get stored first, followed by John, then Sally?

    I know many times the clustered index would also be the primary key so this situation wouldn't occur--but with duplicates, I'm just not sure how SQL Server handles the order of assignment of the internal row id.

    Thanks,

    Mike

    I think row id is an Oracle thing.
    You might find something of interest here

    SQL uses them, too, but I don't think that there is any way to access them.  Row IDs are used to distinguish rows that would otherwise be duplicates.  For instance, when deleting duplicates using a CTE with ROW_NUMBER().

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • SQL calls it a "uniquifier", not a "row id".  The assignment is completely arbitrary -- and totally meaningless to you.  If you're worried about getting "Abigail" Smith returned before "John" Smith, you must include an ORDER BY on the SELECT.  Note that, that would be true even if the clus index was ( LastName, FirstName ).

    The only thing about uniquifiers that really does concern you is that they lead to ghost rows, which is annoying.  In extreme cases, ghosts can also take up a noticeable amount of space.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher - Friday, August 24, 2018 8:13 AM

    SQL calls it a "uniquifier", not a "row id".  The assignment is completely arbitrary -- and totally meaningless to you.  If you're worried about getting "Abigail" Smith returned before "John" Smith, you must include an ORDER BY on the SELECT.  Note, that, that would be true even if the clus index was ( LastName, FirstName ).

    The only thing about uniquifiers that really does concern you is that they lead to ghost rows, which is annoying.  In extreme cases, they can also take up a noticeable amount of space.

    This ^^^

  • ScottPletcher - Friday, August 24, 2018 8:13 AM

    SQL calls it a "uniquifier", not a "row id".  The assignment is completely arbitrary -- and totally meaningless to you.  If you're worried about getting "Abigail" Smith returned before "John" Smith, you must include an ORDER BY on the SELECT.  Note that, that would be true even if the clus index was ( LastName, FirstName ).

    The only thing about uniquifiers that really does concern you is that they lead to ghost rows, which is annoying.  In extreme cases, ghosts can also take up a noticeable amount of space.

    Here's the thing. I'm not worried about data retrieval in the sense of getting "Abigail" Smith returned before "John" Smith or else of course ORDER BY would be the way to ensure that. I'm just curious about how SQL Server stores the data in a clustered index--and I don't think it's completely meaningless to the end user. For example,

    Microsoft states:

    Clustered indexes sort and store the data rows in the table or view based on their key values.
    https://docs.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-2017

    So, let's say I have a heap with the same structure as I mentioned before and it has a million records. Every single record in the table has the last name of "Smith" (and I create a CI with LastName as the only key). SQL Server will attempt to store all of those records as efficiently as possible for data retrieval (which is why, according to the statement above, they "sort and store"). So when it realizes that every single value in the key is the same, what does it do next? Just give each duplicate record a uniquifier and store them as is? How is that any better than the table remaining a heap?

    I feel like there must be some logic in place for when you have a significant number of duplicates values in the clustering key to make data retrieval more efficient for SQL Server (I know the uniquifier distinguishes the records but I don't think it does anything for efficiency). Now I'm by no means an expert on any of this, so if you say, it just randomly assigns the uniquifier and moves on/scans all of the records, then that's cool. I'm just trying to get a better understanding.

    Thanks,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Friday, August 24, 2018 11:38 AM

    Here's the thing. I'm not worried about data retrieval in the sense of getting "Abigail" Smith returned before "John" Smith or else of course ORDER BY would be the way to ensure that. I'm just curious about how SQL Server stores the data in a clustered index--and I don't think it's completely meaningless to the end user. For example,

    Well at the end of the day part of the DBA's job is to determine correct indexing strategies that allow SQL Server to store and retrieve data efficiently.  SQL Server does a lot of stuff but it's not some magical engine that you can just throw data into and it figures everything out for you.  So yes if you were to just make a clustered index on a non unique column with completely identical values in it that would be of zero use to anyone.

  • Mike Scalise - Friday, August 24, 2018 11:38 AM

    So, let's say I have a heap with the same structure as I mentioned before and it has a million records. Every single record in the table has the last name of "Smith" (and I create a CI with LastName as the only key). SQL Server will attempt to store all of those records as efficiently as possible for data retrieval (which is why, according to the statement above, they "sort and store"). So when it realizes that every single value in the key is the same, what does it do next? Just give each duplicate record a uniquifier and store them as is? How is that any better than the table remaining a heap?

    I feel like there must be some logic in place for when you have a significant number of duplicates values in the clustering key to make data retrieval more efficient for SQL Server (I know the uniquifier distinguishes the records but I don't think it does anything for efficiency). Now I'm by no means an expert on any of this, so if you say, it just randomly assigns the uniquifier and moves on/scans all of the records, then that's cool. I'm just trying to get a better understanding.

    Thanks,

    Mike

    You can think of the uniquifier as essentially an identity value but separate for each unique base clus key.  IF there are dup key values, the dups get a uniquifier added (non-dup keys do NOT have a uniquifier).  So, for 1M dups, you'd have a uniquifier of ~1,000,000 on the last of the dups.  Since SQL is assigning new values sequentially, there's no "sorting" done as dup rows are being inserted: new rows get higher values.  In fact, that's what causes the ghost issue: if you delete all the dups, SQL keeps the last one.  I think it does that to know what the next uniquifier value to assign is.  Remember, SQL must insure that dups do not exist for any given clus key.  Rebuilding the table, though, seems to remove the ghosts, at least most of the time.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Makes perfect sense now. Thank you, all, for the replies and information!

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • You're welcome.  I started the confusion when I stated that assignment of uniquifiers was "completely arbitrary" without clarifying it.  I made it sound like the values were random over time.  As you noted, that would indeed be a difficult situation for performance. 

    That's why I clarified myself above, because "arbitrary" is not true over time.  What I really meant was within a single transaction it's arbitrary.  That is, if you insert "Smith, A", "Smith, B" and "Smith, C" at the same time, it's arbitrary which is uniquifier 1, which is 2 and which 3.  That is, you can't count on "Smith, A" being uniq1.  But, you can count on the next set of inserts having higher uniqifier values than previous inserts.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Keep in mind as well that despite any ordering in the clustered index, you might not get rows back in that order. If there is parallelism or the engine finds it more efficient, it could return rows in a different order.

  • Sorry for the delay. My wife went into labor a week early so I've been busy with newborn things for my first child 🙂

    You're welcome. I started the confusion when I stated that assignment of uniquifiers was "completely arbitrary" without clarifying it. I made it sound like the values were random over time. As you noted, that would indeed be a difficult situation for performance. 

    That's why I clarified myself above, because "arbitrary" is not true over time. What I really meant was within a single transaction it's arbitrary. That is, if you insert "Smith, A", "Smith, B" and "Smith, C" at the same time, it's arbitrary which is uniquifier 1, which is 2 and which 3. That is, you can't count on "Smith, A" being uniq1. But, you can count on the next set of inserts having higher uniqifier values than previous inserts.

    No worries. I'm glad you were able to help me understand a little better. Thank you for that!

    Keep in mind as well that despite any ordering in the clustered index, you might not get rows back in that order. If there is parallelism or the engine finds it more efficient, it could return rows in a different order.

    Steve, that's a great point. When it comes to returning rows in a specific order, I'll never rely on the CI. If I can, I try to order in the app layer--it's such an expensive operation in SQL (which I'm sure you know).

    --------------------------------

    On a side note, there are very few people in my personal life that I can bounce questions like this off of (and I have no shortage of questions or general musings), so thank you all again for continually sharing and discussing information about SQL Server...

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Sunday, September 2, 2018 10:18 AM

    Sorry for the delay. My wife went into labor a week early so I've been busy with newborn things for my first child 🙂

    You're welcome. I started the confusion when I stated that assignment of uniquifiers was "completely arbitrary" without clarifying it. I made it sound like the values were random over time. As you noted, that would indeed be a difficult situation for performance. 

    That's why I clarified myself above, because "arbitrary" is not true over time. What I really meant was within a single transaction it's arbitrary. That is, if you insert "Smith, A", "Smith, B" and "Smith, C" at the same time, it's arbitrary which is uniquifier 1, which is 2 and which 3. That is, you can't count on "Smith, A" being uniq1. But, you can count on the next set of inserts having higher uniqifier values than previous inserts.

    No worries. I'm glad you were able to help me understand a little better. Thank you for that!

    Keep in mind as well that despite any ordering in the clustered index, you might not get rows back in that order. If there is parallelism or the engine finds it more efficient, it could return rows in a different order.

    Steve, that's a great point. When it comes to returning rows in a specific order, I'll never rely on the CI. If I can, I try to order in the app layer--it's such an expensive operation in SQL (which I'm sure you know).

    --------------------------------

    On a side note, there are very few people in my personal life that I can bounce questions like this off of (and I have no shortage of questions or general musings), so thank you all again for continually sharing and discussing information about SQL Server...

    Mike

    It's not actually that expensive in SQL Server if it's done correctly, which normally means having the correct index in place.  There are many times when you do an ORDER BY that the sort operator won't appear in the execution plan because SQL Server actually did use the clustered (or whatever) index.  That's not to say that you should ever remove and ORDER BY... things can change where SQL Server will need it to return the result set in the order you want.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Wait.  Your wife went into labor?

    CONGRATULATIONS!

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply