Lots of Key Lookups vs. UniqueIdentifier Clustered Index

Question

Lots of Key Lookups vs. UniqueIdentifier Clustered Index

Viewing 12 posts - 31 through 42 (of 42 total)

You must be logged in to reply to this topic. Login to reply

Bhuvnesh SSC Guru Points: 59351 More actions · Answer 1

Jeff Moden (11/22/2012)
Either put the CI on the GUID with a decent FILL FACTOR to prevent rampant page splits during inserts

what would be that decent valure ? in our environment we use 75% , is it ok ? OR are there other deciding fator too for this ?

-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

TomThomson SSC Guru Points: 104768 More actions · Answer 2

CELKO (11/22/2012)
My question is: is it better to have all of these key lookups happening or would it be better to just build the clustered indexes on the UUIDs?
Neither. This nightmare is how non-RDBMS people design schemas to fake their familar pointer chains. Neither UUID (intented for web apps only) or IDENTITY (physical record insertion count for one file on one machine) are keys by definition.

Stuff and nonsense! You clearly haven't read all the information in this thread. There is a clear statement of a requirement for unpredictability in the key (and a key it most certainly is, because it is what an end user uses to identify the row containing it, conforming exactly to Codd's original definition of a key - and it's because it identifies a row that unpredictability is required).

This unpredictability (randomness) requirement is extremely common in applications that use an RDBMS to store personal information (whether they are web applications or not - unpredictable keys are somewhat older than the web). An index like this one is a component of the security wall that maintains privacy. Of course usually isn't the only such component, but for security and privacy each small addition to an attacker's problems is valuable.

Tom

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 3

Jeff Moden (11/22/2012)
dave-L (11/22/2012)
Hi Jeff. That's my whole question really. Is there a good way to fix this and is it worth trying to fix.
True using a UDF to generate the UUIDs will have more over head than native NEWID(), but my thought was I would save so much time on the scores of thousands of key lookups that occur because my CI is on a useless identity that this would be worth it.
Dave
I'm still confused about it all. I thought you said that the GUID from the GUI was absolutely required and has been used in thousands of places throughout the code. The only way that you'd get rid of the Key Lookups is to do one of two things. Either put the CI on the GUID with a decent FILL FACTOR to prevent rampant page splits during inserts or build a covering index with the GUID as the leading column. All of this other work will do nothing to get rid of the Key Lookups.

I think where things are going is that if the GUID is made to be sequential then the CI can be again redefined using the GUID resulting in the key lookups going away and fragmentation due to mid-page page splits also going away meaning the FILLFACTOR does not need to be modified solely to ease things up for a non-sequential GUID.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 4

Bhuvnesh (11/22/2012)
Jeff Moden (11/22/2012)
Either put the CI on the GUID with a decent FILL FACTOR to prevent rampant page splits during inserts
what would be that decent valure ? in our environment we use 75% , is it ok ? OR are there other deciding fator too for this ?

I don't know if 75% for your environment is OK or not because "It Depends" on the environment. For example, it depends on things like how many INSERTs there are vs how often you rebuild or reoganize the index. It also depends on how many pages of data you have especially for indexes that have a UUID for a leading column. Since they are so very random, it may be that you can make the FILL FACTOR larger because INSERTs will be distributed quite randomly and evenly if there are a lot of pages of data. The same page probably won't be hit multiple times in many cases. Having a larger FILL FACTOR is better for performance of SELECTs because fewer pages will need to be read.

Also, just to be sure, I wouldn't set the defaults for the server environment to 75%. I would only change an index from the default of "0/100" in situations like this or where a table may suffer a lot of updates to variable length columns.

Last but not least, I believe that sequential UUIDs would be a mistake here because it will hammer pages that are not at the logical "end" of the table. Here's the explanation from Books Online.

Creates a GUID that is greater than any GUID previously generated by this function on a specified computer since Windows was started. After restarting Windows, the GUID can start again from a lower range, but is still globally unique. When a GUID column is used as a row identifier, using NEWSEQUENTIALID can be faster than using the NEWID function. This is because the NEWID function causes random activity and uses fewer cached data pages. Using NEWSEQUENTIALID also helps to completely fill the data and index pages.

Notice the part where it says that a GUID sequence can start at a lower range after a Windows restart. That means that new GUIDs could go to the logical "middle" of a table which would cause severe page splitting either on a non-clustered index or on the data itself if it's a clustered index.

I also want to note that this entry in Books Online has a mistake on it. I can't but my hads on the URL right now, but even Microsoft has stated that, although the likely hood of it happening is quite small, GUIDs produced by most Microsoft Products are no longer guaranteed to be "globally unique" because they're no longer based on the Mac Address of the computer like the old Type 1 GUIDs where. They are now Type 4 GUIDs (for security reasons) and Type 4 GUIDs are nothing more than a fairly unpredictable, incredibly large domain, pseudo-random number (one machine won't repeat it until all of the values have been used). It is actually possible (though highly improbable) that two or more machines could end up producing a pair of matching GUIDs. Despite the low probability, I always put a UNIQUE index on GUID columns that will be used as key values (which you should do with any key column, BTW).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 5

opc.three (11/24/2012)
Jeff Moden (11/22/2012)
dave-L (11/22/2012)
Hi Jeff. That's my whole question really. Is there a good way to fix this and is it worth trying to fix.
True using a UDF to generate the UUIDs will have more over head than native NEWID(), but my thought was I would save so much time on the scores of thousands of key lookups that occur because my CI is on a useless identity that this would be worth it.
Dave
I'm still confused about it all. I thought you said that the GUID from the GUI was absolutely required and has been used in thousands of places throughout the code. The only way that you'd get rid of the Key Lookups is to do one of two things. Either put the CI on the GUID with a decent FILL FACTOR to prevent rampant page splits during inserts or build a covering index with the GUID as the leading column. All of this other work will do nothing to get rid of the Key Lookups.
I think where things are going is that if the GUID is made to be sequential then the CI can be again redefined using the GUID resulting in the key lookups going away and fragmentation due to mid-page page splits also going away meaning the FILLFACTOR does not need to be modified solely to ease things up for a non-sequential GUID.

I think that using sequetial GUIDs would be a very bad idea for this because of what I said in my previous post above. Restarting Windows can cause the sequence to restart at a lower value than the previous sequence which would cause new rows to be inserted into the logical "middle" ("begining" on the first restart) of the table if the CI were on that column. Since they would again be sequential, page splits would be rampant right after the restart. I don't believe they'd settle down much after that, either. I would be an interesting test over time.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 6

Jeff Moden (11/24/2012)
opc.three (11/24/2012)
Jeff Moden (11/22/2012)
dave-L (11/22/2012)
Hi Jeff. That's my whole question really. Is there a good way to fix this and is it worth trying to fix.
True using a UDF to generate the UUIDs will have more over head than native NEWID(), but my thought was I would save so much time on the scores of thousands of key lookups that occur because my CI is on a useless identity that this would be worth it.
Dave
I'm still confused about it all. I thought you said that the GUID from the GUI was absolutely required and has been used in thousands of places throughout the code. The only way that you'd get rid of the Key Lookups is to do one of two things. Either put the CI on the GUID with a decent FILL FACTOR to prevent rampant page splits during inserts or build a covering index with the GUID as the leading column. All of this other work will do nothing to get rid of the Key Lookups.
I think where things are going is that if the GUID is made to be sequential then the CI can be again redefined using the GUID resulting in the key lookups going away and fragmentation due to mid-page page splits also going away meaning the FILLFACTOR does not need to be modified solely to ease things up for a non-sequential GUID.
I think that using sequetial GUIDs would be a very bad idea for this because of what I said in my previous post above. Restarting Windows can cause the sequence to restart at a lower value than the previous sequence which would cause new rows to be inserted into the logical "middle" ("begining" on the first restart) of the table if the CI were on that column. Since they would again be sequential, page splits would be rampant right after the restart. I don't believe they'd settle down much after that, either. I would be an interesting test over time.

I agree 100% with regard to NEWSEQUENTIALID(). It is not a good option.

However the OP is talking about producing their own function to generate sequential GUIDs that would not be affected by a restart. The tradeoff of course is that all GUID-generation would require a roundtrip to the database which has its own set of baggage.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 7

opc.three (11/24/2012)
However the OP is talking about producing their own function to generate sequential GUIDs that would not be affected by a restart. The tradeoff of course is that all GUID-generation would require a roundtrip to the database which has its own set of baggage.

Considering that the OP has also identified that the existing GUIDs have been used in "thousands of lines of code", I'm hoping that he won't consider such a thing. It would even break GUI code unless they write a wad of "Instead Of" triggers to covert the UUIDs the GUI code is already passing.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 8

Jeff Moden (11/24/2012)
opc.three (11/24/2012)
However the OP is talking about producing their own function to generate sequential GUIDs that would not be affected by a restart. The tradeoff of course is that all GUID-generation would require a roundtrip to the database which has its own set of baggage.
Considering that the OP has also identified that the existing GUIDs have been used in "thousands of lines of code", I'm hoping that he won't consider such a thing. It would even break GUI code unless they write a wad of "Instead Of" triggers to covert the UUIDs the GUI code is already passing.

Obviously I cannot see this particular codebase but for the ones I have worked with where this technique was employed the GUID generation was typically centralized so rewriting that piece (or those pieces) of code would have been trivial. Getting a new GUID for purposes of adding new rows to a database is usually a small percentage of the work the application would need to do. The 'get' logic, i.e. the joining of existing GUID-columns, usually comprises the vast majority of application logic but hopefully none of that will need to change.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 9

Wouldn't it just be easier and safer to set the FILL FACTOR appropriately and make sure that reindexing was occuring on a regular basis? That would't require any code changes and, therefor, no regression testing.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 10

It's quite possible that leaving the FILLFACTOR at 100 would make the most sense. With the SSD in play, the page splits and ensuing fragmentation may be less of a concern than absorbing a 25% space-hit to the buffer pool for each index where the FILLFACTOR is lowered.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 11

opc.three (11/24/2012)
It's quite possible that leaving the FILLFACTOR at 100 would make the most sense. With the SSD in play, the page splits and ensuing fragmentation may be less of a concern than absorbing a 25% space-hit to the buffer pool for each index where the FILLFACTOR is lowered.

At first, and even though it goes against my inner data-troll, I thought that would be fine especially considering the blazing speed of SSDs. It might still be fine if you have a reorg going on a regular basis. My concern is that when you have a lot of pages of GUIDs, inserting a lot more GUIDs will cause page splits possibly on almost every page because of the extremely random distribution. In theory, if you have 10,000 pages of GUIDs where the CI is on the guid and you insert just 10,000 new GUIDs, it could cause a page split on each and every page. Now you suddenly have the space equivalent of a 50% FILL FACTOR on some relatively very expensive hardware.

Considerig the speed that a reorg or rebuild might happen on the SSD, that might still be the way to go, though.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

TomThomson SSC Guru Points: 104768 More actions · Answer 12

Jeff Moden (11/24/2012)
opc.three (11/24/2012)
However the OP is talking about producing their own function to generate sequential GUIDs that would not be affected by a restart. The tradeoff of course is that all GUID-generation would require a roundtrip to the database which has its own set of baggage.
Considering that the OP has also identified that the existing GUIDs have been used in "thousands of lines of code", I'm hoping that he won't consider such a thing. It would even break GUI code unless they write a wad of "Instead Of" triggers to covert the UUIDs the GUI code is already passing.

I think I disagree, Jeff.

If id generation is one place, at most three triggers are needed; that's pehaps a bit less than a wad of triggers. Of course there's also a view (since there's no reasonable way to do partitioning between new GUIs and old GUIs within one table, it has to be two separate tables) and that's what is operated on. Maybe fewer than three triggers are needed (I can't remember how good SQL Server is at doing updates and deletions on a union view without trigers - certainly updates and deletions on unions have been done in other dbs without needing a trigger, so maybe only the insert trigger to trap duplicate keys is needed; and in fact a try catch could be used instead of that trigger if the insert statement used the new table instead of the union view, but I doubt that would be much cheaper in development, testing, or performance than inserting into the view and using the trigger to detect duplicates and direct to the new table). The overhead of the insertion trigger is pretty small too: just a check whether the generated value is already in the old-keys table, and if it is regenerate (doesn't need a new GUID, just adding one to the sequential parameter of the generator) and check again, and the check will hit a match only about one time in ten to the power thirty three even if there are as many as a billion rows in the old table, so the performance overhead is small).

I could be wrong here - I've never done this trick with GUIDs.

Tom