GUIDs as clustered index

Question

GUIDs as clustered index

Viewing 14 posts - 46 through 59 (of 59 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 1

Sergiy - Friday, June 1, 2018 6:27 PM
ScottPletcher - Friday, June 1, 2018 1:10 PM
Definitely need index overhauls. [And wow do you all love guids!] Would you mind running the script below and posting the results? I always prefer to tune including actual SQL usage, and missing index info, rather than just simple table DDL.
At first analysis, I think the clus indexes should be as follows:
(1) [AMT].[ItemVersions] = ( ChangesetId, ItemId ) {most likely, although it's possible it could be just ( ChangesetId ) <or> ( ChangesetId, ItemTypeId ). Can't be absolutely sure without further info.
Also, verify that you really need Unicode for all those columns. Don't you assign the "Code"? Will you ever use "extreme" chars in it? Of course "Description" and certain other columns might truly need Unicode, in which case of course you still use it.
(2) [AMT].[ItemTags] = (ParentNewId[, NewId]) (the ParentId = NewId from the ItemVersions table). You don't have to use a guid to clus this table, so there's no reason to. You can still include the parent's guid if you really need it in this table for some very odd reason. Join / lookup using NewId rather than the guids. NewId is optional in the index, but some people don't feel comfortable without a unique clustering key. The table itself doesn't really need an identity, but again, some people panic without their identity crutch in the table.
(3) [Questionnaire].[ItemActions] = (ParentNewId[, NewId]). Same comments as table (2).
All these index suggestions are good, but a little bird on a tree out of my window sings that all the queries they're are running have IsDeleted = 0 as a part of WHERE clause.
If the bird is right, it would be quite useful to have IsDeleted as a first column in a clustered index.
It would be especially useful, if non-deleted sets make <20% of all records.

Your looking at the guy that brought Expedia.com to it's knees for 5 minutes with such an index. It wasn't even the clustered index. I wouldn't recommend such a low cardinality column as the leading column of any index.

Also, IMHO, having an IsDeleted column when you also have a DeletedAt and a DeletedBy column just doesn't make sense to me. It's an extra column that has nothing extra to offer. Same goes for that IsActive column.

I also wouldn't keep "deleted" rows in the main table unless they decided to build a proper slowly changing dimension, which it doesn't appear that they've done unless the CreatedAt and DeletedAt are really meant to be an indication of whether a row is active or not (which, again, would totally negate the need for IsDeleted and IsActive columns). Even then, I'd be seriously tempted to remove them from the "main" table and put them in a separate archive table.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Sergiy SSC Guru Points: 110209 More actions · Answer 2

Jeff Moden - Saturday, June 2, 2018 4:54 PM
Your looking at the guy that brought Expedia.com to it's knees for 5 minutes with such an index. It wasn't even the clustered index. I wouldn't recommend such a low cardinality column as the leading column of any index.
Also, IMHO, having an IsDeleted column when you also have a DeletedAt and a DeletedBy column just doesn't make sense to me. It's an extra column that has nothing extra to offer. Same goes for that IsActive column.
I also wouldn't keep "deleted" rows in the main table unless they decided to build a proper slowly changing dimension, which it doesn't appear that they've done unless the CreatedAt and DeletedAt are really meant to be an indication of whether a row is active or not (which, again, would totally negate the need for IsDeleted and IsActive columns). Even then, I'd be seriously tempted to remove them from the "main" table and put them in a separate archive table.

Making it a CI is a crucial part here.
Clustering provides a kind of partitioning for the table.
Having that column at the beginning of a clustered index divide the table into 2 ones, just like you suggested above, but without having actually 2 tables in the schema.

But I agree, there is a lot of consideration to be made before going this way.
And it will work well only if absolute majority of queries mention the relevant condition. And those ones which don't - use some highly selective NC indexes.
I used this approach may be couple of times only, but when it had its place - it's was a spectacular win.

I remember the case of object properties which had validity time frame: StartDate, ExpireDate.
Searches for current set of properties were taking forever, until I've changed the CI to begin with ExpireDate.
"Current" records have ExpireDate NULL or in future.
With number of "current" records < 5% from the total set, most of the queries got an instant boost for at least 20+ times. Actually, much more, because there was less blocking, less waiting for spid's.
And queries which needed to see a history of particular properties for a particular object used NCI with ObjectID for the 1st column. With 20-30 records to return even a key lookup was not a problem.

_____________
Code for TallyGenerator

ScottPletcher SSC Guru Points: 101238 More actions · Answer 3

Jeff Moden - Saturday, June 2, 2018 4:54 PM
Sergiy - Friday, June 1, 2018 6:27 PM
ScottPletcher - Friday, June 1, 2018 1:10 PM
Definitely need index overhauls. [And wow do you all love guids!] Would you mind running the script below and posting the results? I always prefer to tune including actual SQL usage, and missing index info, rather than just simple table DDL.
At first analysis, I think the clus indexes should be as follows:
(1) [AMT].[ItemVersions] = ( ChangesetId, ItemId ) {most likely, although it's possible it could be just ( ChangesetId ) <or> ( ChangesetId, ItemTypeId ). Can't be absolutely sure without further info.
Also, verify that you really need Unicode for all those columns. Don't you assign the "Code"? Will you ever use "extreme" chars in it? Of course "Description" and certain other columns might truly need Unicode, in which case of course you still use it.
(2) [AMT].[ItemTags] = (ParentNewId[, NewId]) (the ParentId = NewId from the ItemVersions table). You don't have to use a guid to clus this table, so there's no reason to. You can still include the parent's guid if you really need it in this table for some very odd reason. Join / lookup using NewId rather than the guids. NewId is optional in the index, but some people don't feel comfortable without a unique clustering key. The table itself doesn't really need an identity, but again, some people panic without their identity crutch in the table.
(3) [Questionnaire].[ItemActions] = (ParentNewId[, NewId]). Same comments as table (2).
All these index suggestions are good, but a little bird on a tree out of my window sings that all the queries they're are running have IsDeleted = 0 as a part of WHERE clause.
If the bird is right, it would be quite useful to have IsDeleted as a first column in a clustered index.
It would be especially useful, if non-deleted sets make <20% of all records.
Your looking at the guy that brought Expedia.com to it's knees for 5 minutes with such an index. It wasn't even the clustered index. I wouldn't recommend such a low cardinality column as the leading column of any index.
Also, IMHO, having an IsDeleted column when you also have a DeletedAt and a DeletedBy column just doesn't make sense to me. It's an extra column that has nothing extra to offer. Same goes for that IsActive column.
I also wouldn't keep "deleted" rows in the main table unless they decided to build a proper slowly changing dimension, which it doesn't appear that they've done unless the CreatedAt and DeletedAt are really meant to be an indication of whether a row is active or not (which, again, would totally negate the need for IsDeleted and IsActive columns). Even then, I'd be seriously tempted to remove them from the "main" table and put them in a separate archive table.

Amen on the "IsDeleted" thing. It's often "just in case" and never really used; all queries specify "WHERE IsDeleted = 0" anyway. That "soft delete" was free back in tape processing days, but not all now: it drives the optimizer crazy at times.

Don't do it. Put deleted rows in a separate table. Have a view that combines the two tables for those 1-in-a-million queries that actually includes deleted rows.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

ScottPletcher SSC Guru Points: 101238 More actions · Answer 4

To the OP:
I still wish you'd run the script I posted earlier and post the results (on Page 4; can't get the link to it to copy correctly right now). Run it against the production db before you make/made any index changes.

The [AMT].[ItemVersions] tables has at least one of the classic signs of needing a different clustered index: multiple (3-4+) non-clus indexes with the same leading column and different include columns.

Another classic sign that a table's clus index needs reviewed is if it clustered by an identity column and is not a standard "master" table. Again, the vast majority of tables are not best clustered by identity but by some other column(s).

As for these tables, in which none are best clus by identity. That's extremely common, much more common than vice versa. Break yourself completely away from the "identity clustering myth". Don't just slap an identity and a gazillion covering indexes on all your tables: think more deeply than that first about what the clus index should really be.

To tune indexes, determine and implement the best clus index first, then determine and apply nonclus indexes. Of course for a table where you've already got real production history, don't ignore that history, use those usage stats too to help determine the needed indexes. And that's why I'd like to see the results of the query against the existing table.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 5

@vsamantha35

You never came back to say what you folks ended up doing nor how it worked or....

I can't speak for the others but I'd find it very interesting. Thanks.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 6

Just as an update, please see the following presentation where I destroy the myth of Random GUID Fragmentation and pretty much waste what people have been using as supposed "Best Practices" for index maintenance. And, yes... if you know me, you already know that I prove it with code.

Here's the link. Watch it to the very end even after the moderator says I'm out of time.

https://www.youtube.com/watch?v=qfQtY17bPQ4

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Steve Collins SSCrazy Points: 2251 More actions · Answer 7

vsamantha35 wrote:

So, they came up with a workaround as below.
- Introduced a new column as BIGINT column.
- Made the bigint column as NOT NULL IDENTITY(1,1) with clustered index on it
- Retained the Primary Key constraint on the GUID column.
Note: the searching is still based based on GUIDs.
Now, I want to know if there is any problem with such approach? Does it eliminate the index fragmentation? Any performance issues in this approach.

In the initial post the set up was regarding the PRIMARY KEY as well as the clustered index. The presentation addresses clustering on GUID. Does it address implications on the logical model? In general it seems the decision to implement GUIDs is (typically imo) made at the logical model phase of development. Is there a strictly physical model reason to implement GUIDs? Idk I think maybe the answer could be no.

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 8

Steve Collins wrote:

vsamantha35 wrote:
So, they came up with a workaround as below.
- Introduced a new column as BIGINT column. - Made the bigint column as NOT NULL IDENTITY(1,1) with clustered index on it - Retained the Primary Key constraint on the GUID column.
Note: the searching is still based based on GUIDs.
Now, I want to know if there is any problem with such approach? Does it eliminate the index fragmentation? Any performance issues in this approach.
In the initial post the set up was regarding the PRIMARY KEY as well as the clustered index. The presentation addresses clustering on GUID. Does it address implications on the logical model? In general it seems the decision to implement GUIDs is (typically imo) made at the logical model phase of development. Is there a strictly physical model reason to implement GUIDs? Idk I think maybe the answer could be no.

The decision to use Random GUIDs is based on a combination of both logical and physical decisions.

1. Make both the logical and physical location of the source of new keys totally unimportant, which also increases the physical speed at which new key values can be created because it doesn't need to traverse as much hardware to get from source to consumer while also virtually eliminating the possibility of key collisions when merging data from hundreds or even thousands of physical locations (which are also part of a logical location).

2. Physically and logically eliminate the "Hot Spot" formed by ever-increasing indexes at the last logical page of the index.

... and now that we know it's been the wrong kind of index maintenance, poor testing, and misinformation about Random GUIDs supposedly causing fragmentation...

3. Ironically, the use of a Random GUID key will also prevent both logical and physical fragmentation created by post-insert "ExpAnsive" updates in "Hot Spots", which is otherwise a difficult type of fragmentation to eliminate.

4. Prevent/seriously reduce the physical fragmentation associated with page splits caused by lack of free space on the physical space of logical pages anywhere in the given index.

Just remember that, by definition, even logical fragmentation is caused by a physical a physical problem and that problem is ...

"In B-tree (rowstore) indexes, fragmentation exists when indexes have pages in which the logical ordering within the index, based on the key values of the index, does not match the physical ordering of index pages.
Ref: https://docs.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-ver15

Looking back at what you quoted, I also used to think that switching Random GUIDs to some form of Integer was the right way to go. Over the last 5 years, I've learned that it's actually a really bad idea in many areas and not just concerning forms of replication. Most people do such a thing for the two purposes of reducing the size of their data and an attempt to eliminate fragmentation.

With what we know now, it actually a bit of a fool's errand because of the damage such a change can cause.

1. It doesn't nothing to help range scans of the clustered index.

2. It does seem like it reduces the size of data by a whole bunch if you only compare GUID (unique identifiedr) size to even BIGINT size but it typically saves a relatively small % if you compare the savings to the actual overall row size. It does make an impact but, percentage wise, a lot of people are disappointed, especially by the new problems such a change can cause.

3. It IS the cause of a "Hot Spot" and all the problems that such a thing brings to bear.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Steve Collins SSCrazy Points: 2251 More actions · Answer 9

Thank you Jeff. On some projects we've had logical/physical GUID issues (or idk what to call it, let's say it's the subject of ongoing analysis) which we've sort of managed around. I've been mulling making changes and your presentation data and comments seem to maybe suggest a confirmatory way forward. Specifically within the context of API security and auth/auth where information is sent "outside the firewall". The design usefulness of GUIDs seems to come from the randomness imo. For recognized reasons in certain circumstances we would like to only emit random nonsense. Encryption being a separate topic. The performance issue/topic arises when converting inbound GUID(s) back into primary key(s). A minimal representative example (of a newer/different way than currently) might be something like this

/* table containing PII */
drop table if exists dbo.customers;
go
create table dbo.customers(
  c_id          bigint identity(1, 1) constraint pk_customer_id primary key nonclustered not null,
  guid_id       uniqueidentifier not null default newid(),
  cust_name     nvarchar(64) not null,
  dob           date not null,
  created       datetime2 not null,
  edited        datetime2 not null);
create unique clustered index ndx_unq_customer_guid on dbo.customers(guid_id);
create unique index ndx_unq_customer_name on dbo.customers(cust_name);
go

/* sample row for John Doe */
insert dbo.customers(cust_name, dob, created, edited) values
(N'John Doe', '07-04-1969', sysutcdatetime(), sysutcdatetime());

/* convert id to traffic-able random guid */
declare @cust_guid_encoded             nvarchar(max);
select @cust_guid_encoded=(select cast(cast(guid_id as char(36)) as varbinary(max)) 
                           from dbo.customers
                           where cust_name='John Doe'
                           for xml path(''), binary base64);

/* we receive the guid back.  convert inbound guid to row id */
select @cust_guid_encoded encoded, c.*
from dbo.customers c
where guid_id=cast(cast(@cust_guid_encoded as xml).value('.','varbinary(max)') as varchar(max));

In a more realistic scenario the c_id would be referenced as a foreign key all over the place. On creation of a new customer the generation of the IDENTITY value is confirmed as part of the transactional process so I think we would want that to remain. Would the GUID still be suitable as a primary key if you know there are going to be many foreign key references?

This is a really interesting topic. Thank you for letting me ask questions. Any additional comments would be valued and appreciated. Thanks Jeff

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 10

To be honest and with the understanding that I have no real good grip on what your processes require, I see no serious merit in having both the IDENTITY column and the GUID column. I also see no reason why anyone would need to convert a QUID to a CHAR(36 ) for transmission purposes.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Steve Collins SSCrazy Points: 2251 More actions · Answer 11

Maybe it was incorrect to label the GUID column 'guid_id' because it was intended as a public attribute of the primary key. Maybe it would be better labelled as 'securitystamp' or 'fingerprint'. The idea being the GUID represents the pk but is not traceable (for security) and fixed width (for tokenization). To the extent the primary key is referenced by foreign key relationships the GUID would need to be replicated in other table(s), no? Afaik row keys should contain no information and be irreducible. Why leave fingerprints all over the place? If the GUID were used as a public securitystamp then replicating it to (also) maintain referential integrity would seem to be a violation of normalization rules

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 12

Call it what you want, using a GUID to "represent the PK" or vice versa makes no sense to me. Both columns are fixed width (unless you've done row or page compression). Both columns are unique. As you've defined them, neither is nullable.

I can't make any guesses because I have to assume that you have a lot more columns in your Customer table than just the few in your post. And so with that, I have to ask, why did you add the GUID column? There may be a good reason but you've not yet stated what that reason is.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Steve Collins SSCrazy Points: 2251 More actions · Answer 13

Jeff Moden wrote:

Call it what you want, using a GUID to "represent the PK" or vice versa makes no sense to me. Both columns are fixed width (unless you've done row or page compression). Both columns are unique. As you've defined them, neither is nullable.

Yes you're correct as to the db storage properties. My reference was to the width of the GUID string vs padded integer or cast(bigint as varchar(20))

Jeff Moden wrote:

I can't make any guesses because I have to assume that you have a lot more columns in your Customer table than just the few in your post. And so with that, I have to ask, why did you add the GUID column? There may be a good reason but you've not yet stated what that reason is.

The code above is a hypothetical implementation. The intended reason for the GUID was so the real pk of the table c_id never has to leave the database

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 14

Ok. That sounds like the only reason you're adding the GUID is to try to avoid fragmentation. By itself, that's probably not a good enough reason. And, again, there's no need to convert a perfectly good GUID to CHAR(36) for data transmission or whatever. They shouldn't be converting integers or dates and times to labels for transmission, either.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)