Clustered index on non-primary key columns

Question

Clustered index on non-primary key columns

Viewing 14 posts - 16 through 28 (of 28 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 1

Phil Parkin (7/8/2016)
GilaMonster (7/8/2016)
Phil Parkin (7/8/2016)
SQL!$@w$0ME (7/8/2016)
Does the clustered key column(s) has to be NOT NULL/UNIQUE as best practice?
Yes. Otherwise SQL Server has to create its own 'uniquifier', to make the rows unique. Read more about it here[/url].
To be honest, non-unique is the one of the guidelines that I'm happiest to break, as long as the clustered index is still fairly unique (not 4 values in a 100 million row table). The uniquifier's only a 4-byte int and it's null for the first occurrence of any value
Just to be clear: does that mean that you are suggesting that uniqueness in the clustered key column is not a best practice?

Um, did I say that?

There's a big difference between "non-unique is the one of the guidelines that I'm happiest to break, as long as the clustered index is still fairly unique" and "uniqueness in the clustered key column is not a best practice"

The guidelines around clustered index design are just that. Guidelines, not rules written in stone.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Phil Parkin SSC Guru Points: 246979 More actions · Answer 2

GilaMonster (7/8/2016)
Phil Parkin (7/8/2016)
GilaMonster (7/8/2016)
Phil Parkin (7/8/2016)
SQL!$@w$0ME (7/8/2016)
Does the clustered key column(s) has to be NOT NULL/UNIQUE as best practice?
Yes. Otherwise SQL Server has to create its own 'uniquifier', to make the rows unique. Read more about it here[/url].
To be honest, non-unique is the one of the guidelines that I'm happiest to break, as long as the clustered index is still fairly unique (not 4 values in a 100 million row table). The uniquifier's only a 4-byte int and it's null for the first occurrence of any value
Just to be clear: does that mean that you are suggesting that uniqueness in the clustered key column is not a best practice?
Um, did I say that?
--snip

Obviously, you did not say it explicitly. But you did respond negatively to my response, which prompted my request for clarification. That's all.

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

SQL!$@w$0ME SSChampion Points: 12349 More actions · Answer 3

Please see the table structure:

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

SET ANSI_PADDING ON

GO

CREATE TABLE [dbo].[Test](

[Id] [int] IDENTITY(1,1) NOT NULL,

[Col1] [varchar](256) NOT NULL,

[Col2] [varchar](256) NOT NULL,

[Col3] [varchar](256) NOT NULL,

[Col4] [int] NOT NULL,

[ColumnValue] [varchar](max) NULL,

[ColumnValueDate] [datetime] NULL,

[ColumnValueInt] [int] NULL,

[CreatedAt] [datetime] NOT NULL,

[U_Id] [int] NULL,

CONSTRAINT [PK_Test] PRIMARY KEY NONCLUSTERED

(

[Id] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

)

GO

SET ANSI_PADDING OFF

GO

ALTER TABLE [dbo].[Test] ADD DEFAULT (getdate()) FOR [CreatedAt]

GO

CREATE CLUSTERED INDEX [ClusteredIndex-20160706] ON [dbo].[Test]

(

[Col1] ASC,

[Col2] ASC,

[Col3] ASC,

[Col4] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

GO

Grant Fritchey SSC Guru Points: 398909 More actions · Answer 4

The core guideline that the best clustered index is a unique, not null, monotonically increasing integer column (basically, identity), is true. That said, it doesn't rule my life. For example, see this old article[/url] about how we combined identity and logical constraints to create primary keys in a mixed design where some tables were using identity columns for their clustered index, and others were not.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 5

Assuming that's the real table design, stop worrying about the clustered index and start worrying about how very badly that EAV design performs in general, and consider either a relational design, or a noSQL database if the data really is schema-less.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Grant Fritchey SSC Guru Points: 398909 More actions · Answer 6

So the majority of your queries, or the most frequently called queries use all these columns when filtering:

[Col1] ASC,

[Col2] ASC,

[Col3] ASC,

[Col4] ASC

That's a very wide key, three 256 varchars. To be sure it worked well, I'd need to test. However, these are not what I would normally choose as good columns for the keys of any index, let alone the cluster. Although, testing is the answer, not my opinion.

However, you still need to evaluate further. It's not just identity = good/bad or varchar = good/bad. It's about the length, number, queries, selectivity of the data, level of updates, etc.

Also, you don't have a logical identifier on that table. What uniquely identifies that data for the business?

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

SQL!$@w$0ME SSChampion Points: 12349 More actions · Answer 7

Table/columns are renamed due to security reasons.

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 8

We can go back-and-forth with he-said-she-said's for weeks and not cover all the permutations or pros and cons of indexing. That's specifically why I didn't even start down that road. 🙂

I have spent more than 100 man hours at at least 20 different clients over the years in my consulting work (probably 300+ at two of them) to develop a proper indexing strategy for their application(s). And that is the proper term for it too because you have to take the entire system as a whole into account to become fully optimized.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

SQL!$@w$0ME SSChampion Points: 12349 More actions · Answer 9

Grant Fritchey (7/8/2016)
So the majority of your queries, or the most frequently called queries use all these columns when filtering:

[Col1] ASC,

[Col2] ASC,

[Col3] ASC,

[Col4] ASC

Majority of the queries use all these columns when filtering. There will be frequent deletes and inserts on these tables.

Please see the part of Stored Procedure.

delete from dbo.Test

where Col1= @Col1 and Col2= @Col2 and Col3 = @Col3 and Col4 = @Col4

insert into dbo.Test

(Col1, Col2, Col3, Col4, ColumnValueVarChar, ColumnValueDate, ColumnValueInt, U_Id)

values (@Col1, @Col2, @Col3, @Col4, @ColumnValueVarChar, @ColumnValueDate, @ColumnValueInt, @U_Id)

Thanks!

Grant Fritchey SSC Guru Points: 398909 More actions · Answer 10

SQL!$@w$0ME (7/8/2016)
Grant Fritchey (7/8/2016)
So the majority of your queries, or the most frequently called queries use all these columns when filtering:
[Col1] ASC,
[Col2] ASC,
[Col3] ASC,
[Col4] ASC
Majority of the queries use all these columns when filtering. There will be frequent deletes and inserts on these tables.
Please see the part of Stored Procedure.
delete from dbo.Test
where Col1= @Col1 and Col2= @Col2 and Col3 = @Col3 and Col4 = @Col4
insert into dbo.TrackColumnChanges
(Col1, Col2, Col3, Col4, ColumnValueVarChar, ColumnValueDate, ColumnValueInt, U_Id)
values (@Col1, @Col2, @Col3, @Col4, @ColumnValueVarChar, @ColumnValueDate, @ColumnValueInt, @U_Id)
Thanks!

Lots of inserts makes the four columns a very poor clustered index. The original concern that Phil raised with me will be an issue here.

The deletes are concerning too. However, if the deletes are based on the four columns, the read aspect of the delete operation will be assisted by having an index (clustered or non-clustered) on that column.

Again, the width of the columns here is concerning. I'd want to test this to be certain. Testing will have to take into account the other tables associated with this one.

However, again, as before, I can explain a bunch of the trade-offs, the good & bad, but I can't come right out and tell you that for this table you MUST do this for the clustered index. It's really down to too many factors that I can't account for remotely. You'll have to assess what's going to work out best for the system because you're there.

Based on everything you've said, I'd be inclined away from using these four columns as the clustered index key, however, testing is what would determine the answer.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

SQL!$@w$0ME SSChampion Points: 12349 More actions · Answer 11

SQL!$@w$0ME

SSChampion

Points: 12349

July 8, 2016 at 11:23 am

#1889233

Thanks a lot.

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 12

SQL!$@w$0ME (7/8/2016)
Grant Fritchey (7/8/2016)
So the majority of your queries, or the most frequently called queries use all these columns when filtering:
[Col1] ASC,
[Col2] ASC,
[Col3] ASC,
[Col4] ASC
Majority of the queries use all these columns when filtering. There will be frequent deletes and inserts on these tables.
Please see the part of Stored Procedure.
delete from dbo.Test
where Col1= @Col1 and Col2= @Col2 and Col3 = @Col3 and Col4 = @Col4
insert into dbo.Test
(Col1, Col2, Col3, Col4, ColumnValueVarChar, ColumnValueDate, ColumnValueInt, U_Id)
values (@Col1, @Col2, @Col3, @Col4, @ColumnValueVarChar, @ColumnValueDate, @ColumnValueInt, @U_Id)
Thanks!

Unless I needed an exceptionally-fast covered-index lookup scenario I would probably not consider doing all 4 of those in one index (with at least one exception I think) - clustered or nonclustered. If one of those columns is reasonably sized and very specific I would probably just use that one to keep the index size down. Additional lookups for the few rows that did not meet the remainder of the predicate would be acceptable in most situations I bet. If you need more than one column to get reasonable specificity possibly pick the order depending on whether one or the other is also used in many other filters by itself to increase seek potential.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Eric M Russell SSC Guru Points: 125562 More actions · Answer 13

SQL!$@w$0ME (7/7/2016)
Hello Folks,
Is it a good practice to create clustered index on non-primary key columns since the primary key column is not queried at all. Please share your thoughts or best practices on this.
Many thanks!

If you have a large Customer table where CustomerID is the primary key, but the most common query is by CustomerName, then my first cut would be to cluster CustomerID and have a non-clustered index on CustomerName while also including any other handful of columns needed to cover the query.

But post the DDL for your existing table and indexes so we know for sure what the situation is.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

xsevensinzx One Orange Chip Points: 25560 More actions · Answer 14

Another good example of non-primary key clustered indexes is in data warehousing scenarios. This is where it can be common to cluster on time rather than the primary key. That's because time is the most common filter for the data in most scenarios depending on how the data is being queried and used.

Another one for me was an alphanumeric identifier that was extremely long and commonly used. Having a non-clustered index on that field caused a lot of fragmentation and increased insert times dramatically. With large volumes, it could take hours upon hours for insterts and updates. While it's a great way to increase performance to non-cluster, it was not maintainable. Therefore another approach would be to convert the identifier to something more usable and non-cluster on the converted output instead. This an be applied to cluster index fields that are ideal candidates, but hard to manage due to the type of data.