Put clustered index on 8 Column Natural Key, or on Identity Key

Question

Put clustered index on 8 Column Natural Key, or on Identity Key

BSharbo

SSC Veteran

Points: 210
More actions
August 2, 2015 at 5:37 pm

#299377

Hello everyone. I am extremely new to database design, and I ran into a problem that I know comes up often, however has many opinions... I'm not sure if there is a right answer, but I have a rather extreme case so was hoping some additional opinions could help me out!
Basically I have a table that is going to have 50+ columns. The natural key on this table is actually 8 columns wide, 4 of them being Varchar columns by default. (varchar(50)'s).
I have added an identity column, (1,1) to the table, however I put the clustered index on the 8 natural keys... My plan is to rebuild the clustered index once nightly when the system isn't in use (after 7 pm).
I know others would say it would be better to have the clustered key on the 1,1 column and then add indexes on the other 8 fields... However I don't quite understand why honestly...
Every single query against this table will use the 8 columns, and will NOT use the Identity column (1,1) because they are calls from other systems that do not know the Identity column....
Therefore if your database is set up for query speed, and every single query has to have a value for 8 columns to get a valid result, does it make sense to put a clustered index over the 8 columns?
If not why? Why is putting a clustered index on an identity column (that will literally never be used in a query) a better solution?
Thanks for your time!

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply

TomThomson SSC Guru Points: 104768 More actions · Answer 1

It depends on two things: 1. the frequency of inserts to the table, and 2. whether queries typically access several adjacent rows in the ordering you choose to place on the 8 columns.

If inserts are frequent (but perhaps not if they are extremely frequent), it's a good thing to cluster on something that increases or decreases with time, since otherwise inserts are going to cause fragmentation. However, if queries access multiple rows it's a good idea (if possible) to cluster on something that will bring together rows that will be accessed together.

Usually when confronted with something like 8 varchar(50) fields as the primary key it turns out that none of the key columns is helpful in bringing together on disc rows that are accessed together and none of the key columns is monotonically increasing or monitonically decreasing, so clustering on the primary key helps neither in avoiding fragmentaton nor in improving perfornace by bringing together on disc storage rows that are accessed together, so usually it is best not to cluster on the primary key, but on some column that will help with one or other of these two things; there may already be some column that will help if clustered on, but often there isn't and it may then be necessary to introduce a rather meaningless column (such as an identity column) that will help avoid fragmentation. But the real primary key should still be declared as the primary key (unclustered) - an identity column should certainly not be made the primary key in cases like yours when all access is by the natural key and the originators of the accesses can't reasonably be aware of the identity values.

Tom

Grant Fritchey SSC Guru Points: 399381 More actions · Answer 2

Whether or not the primary key for the table should be clustered really is down to experimentation and testing. I'm with Tom that, at least from the description provided, you're better off with that eight column unique identifier for the table. Whether or not you also make that the clustered index, that's a question of examining your execution plans to see how it gets used by the optimizer.

Since you have so many different columns, I'd be concerned about this. First off, you need to pick a very good first column for the index. The histogram in the statistics is one of the key drivers behind good index use by the optimizer. As such, the optimizer has to be under the impression that your index is useful. This is determined by looking first at the histogram to measure selectivity. Experiment with which column should be that leading edge. Probably it needs to be the most selective, but possibly not.

Also, while you technically don't need to worry about the order of the columns in the WHERE clause because they'll be rearranged by the optimizer, I would suggest getting in the habit of always putting them in the same order as the primary key, clustered or not. This will help to reduce the amount of work that the optimizer has to do.

From what you're describing, it doesn't sound like an identity is needed. The only pushback I have on this though, does this table have child tables? Because eight column foreign keys are absolutely going to start to add seriously to the design and issues on the database. In that case, I probably would suggest making an identity column in order to help facilitate joins to other tables. I still probably wouldn't make it the clustered index though, depending.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

BSharbo SSC Veteran Points: 210 More actions · Answer 3

I can understanding adding an identity key to a table to facilicate joins to other tables.

However in this instance there is only the one table.

I still probably wouldn't make it the clustered index though, depending.

I guess I'm still a bit confused with these answers... Let's say my 8 columns are... (in a transactional sampling database)

"Loan number"

"city"

"state"

"client"

"Servicer"

"Pool" (A level lower than client that makes loan number distinct)

"LoadedDate"

"SampledDate"

If I wanted to make a clustered index on these columns, the "sampledDate" by definition is ordered. Every new entry will have a sampled Date > than the last. From your comments I believe you are saying I should use this as the first level of my clustered index?

What happens if queries only need 3-4 of the columns (such as "servicer + pool + client")? In that case i'm guessing those 3 should be the next value of the clustered index, such as....

SampledDate ; servicer ; pool ; client ; OTHER FIELDS.

I believe this will lessen fragementation a lot.

let's say I had 10 tables, all of which needed to join on these 8 fields... In that case I could add an identity column to all of the tables and join on that column...

Are you saying we should just add a regular index to this identity column, and keep the clustered index on the other 8 columns?

Thanks for your time. sorry for the confusion... It seems if you ask 10 people this question you will get 10 different answers which makes learning quite confusing.

Grant Fritchey SSC Guru Points: 399381 More actions · Answer 4

BSharbo (8/3/2015)
I can understanding adding an identity key to a table to facilicate joins to other tables.
However in this instance there is only the one table.
I still probably wouldn't make it the clustered index though, depending.
I guess I'm still a bit confused with these answers... Let's say my 8 columns are... (in a transactional sampling database)
"Loan number"
"city"
"state"
"client"
"Servicer"
"Pool" (A level lower than client that makes loan number distinct)
"LoadedDate"
"SampledDate"
If I wanted to make a clustered index on these columns, the "sampledDate" by definition is ordered. Every new entry will have a sampled Date > than the last. From your comments I believe you are saying I should use this as the first level of my clustered index?

Yeah, probably. It does depend on how many records a day come in because that value is unlikely to be very selective at all if you have lots of data. But, from your description, and making guesses based on the other columns, yes.

What happens if queries only need 3-4 of the columns (such as "servicer + pool + client")? In that case i'm guessing those 3 should be the next value of the clustered index, such as....
SampledDate ; servicer ; pool ; client ; OTHER FIELDS.
I believe this will lessen fragementation a lot.

If you can guarantee that order on most queries, yes. Again, probably a good choice. If you start seeing other columns being referenced, you may need additional indexes.

let's say I had 10 tables, all of which needed to join on these 8 fields... In that case I could add an identity column to all of the tables and join on that column...
Are you saying we should just add a regular index to this identity column, and keep the clustered index on the other 8 columns?

Yes. But you need to make sure that index is unique, or that it's the primary key. Remember, you can create the clustered index on a column or columns that are not the primary key. Also, you can create a clustered index on non-unique data, but it doesn't work as well.

Thanks for your time. sorry for the confusion... It seems if you ask 10 people this question you will get 10 different answers which makes learning quite confusing.

Yes. You will. This is a combination of different people's experience, their knowledge level, their flexibility in approach, and whether or not they've tested everything they're saying.

You'll notice I'm using lots of hedge and weasel words. That's because I'm not looking at your data and structures and code, so I want to be really cautious on making recommendations without adequate knowledge. So, I can tell you how SQL Server works and what seems likely based on what you're providing. Hope that's helpful.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

BSharbo SSC Veteran Points: 210 More actions · Answer 5

It is helpful and I truly appreciate it.

Maybe a dumb question... But at the company I work for, pretty much every table has an autoincremented primary key that is clustered.... This is userful when joining together many tables (obviously) however I can't imagine it's useful for querying the data...

I was told it's almost always better to have a primary key Clustered on a random identity column, and then just add indexes to your "natural keys". They said queries against your table will use the natural keys (hence why they need indexes) put that doesnt' mean they need to be your clustered index (they note the ordering issue as you pointed out and the fragemenation that comes with it).

Is there a downside to their methodology? The main downside I can see is that your true natural key isn't your primary key anymore, so all code written against the system needs to do dup checks, etc...

Other than that issue... It seems like queries against the table using regular indexes would be slower....

Okay more to digest and think about over lunch. Thanks

Grant Fritchey SSC Guru Points: 399381 More actions · Answer 6

BSharbo (8/3/2015)
It is helpful and I truly appreciate it.
Maybe a dumb question... But at the company I work for, pretty much every table has an autoincremented primary key that is clustered.... This is userful when joining together many tables (obviously) however I can't imagine it's useful for querying the data...
I was told it's almost always better to have a primary key Clustered on a random identity column, and then just add indexes to your "natural keys". They said queries against your table will use the natural keys (hence why they need indexes) put that doesnt' mean they need to be your clustered index (they note the ordering issue as you pointed out and the fragemenation that comes with it).
Is there a downside to their methodology? The main downside I can see is that your true natural key isn't your primary key anymore, so all code written against the system needs to do dup checks, etc...
Other than that issue... It seems like queries against the table using regular indexes would be slower....
Okay more to digest and think about over lunch. Thanks

Lots of designs are done with artificial keys. It's a valid approach. I tend to use a little bit of a mix of natural and artificial keys. I don't have a good article on it, but I have an example in an older article I wrote. [/url] The primary drawback to using artificial keys is that you still have to have a constraint on the natural keys, otherwise you're likely to get very bad data. This means that inserts, updates and deletes, are guaranteed to have to do more work. It's absolutely a balancing act if you're serious about carefully designing your system. Or, you can take the purist approach and use only natural keys or only artificial keys. I'm not a fan of that method.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

ScottPletcher SSC Guru Points: 101238 More actions · Answer 7

BSharbo (8/3/2015)
I can understanding adding an identity key to a table to facilicate joins to other tables.
However in this instance there is only the one table.
I guess I'm still a bit confused with these answers... Let's say my 8 columns are... (in a transactional sampling database)
"Loan number"
"city"
"state"
"client"
"Servicer"
"Pool" (A level lower than client that makes loan number distinct)
"LoadedDate"
"SampledDate"
If I wanted to make a clustered index on these columns, the "sampledDate" by definition is ordered. Every new entry will have a sampled Date > than the last. From your comments I believe you are saying I should use this as the first level of my clustered index?
What happens if queries only need 3-4 of the columns (such as "servicer + pool + client")? In that case i'm guessing those 3 should be the next value of the clustered index, such as....
SampledDate ; servicer ; pool ; client ; OTHER FIELDS.
I believe this will lessen fragementation a lot.

You want the lead column(s) in the clustered index to be those that are always (or almost always) specified in the query. From among those always-used columns, pick the one that has the fewest unique values as the lead column.

Once you determine that, you can decide which of 3 main clustering options is best:

1) an identity column,

2) natural key columns,

3) a hybrid solution, where you first insert the natural keys into a "key store" table to generate a single-int clustering key value, and use that as the key to the main table. That seems unlikely to be best in this case, but you never know I guess.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Rich Mechaber SSChampion Points: 11131 More actions · Answer 8

Setting aside the question of clustering key for the moment, which others here have already taken up, I would ask whether your assumption of having an 8-column natural key is, in fact, correct. In addition, you stated that you have 50 columns in the table and, further on, that there is only the one table in your database.

Without knowing more about the real-world data you're trying to model in your design it's impossible to say, but that sounds to me like you've probably violated basic design practice. Not that an 8-column key is impossible or that a 50-column table is wrong, but you yourself say that you are a beginner at database design, and this is an area that it's common for beginners to walk astray.

In the example you gave about loan numbers/providers/clients, it looks like the three columns of loan number, client, and pool -- which you yourself describe as "A level lower than client that makes loan number distinct" -- are sufficient to identify a unique row in the table and that everything else depends on them. In other words, they constitute a natural primary key.

Things like LoadedDate are facts about the row but are not necessary to making the row unique (and therefore should not be part of a primary key).

Likewise, if city and state are aspects of the client, then city and state probably belong in a Clients table, with a foreign key to your Loans table. If city and state are, say, the location of the asset for which the loan was made, then you might want to consider an Assets table, where you can add multiple columns about the asset (type: car, home, boat, etc.; city; state; etc.).

If you can post an actual list of column names and describe what you're actually trying to model, you might get a lot more useful information back about different ways you might get to set this up.

Fortunately, you started at the beginning: how best (and "best" can be different things) to set up a database. It's a lot easier to create a better database if it holds no data yet!

Rich

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 9

ScottPletcher (8/3/2015)
BSharbo (8/3/2015)
I can understanding adding an identity key to a table to facilicate joins to other tables.
However in this instance there is only the one table.
I guess I'm still a bit confused with these answers... Let's say my 8 columns are... (in a transactional sampling database)
"Loan number"
"city"
"state"
"client"
"Servicer"
"Pool" (A level lower than client that makes loan number distinct)
"LoadedDate"
"SampledDate"
If I wanted to make a clustered index on these columns, the "sampledDate" by definition is ordered. Every new entry will have a sampled Date > than the last. From your comments I believe you are saying I should use this as the first level of my clustered index?
What happens if queries only need 3-4 of the columns (such as "servicer + pool + client")? In that case i'm guessing those 3 should be the next value of the clustered index, such as....
SampledDate ; servicer ; pool ; client ; OTHER FIELDS.
I believe this will lessen fragementation a lot.
You want the lead column(s) in the clustered index to be those that are always (or almost always) specified in the query. From among those always-used columns, pick the one that has the fewest unique values as the lead column.
Once you determine that, you can decide which of 3 main clustering options is best:
1) an identity column,
2) natural key columns,
3) a hybrid solution, where you first insert the natural keys into a "key store" table to generate a single-int clustering key value, and use that as the key to the main table. That seems unlikely to be best in this case, but you never know I guess.

"specified" really isn't precise enough in this scenario. a column that just appears in the data but isn't used in sorting or filtering is NOT a good choice for leading edge of the CI: those should really just go in the included columns. Assuming there isn't any substantial difference in fragmentation, I order by what is used most in filtering (WHERE or JOIN clauses), which most selective to the front, followed by items most used for ordering (group by /order by). That's usually the starting point, with some experimentation to see if other arrangements might be marginally better.

Of course it is rare that ordering by the "perfect retrieval strategy" doesn't have a sizeable impact on storage and fragmentation, but that's part of the tradeoff analysis.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

ScottPletcher SSC Guru Points: 101238 More actions · Answer 10

Matt Miller (#4) (8/4/2015)
ScottPletcher (8/3/2015)
BSharbo (8/3/2015)
I can understanding adding an identity key to a table to facilicate joins to other tables.
However in this instance there is only the one table.
I guess I'm still a bit confused with these answers... Let's say my 8 columns are... (in a transactional sampling database)
"Loan number"
"city"
"state"
"client"
"Servicer"
"Pool" (A level lower than client that makes loan number distinct)
"LoadedDate"
"SampledDate"
If I wanted to make a clustered index on these columns, the "sampledDate" by definition is ordered. Every new entry will have a sampled Date > than the last. From your comments I believe you are saying I should use this as the first level of my clustered index?
What happens if queries only need 3-4 of the columns (such as "servicer + pool + client")? In that case i'm guessing those 3 should be the next value of the clustered index, such as....
SampledDate ; servicer ; pool ; client ; OTHER FIELDS.
I believe this will lessen fragementation a lot.
You want the lead column(s) in the clustered index to be those that are always (or almost always) specified in the query. From among those always-used columns, pick the one that has the fewest unique values as the lead column.
Once you determine that, you can decide which of 3 main clustering options is best:
1) an identity column,
2) natural key columns,
3) a hybrid solution, where you first insert the natural keys into a "key store" table to generate a single-int clustering key value, and use that as the key to the main table. That seems unlikely to be best in this case, but you never know I guess.
"specified" really isn't precise enough in this scenario. a column that just appears in the data but isn't used in sorting or filtering is NOT a good choice for leading edge of the CI: those should really just go in the included columns. Assuming there isn't any substantial difference in fragmentation, I order by what is used most in filtering (WHERE or JOIN clauses), which most selective to the front, followed by items most used for ordering (group by /order by). That's usually the starting point, with some experimentation to see if other arrangements might be marginally better.
Of course it is rare that ordering by the "perfect retrieval strategy" doesn't have a sizeable impact on storage and fragmentation, but that's part of the tradeoff analysis.

True; by "specified", I meant in the WHERE clause, as a row specification. What is just in the SELECT clause obviously doesn't need to be a clustering key. What's used in order by is nebulous at best too, definitely test it to verify that it does eliminate/reduce a sort before you add it to the clustering key.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Kim Crosser SSCommitted Points: 1763 More actions · Answer 11

rmechaber (8/4/2015)
Setting aside the question of clustering key for the moment, which others here have already taken up, I would ask whether your assumption of having an 8-column natural key is, in fact, correct. In addition, you stated that you have 50 columns in the table and, further on, that there is only the one table in your database.

I concur with "rmechaber". A 50-column table and/or an 8-column index structure would raise a lot of red flags for me. While there may be unusual cases where these would apply, I would strongly urge looking at reorganizing the data into multiple tables that may better reflect the actual data relationships.

If all your records always have the same values in City and State for a given account or customer, then you should probably have a separate table for Account or Customer with the City/State information contained in that table, keyed appropriately by Account or Customer.

Quick check (or maybe not so quick) - of your eight columns of index data, are there any sets of those columns where if one column has a particular value, the other column(s) always contain specific values. For example:

select max(tmp2.howmany)

from (

select tmp1.Col1, count(*) as howmany

from (

select mytable.mycol1, mytable.mycol2 -- put your real table.column names here

from mytable

) tmp1

group by tmp1.Col1

) tmp2

If you run this for any pair of your 8 columns (you would need to swap around the orders) and it returns "1", then you probably have a redundant relationship that should be moved to a second (or nth) table. This isn't particularly elegant, but is simple to try.

If every pair returned a number greater than one, then you really may have a valid structure that needs 8 columns of indexes...

codykonior SSCommitted Points: 1591 More actions · Answer 12

This question is as much as holy war as source code formatting 😀 But as long as you're happy with what you've done nobody is likely to complain.

1) When the real "unique" identifier has to change, it only has to be done in one spot because the identity columns can be retained as-is. e.g. you're not deleting and re-inserting records in in every table that references it, which can be nightmarishly convoluted.

2) Part of it comes down to performance. Joins on tables with long char columns are less performant than those on an int, and when you're holding auditing data with hundreds of millions of rows that can be important. There may also be performance implications when you're constantly inserting rows in their "logical" order and everything has to be shuffled around in the background. Your nightly rebuild window is, frankly, a true luxury.

But the rule of thumb is that if you want to anger Celko then the very least you can do is create a clustered index, or unique constraint, on the real primary key.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 13

ScottPletcher (8/4/2015)
Matt Miller (#4) (8/4/2015)
ScottPletcher (8/3/2015)
BSharbo (8/3/2015)
I can understanding adding an identity key to a table to facilicate joins to other tables.
However in this instance there is only the one table.
I guess I'm still a bit confused with these answers... Let's say my 8 columns are... (in a transactional sampling database)
"Loan number"
"city"
"state"
"client"
"Servicer"
"Pool" (A level lower than client that makes loan number distinct)
"LoadedDate"
"SampledDate"
If I wanted to make a clustered index on these columns, the "sampledDate" by definition is ordered. Every new entry will have a sampled Date > than the last. From your comments I believe you are saying I should use this as the first level of my clustered index?
What happens if queries only need 3-4 of the columns (such as "servicer + pool + client")? In that case i'm guessing those 3 should be the next value of the clustered index, such as....
SampledDate ; servicer ; pool ; client ; OTHER FIELDS.
I believe this will lessen fragementation a lot.
You want the lead column(s) in the clustered index to be those that are always (or almost always) specified in the query. From among those always-used columns, pick the one that has the fewest unique values as the lead column.
Once you determine that, you can decide which of 3 main clustering options is best:
1) an identity column,
2) natural key columns,
3) a hybrid solution, where you first insert the natural keys into a "key store" table to generate a single-int clustering key value, and use that as the key to the main table. That seems unlikely to be best in this case, but you never know I guess.
"specified" really isn't precise enough in this scenario. a column that just appears in the data but isn't used in sorting or filtering is NOT a good choice for leading edge of the CI: those should really just go in the included columns. Assuming there isn't any substantial difference in fragmentation, I order by what is used most in filtering (WHERE or JOIN clauses), which most selective to the front, followed by items most used for ordering (group by /order by). That's usually the starting point, with some experimentation to see if other arrangements might be marginally better.
Of course it is rare that ordering by the "perfect retrieval strategy" doesn't have a sizeable impact on storage and fragmentation, but that's part of the tradeoff analysis.
True; by "specified", I meant in the WHERE clause, as a row specification. What is just in the SELECT clause obviously doesn't need to be a clustering key. What's used in order by is nebulous at best too, definitely test it to verify that it does eliminate/reduce a sort before you add it to the clustering key.

Yes - agreed.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

BSharbo SSC Veteran Points: 210 More actions · Answer 14

rmechaber (8/4/2015)
Setting aside the question of clustering key for the moment, which others here have already taken up, I would ask whether your assumption of having an 8-column natural key is, in fact, correct. In addition, you stated that you have 50 columns in the table and, further on, that there is only the one table in your database.
Without knowing more about the real-world data you're trying to model in your design it's impossible to say, but that sounds to me like you've probably violated basic design practice. Not that an 8-column key is impossible or that a 50-column table is wrong, but you yourself say that you are a beginner at database design, and this is an area that it's common for beginners to walk astray.
In the example you gave about loan numbers/providers/clients, it looks like the three columns of loan number, client, and pool -- which you yourself describe as "A level lower than client that makes loan number distinct" -- are sufficient to identify a unique row in the table and that everything else depends on them. In other words, they constitute a natural primary key.
Things like LoadedDate are facts about the row but are not necessary to making the row unique (and therefore should not be part of a primary key).
Likewise, if city and state are aspects of the client, then city and state probably belong in a Clients table, with a foreign key to your Loans table. If city and state are, say, the location of the asset for which the loan was made, then you might want to consider an Assets table, where you can add multiple columns about the asset (type: car, home, boat, etc.; city; state; etc.).
If you can post an actual list of column names and describe what you're actually trying to model, you might get a lot more useful information back about different ways you might get to set this up.
Fortunately, you started at the beginning: how best (and "best" can be different things) to set up a database. It's a lot easier to create a better database if it holds no data yet!
Rich

I realized my original message came off as likely offensive and knee jerk so I want to edit it!

Hey Rich. I appreciate the thought and your help, but the 8 columns ARE necessarily the key to the table. I do concur that I could easily split the table up into multiple tables, however I don't see the benefit honestly.

In my example above I made up 8 columns and did a bad job of explaining why they are the key, so please see my actual example below.

My actual data Example:

What is happening is a client is supplying a daily file with 50+ columns. (Lets say it's 200 + columns, as this has happened to me before). The key to each record is is "loan number" "module" "data month" "Review week" "loan servicer" "client Name" "pool" "Loaded Date".

note that Loaded Date does have to be part of the natural key, because if the client sends the same loan the next day, we have to review it again (strange I know). Therefore the exact same record for the 7 other keys can be in the database with two loaded Dates.

I have been doing quite a bit of testing, and I tried the following two methodolgies..

Table has 13,046,000 + records.

Method #1:

Put a primary key clustered Index on the 8 key columns. Run a query that finds the "most recent laon load per Specific Key by loadedDate". The query is given below.

Population Updater is simply a table that has a record for each of the

here is the actual query:

SELECT lp.*

FROM loanlevelpopulation_raw lp

JOIN (SELECT pp1.LoanNumber,pp1.module , pp1.datamonth, pp1.loanservicer, pp1.Clientname, MAX(loadeddate) AS Loadeddate1 FROM loanlevelpopulation_raw pp1

GROUP BY pp1.LoanNumber,pp1.module,pp1.datamonth,pp1.loanservicer,pp1.Clientname) rw

ON rw.MODULE=lp.MODULE

AND rw.Loadeddate1=lp.loadeddate AND rw.datamonth=lp.datamonth AND rw.loanservicer=lp.loanservicer AND rw.LoanNumber=lp.LoanNumber

AND rw.clientname=lp.clientname

QUery takes 1 minute 40 seconds.

Method #2:

Put a primary key NONclustered Index on the 8 key columns. Add an auto-incremented column which then has a clustered index on it. Run the exact same query as above.

QUery took over 14 minutes before I killed it.

It seems like for my specific needs, method #1 is working over 10 times faster, even though it is a method which every Database designer would disagree with... And now I am slightly more confused haha 🙂

Interestingly the query plan for Method 2 uses an Index Scan and the plan for method 1 uses an Index seek, which seems to be really the only major difference.

EDIT: I'm sorry I did not mean to generalize. I meant to say that all of the DBA / Developers that I have talked too thus far are of the opinion that an auto-incremented Key clustered is the way to go, instead of clustered the actual natural keys. I am glad to see that some people are not so gun ho to "always use an auto incremented clustered key on a table".