DWH : Fact table : index creation.

Question

DWH : Fact table : index creation.

Bhuvnesh

SSC Guru

Points: 59351
More actions
February 15, 2013 at 3:33 am

#394192

Below is the schema of my FACT table
table : Fact.POLICY
Size : 165 GB
Policy_transactional_ID (surrogate key )
Policy_ID
Audit_Trail_ID (FK)
Journalization_ID (FK)
Program_ID (FK)
Product_ID (FK)
Producer_ID (FK)
Surplus_Line_ID (FK)
Policy_ID (FK)
Account Number
Acquisition Costs
monthly cost
total_gain_producer_amt
Amount
Attachment Point
Base Amount
'
'
'
'
'
'
starttime
Endtime
approach 1 :
Policy_transactional_ID identity , not in use of any query , should i make it PK with non clu index
clus index : policy_id + starttime
and other non clus indexes on ( every ID column + starttime )
approach 2 :
Policy_transactional_ID : PK + clus index
and other non clus indexes on ( every ID column + starttime )
people say we should go with case 1 , but why to aligned/sort the data physical on column which is not being used in queries. my take would be case 2
please guide
-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

Viewing 13 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

Jason-299789 SSC-Insane Points: 21601 More actions · Answer 1

It depends, what is the ETL process, is it a snapshot Fact (as is) or incremental with Insert new update existing, or Contra correction logic as the decision will have a big impact on which route you go down.

_________________________________________________________________________
SSC Guide to Posting and Best Practices

Andy Hyslop SSCrazy Eights Points: 9198 More actions · Answer 2

Jason-299789 (2/15/2013)
It depends, what is the ETL process, is it a snapshot Fact (as is) or incremental with Insert new update existing, or Contra correction logic as the decision will have a big impact on which route you go down.

Agreed +1

Putting the clustered index on the startdate column could speed up your select statements providing they use startdate but will also slow down your inserts. In addition I would look at the queries that are hitting this table and think do you really need all those non clustered indexes if yes then fair enough but worth doing the check just to see? 🙂

If you are doing a nightly load rather than an incremental you could test your ETL's by dropping all of your indexes and re-creating and determining the impact so see which is more efficient

Cheers

Andy

==========================================================================================================================
A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe

Bhuvnesh SSC Guru Points: 59351 More actions · Answer 3

Jason-299789 (2/15/2013)
It depends, what is the ETL process, is it a snapshot Fact (as is) or incremental with Insert new update existing,

it will be "incremental with Insert new update existing".

And how much indentity column as surrogate key will help the other non clus index although SK will not be part of any query.

-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

Bhuvnesh SSC Guru Points: 59351 More actions · Answer 4

Andy Hyslop (2/15/2013)

could test your ETL's by dropping all of your indexes and re-creating and determining the impact so see which is more efficient

Does it cost intensive if we dropp the indexes and rebuild them every time we upload the data ?

-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

Andy Hyslop SSCrazy Eights Points: 9198 More actions · Answer 5

Bhuvnesh (2/18/2013)

Andy Hyslop (2/15/2013)
could test your ETL's by dropping all of your indexes and re-creating and determining the impact so see which is more efficient
Does it cost intensive if we dropp the indexes and rebuild them every time we upload the data ?

Sorry for this answer, but it does depend on the volume of data..

I assume you were referring to the nightly loads (as dropping indexes for an incremental will carry more overhead than its worth)..

For a nightly load yes dropping and re-creating the indexes will have an impact but you will need to try and assess both strategies and see which performs better i.e. perform two loads one with indexes and one dropping and re-creating the indexes and test..

In my experience dropping and recreating generally works faster and has the added benefit of generating a fresh index..

Andy

==========================================================================================================================
A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe

Jason-299789 SSC-Insane Points: 21601 More actions · Answer 6

Andy Hyslop (2/18/2013)

Bhuvnesh (2/18/2013)
Andy Hyslop (2/15/2013)
could test your ETL's by dropping all of your indexes and re-creating and determining the impact so see which is more efficient
Does it cost intensive if we dropp the indexes and rebuild them every time we upload the data ?
Sorry for this answer, but it does depend on the volume of data..
I assume you were referring to the nightly loads (as dropping indexes for an incremental will carry more overhead than its worth)..
For a nightly load yes dropping and re-creating the indexes will have an impact but you will need to try and assess both strategies and see which performs better i.e. perform two loads one with indexes and one dropping and re-creating the indexes and test..
In my experience dropping and recreating generally works faster and has the added benefit of generating a fresh index..
Andy

+1, but would add that there are only two indexes I wouldn't drop, the first is the clustered index, the other is the covering Business key index for the ETL process, especially on tables that are substantial.

But would agree its very much a suck it and see which works best for you, try it without the drop indexes then try the same load with drop indexes, see which one gives you the best solution.

_________________________________________________________________________
SSC Guide to Posting and Best Practices

PaulB-TheOneAndOnly SSC Guru Points: 66772 More actions · Answer 7

Bhuvnesh (2/18/2013)

Andy Hyslop (2/15/2013)
could test your ETL's by dropping all of your indexes and re-creating and determining the impact so see which is more efficient
Does it cost intensive if we dropp the indexes and rebuild them every time we upload the data ?

Dropping the indexes during ETL would mean misserable performance for users querying the table, also adding overhead wher building index back - former means potential performance impact for other queries.

In regards to indexing strategy, here are my two cents.

Clustered unique index supporting your PK

Non-clustered, non-unique indexes on every FK

Other non-clustered, non-unique indexes depending on actual queries.

Hope this helps.

_____________________________________
Pablo (Paul) Berzukov

Author of Understanding Database Administration available at Amazon and other bookstores.

Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

ScottPletcher SSC Guru Points: 101234 More actions · Answer 8

Under no circumstances should you cluster this table on Policy_transactional_ID.

Cluster the table by the most common WHERE conditions; make the clustered key unique by including the Policy_transactional_ID on the end of the clus key but only if necessary.

[You can add a separate unique nonclustered PK if you absolutely need it, although I don't think you should need it in this type of situation.]

Besides, the date added should be ascending anyway, other than corrections. Sort the new data being added into the clustering key sequence.

Keep in mind: a given row is INSERTed only once, but SELECTed thousands (or millions!) of times.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Bhuvnesh SSC Guru Points: 59351 More actions · Answer 9

ScottPletcher (2/18/2013)
Under no circumstances should you cluster this table on Policy_transactional_ID.
Cluster the table by the most common WHERE conditions; make the clustered key unique by including the Policy_transactional_ID on the end of the clus key but only if necessary.

Lets take an scenario,

i have PK on Policy_transactional_ID and most of the queries are using starttime and policy_number in WHERE clause

then which option will work better for clustered index.

approach 1 : starttime + policy number (sql server itself add RID to make the records unique)

approach 2 : srarttime + policy number + Policy_transactional_ID.

and which apprach will have more overhead from space and resource perspective.

-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

Jason-299789 SSC-Insane Points: 21601 More actions · Answer 10

Scott,

I would disagree with you on the under no circumstances comment, for a couple of reasons

1) the risk of fragmentation is significant on the clustered index, especially if it uses an alphanumeric.

2) On DW's you don't run targeted queries, such as show me Policy X, you run Show me all Policies written in the last timeframe.

_________________________________________________________________________
SSC Guide to Posting and Best Practices

ScottPletcher SSC Guru Points: 101234 More actions · Answer 11

Bhuvnesh (2/18/2013)
ScottPletcher (2/18/2013)
Under no circumstances should you cluster this table on Policy_transactional_ID.
Cluster the table by the most common WHERE conditions; make the clustered key unique by including the Policy_transactional_ID on the end of the clus key but only if necessary.
Lets take an scenario,
i have PK on Policy_transactional_ID and most of the queries are using starttime and policy_number in WHERE clause
then which option will work better for clustered index.
approach 1 : starttime + policy number (sql server itself add RID to make the records unique)
approach 2 : srarttime + policy number + Policy_transactional_ID.
and which apprach will have more overhead from space and resource perspective.

I would think that starttime + policy number would inherently be unique. If it is unique, just define the index as UNIQUE yourself and SQL won't add a "uniquifier".

If it's not unique, either 1 or 2 will be OK; most people would probably prefer 2 just because it's clearer what is happening, since you can explicitly specify UNIQUE in the index.

The combined clustered index will take slightly more diskspace for the non-leaf clustered index levels on ( starttime, policy_number [, policy_transactional_id] ) than on just ( policy_transactional_id ). However, if the proper clustered index prevents you from having to create a second, nonclustered index on ( starttime, policy_number ), then you've saved disk space anyway, but much more importantly, you've saved additional I/O every time a row is deleted, inserted or updated.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

ScottPletcher SSC Guru Points: 101234 More actions · Answer 12

Jason-299789 (2/19/2013)
Scott,
I would disagree with you on the under no circumstances comment, for a couple of reasons
1) the risk of fragmentation is significant on the clustered index, especially if it uses an alphanumeric.
2) On DW's you don't run targeted queries, such as show me Policy X, you run Show me all Policies written in the last timeframe.

You're simply wrong.

"Show me all Policies written in the last timeframe"

Exactly! So the table should be clustered by timeframe in that case, not some dopey meaningless identity column.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".