Is it a good practice for fact tables to have a PK?

  • In a data warehouse, is it a good practice for fact tables to have a actually primary key added as a composite of its FK's.

    As I understand, using a PK can slow down the ETL load process and I can use my ETL tool (informatica) to define a logical key when there’s a need to update or insert the fact table - this support my ETL process, doesn't allow duplicates all without having to have a physical primary on the fact table.

    Since the clustered index doesn't need to be on a primary key I don't believe there are any performance issues on the OLAP side, but are they any benefits I'm missing out on?

    Any insight would be much appreciated.

  • Really depends on your data and how you are controlling the load. You might need a surrogate key on the fact for ETL purposes for a loading point-of-view or for updates. Here are a few references that you can take a look at that might help you to decide on what you will need:

    Design Tip #81 Fact Table Surrogate Key

    Design Tip #84 Reader Suggestions for Fact Table Surrogate Keys

    Composite Primary Key in fact table[/url]

    ----------------------------------------------------------------------------------------
    Dan English - http://denglishbi.wordpress.com

  • A primary key on a combination of all of the foreign keys in the table is probably inappropriate. Since it may be necessary for you to add reversal records and possibly after adding a reversal adding the record in again, a combination of all of the other keys in the table would be reasonably likely to need to be violated.

    I recommend a primary key on a fact table, but I typically recommend an identity column being the primary key. This is really useful in adding new records for an incremental load into a cube. You can then easily track the last loaded record and load the cube with records having an Id higher. Another option is to have a batch number and process by this batch number, but having an identity column to use for a primary key is convenient. Regardless, it is very important in any table to have a way to uniquely identify each record. So, enforcing this uniqueness with a primary key constraint is usually a good idea.

    It is important that any clustered index you use (which does not need to be the primary key) is in the order data is going to be added into the fact table. If your clustered index is not in the order records are being added, it will continuously fragment and you will be forced to reindex the fact table regularly. If you have a cuple hundred million records in the table, this can become problematic.

  • No! Primary keys don't belong in fact tables. They are practically useless since indexing them does not add any benefit and will add unwanted size to your fact table. Instead, use composite keys.. a unique combination of foreign keys from your dimension tables.

  • There's two topics under discussion here, and they deserve to be pulled apart:

    Surrogate keys vs business keys.

    The case for surrogate keys occurs when you have large business keys and billions or trillions of rows in a data table, and determine that you need to conserve space in the data table. However, it needs to be balanced against the complexity of doing data updates and easily identify (and managing) this records that need to be maintained. For example, careful analysis may determine that you can save 25% of the space in a terabyte table by switching to surrogate keys, but at what cost?

    For example, a business key may be a date time field, which could easily be replaced by a surrogate key, resulting in substantial space savings. However, it's often useful to partition large tables by time, so that old information can simply be swapped out as new information come in. (I've also swapped out new information, when it was determined that bad data had been loaded.) You'll need tight control over your surrogate keys, to know that you have a reliable partitioning strategy.

    Primary key or no Primary key.

    I've never been a fan of database designs that fail to define a primary key. It's important to understand that primary key definitions are simply part of the base data. A detailed discussion on space utilization is beyond this response.

    What should be a bigger consideration, is the clustering sequence for the data table. As in all cases, the best clustering sequence should be the one that addresses the most demanding queries (either in terms of frequency or query time).

    Assuming your data warehouse is built for a product like MSAS, the best clustering sequence is based on your cube partitions, even if you have to artificially add a column or two to the data. Remember, OLAP is about data retrieval, not business rules.

    Finally, regarding the level of detail to keep in a data warehouse, it's true that you want to keep data at the lowest level that is likely to be used. That means you still need to make some design decisions. Too low a level, and performance suffers for no appreciable benefit. To high a level, and you lose the ability to gain meaningful information and insights. If it's important to track and analyze "reversals", there should be separate records (and dimension information) to track this. If it isn't (or unlikely to be), then this information should be aggregated.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply