June 22, 2010 at 3:41 am
Hi all,
I'm dealing with a relatively large fact table: 60 million rows and about 14 GB in size.
I'm building Cognos reports on the fact table (no cubes, not yet).
What the developer has done is this: he added the natural keys to the fact table. Whereas a 'normal' dimensional fact table has only the surrogate keys to the dimensions, this fact table contains first the surrogate keys, then the measures, and THEN all natural keys (i.e. customer number, product number, etc.).
Now I know that Kimball mentions this design technique in his toolkit, but I'm worried about how all the extra columns affect performance of fact table queries, as the natural keys (all varchars) are expensive in terms of size.
But (and this is my question) how will the performance be influenced if I only select the proper surrogate dimensional keys in a query?
Will the extra columns in the table influence the query even though they are not included in the query?
In other words, how will the actual size of the table influence a query if only a few columns are included in the query itself?
In my example, I think removing the natural keys will decrease the table size from 14 GB to 2 GB. But... will this actually improve the performance of fact table queries (as they already include only the surrogate keys)?
Thanks for your replies!
Kind regards
June 22, 2010 at 5:39 am
Your queries against this fact table will most likely result in scans of large ranges of data, and most of that data will have to be read from disk.
If that is the case, the number of rows you can fit on a page will affect the execution time.
If you include all the large business keys, there will be fewer rows per page, so the server will have to read more pages from disk.
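If you want a rough picture of how many rows currently fit on a page, something like this will show it (the table name FactSales is just a placeholder for your fact table):

-- Total size and row count
EXEC sp_spaceused 'dbo.FactSales';

-- Approximate record size and rows per page at the leaf level of the clustered index
SELECT index_id,
       avg_record_size_in_bytes,
       page_count,
       record_count / NULLIF(page_count, 0) AS avg_rows_per_page
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.FactSales'), NULL, NULL, 'SAMPLED')
WHERE index_id = 1 AND index_level = 0;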
To test the effect of this, you could create a non-clustered index that includes all columns except the large business keys.
Run the query with and without this index and compare the execution time and the number of logical reads.
When comparing execution times you should empty the cache first with DBCC DROPCLEANBUFFERS to force a read from disk.
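As a sketch, the test could look roughly like this - all table and column names below are placeholders, so adjust them to your own schema:

-- Index with everything except the varchar business keys
CREATE NONCLUSTERED INDEX IX_FactSales_NoBusinessKeys
ON dbo.FactSales (DateKey)
INCLUDE (CustomerKey, ProductKey, SalesAmount, Quantity);

DBCC DROPCLEANBUFFERS;      -- empty the buffer cache so the data is read from disk
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT DateKey, SUM(SalesAmount) AS SalesAmount
FROM dbo.FactSales
GROUP BY DateKey;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- Drop the index, clear the cache again and rerun the query, then compare
-- the logical reads and elapsed time reported in the Messages tab.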
If you get a significant improvement with the index you could either just keep the index or you could remove these extra columns from the real fact table.
Keeping the index will use more disk space, and it will slow down inserts and updates of the fact table.
/SG
June 22, 2010 at 7:58 am
Hi Steven,
thanks for your reply!
If I understand you correctly, the columns with the business keys in the fact table will ALWAYS slow performance, even though they will never be included in any query, because they decrease the number of rows SQL Server can store on one page?
(For the record, I understand the columns were only added to be able to retrieve the origin of the -1 or 'unknown' facts in the fact table, i.e. facts that don't have a corresponding business key in the dimension. In my opinion a flawed technique: records that don't have a corresponding business key in the dimension should not be included in the fact table (perhaps in an error fact table, OK...). It consumes a LOT of space!)
All the surrogate keys to the dimensions have non-unique non-clustered indexes. There is also a primary key, surrogate in nature (an identity column), with a unique clustered index. Would you say it's better to implement a unique clustered index on the combination of ALL surrogate key columns?
Or is the identity surrogate primary key a good alternative?
Thanks again for your response!
Kind regards
June 22, 2010 at 8:17 am
Hi Stefan,
never mind my surrogate vs. compound key question for the fact table.
I already understand there are some ETL advantages to using a surrogate as the primary key for the fact table.
Kind regards,
June 22, 2010 at 8:27 am
M__M (6/22/2010)
Hi Steven,
Stefan 😉
If I understand you correctly, the columns with the business keys in the fact table will ALWAYS slow performance, even though they will never be included in any query, because they decrease the number of rows SQL Server can store on one page?
Correct. How much it will slow things down for you depends on a lot of factors, but there will always be some slowdown.
(For the record, I understand the columns were only added to be able to retrieve the origin of the -1 or 'unknown' facts in the fact table, i.e. facts that don't have a corresponding business key in the dimension. In my opinion a flawed technique: records that don't have a corresponding business key in the dimension should not be included in the fact table (perhaps in an error fact table, OK...). It consumes a LOT of space!)
There are lots of different ways to handle missing dimension rows. Another possibility is to add "inferred" members to the dimension tables. The business key would then be found in the inferred dimension member. The fact table would only need the surrogate key.
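A minimal sketch of how inferred members could be added during the ETL load (DimCustomer, StageSales and CustomerNumber are just example names):

INSERT INTO dbo.DimCustomer (CustomerNumber, CustomerName, IsInferred)
SELECT DISTINCT s.CustomerNumber, 'Unknown', 1
FROM dbo.StageSales AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DimCustomer AS d
                  WHERE d.CustomerNumber = s.CustomerNumber);

-- The fact row then gets the surrogate key of the inferred member through the normal
-- dimension lookup, so the business key never has to be stored in the fact table.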
All the surrogate keys to the dimensions have non-unique non-clustered indexes. There is also a primary key, surrogate in nature (an identity column), with a unique clustered index. Would you say it's better to implement a unique clustered index on the combination of ALL surrogate key columns?
Or is the identity surrogate primary key a good alternative?
The clustered key in such a big fact table should always be as narrow as possible - preferably a single int. Remember that the clustered key is included in each of the non-clustered indexes.
To avoid page splits the clustered column should be steadily increasing over time.
For big fact tables I usually use partitioning on a date column. I usually use this date column as a non-unique clustered index key.
Using an identity column is also a reasonable option; however, such a clustered key will not help in any queries against the fact table. Using an appropriate date column might increase the performance of some queries a lot.
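As a rough sketch (object names and boundary values are only examples, and it assumes the existing clustered primary key is dropped or made non-clustered first), partitioning on an integer date key and clustering on it would look something like this:

CREATE PARTITION FUNCTION pfFactDate (int)
AS RANGE RIGHT FOR VALUES (20090101, 20100101, 20110101);

CREATE PARTITION SCHEME psFactDate
AS PARTITION pfFactDate ALL TO ([PRIMARY]);

-- Non-unique clustered index on the date key, aligned with the partition scheme
CREATE CLUSTERED INDEX CIX_FactSales_DateKey
ON dbo.FactSales (DateKey)
ON psFactDate (DateKey);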
Using a primary key composed of all dimension columns will only help you avoid inserting duplicate rows in the fact table. It will slow down inserts, and it will consume space. It will not help query performance much.
I usually have a non-clustered unique index on the logical primary key in the fact table during development to catch bugs in the ETL process. When the system is in production I usually remove this index just to improve ETL performance.
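For example (names are again placeholders):

-- During development: catch duplicates produced by the ETL
CREATE UNIQUE NONCLUSTERED INDEX UX_FactSales_LogicalKey
ON dbo.FactSales (DateKey, CustomerKey, ProductKey);

-- Once the ETL is trusted in production:
DROP INDEX UX_FactSales_LogicalKey ON dbo.FactSales;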
Also, actually having foreign keys to all dimension tables might decrease insert performance too much. I usually turn off foreign keys once the system is in production.
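Disabling a foreign key looks like this (the constraint name is hypothetical):

ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_DimCustomer;
-- ...loads run without the constraint being checked...
-- To turn it back on as a trusted constraint later:
ALTER TABLE dbo.FactSales WITH CHECK CHECK CONSTRAINT FK_FactSales_DimCustomer;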
It is all a balance between correctness, query performance and ETL performance.
June 22, 2010 at 1:14 pm
Hi Stefan,
thanks again for the response.
I'll use your feedback!
Kind regards