May 16, 2008 at 3:37 am
Hi - I'm currently working on a project to build a data mart for a web site; the data mart is to measure traffic coming to the site. The website tracks page impressions, so we have a row of data for each page that a user views.
The fact table (fctPageImpression) only requires a few dimensions: Date, Time, User, Page, Source (where the user came from) and Referrer. There are two measures, Page Impressions and Users: Page Impressions will be calculated by summing the Amount column, and Users by performing a distinct count on the UserKey column.
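To make the two measures concrete, here is a minimal sketch in Python (the sample rows and key values are invented for illustration; the column names follow the post):

```python
# Hypothetical sample rows from fctPageImpression.
facts = [
    {"DateKey": 20080516, "UserKey": 1, "PageKey": 10, "Amount": 1},
    {"DateKey": 20080516, "UserKey": 1, "PageKey": 11, "Amount": 1},
    {"DateKey": 20080516, "UserKey": 2, "PageKey": 10, "Amount": 1},
]

# Page Impressions is additive: just sum the Amount column.
page_impressions = sum(r["Amount"] for r in facts)

# Users is non-additive: a distinct count over UserKey.
users = len({r["UserKey"] for r in facts})

print(page_impressions, users)  # 3 2
```

The distinction matters for the design discussion below: an additive measure can be pre-aggregated freely, whereas a distinct count cannot simply be summed across pre-aggregated rows.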
I estimate this fact table will have half a billion rows per year added to it.
Please view the attached PageImp_ERD_a.jpg.
While getting deeper into the detail of the design, I have realised that 90-95% of the queries are based on combinations of channel, date and source, measuring users and page impressions. It is also true that 95% of these measure users rather than page impressions.
Based on this, I am considering an alternative design - please see the attached PageImp_ERD_b.jpg.
The difference between a and b is that the fact table has been split in half, with an AggKey added to join the two halves back together. PageKey and TimeKey have been moved into the detail fact, and ChannelKey has been added to the agg fact.
I estimate that the agg table will have 60% fewer rows than the detail table (per year that works out to 300,000,000 fewer) and will service 90% of the queries. BUT adding the detail table and relating it via the AggKey means that no detail is actually lost, because that join gives me back the full detail.
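A rough sketch of the split design, using an in-memory SQLite database purely for illustration (the real platform would be SQL Server; table and column names are assumed from the post, and the sample values are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Agg fact: one row per Date/User/Source/Channel combination.
cur.execute("""CREATE TABLE fctPageImpressionAgg (
    AggKey INTEGER PRIMARY KEY, DateKey INT, UserKey INT,
    SourceKey INT, ChannelKey INT, Amount INT)""")

# Detail fact: PageKey and TimeKey, linked back via AggKey.
cur.execute("""CREATE TABLE fctPageImpressionDetail (
    AggKey INT, PageKey INT, TimeKey INT)""")

# Two page views by the same user, same day/source/channel, collapse
# into one agg row (Amount = 2) plus two detail rows.
cur.execute("INSERT INTO fctPageImpressionAgg VALUES (1, 20080516, 7, 3, 2, 2)")
cur.executemany("INSERT INTO fctPageImpressionDetail VALUES (?, ?, ?)",
                [(1, 10, 830), (1, 11, 845)])

# The 90-95% case: users and page impressions straight off the smaller agg table.
users, impressions = cur.execute(
    "SELECT COUNT(DISTINCT UserKey), SUM(Amount) FROM fctPageImpressionAgg"
).fetchone()

# The rare detail case: join back via AggKey, so no information is lost.
detail = cur.execute("""
    SELECT a.DateKey, a.UserKey, d.PageKey, d.TimeKey
    FROM fctPageImpressionAgg a
    JOIN fctPageImpressionDetail d ON d.AggKey = a.AggKey
""").fetchall()

print(users, impressions, len(detail))  # 1 2 2
```

This shows the claimed property: the agg table alone answers the common user/impression queries, and the AggKey join reconstructs page- and time-level detail when needed.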
Given the gains, this doesn't seem like a completely insane idea, but it's not something I've ever seen done before. Can I get some general thoughts, please?
I'd really appreciate some considered thought on this.
Thankyou
Bob.
May 16, 2008 at 6:38 am
I think you are building a table with aggregations and using it with a product (Analysis Services - I assume) that is designed to calculate aggregations. Sound funny when I say it like that?
Why wouldn't you just use the first design and add aggregations specifically for the higher level group?
Your warehouse is super-simple. Don't over-complicate it for no reason. I would leave this as a simple star schema. If you include this summary table, you will suddenly have to keep the summary and detail in sync - this will end up more troublesome than you think a year from now. Let AS do its job; it is pretty good at it.
May 16, 2008 at 6:39 am
You will end up with a pretty good number of facts, so it is probably a good idea for you to look into partitioning right away.
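The point of partitioning is that a query filtered on the partitioning column only touches the partitions it needs. A minimal sketch of that idea in Python (the real mechanism would be SQL Server partitioned tables; partitioning by month on DateKey is an assumption for illustration):

```python
from collections import defaultdict

# month (yyyymm) -> list of (DateKey, UserKey, Amount) fact rows
partitions = defaultdict(list)

def insert(date_key, user_key, amount):
    # Route each fact row to its month's partition.
    partitions[date_key // 100].append((date_key, user_key, amount))

insert(20080415, 1, 1)
insert(20080516, 1, 1)
insert(20080516, 2, 1)

# A query filtered to May 2008 scans only that partition's rows,
# not the whole fact table.
may = partitions[200805]
print(len(may), sum(a for _, _, a in may))  # 2 2
```

With half a billion rows a year, monthly or even daily partitions also make loading, index maintenance and archiving far more manageable than one monolithic table.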
May 16, 2008 at 6:49 am
Hi, thanks for replying. The mart is more complex than that - there are 6 other fact tables and 25 other dimensions; I've cut it down to show just what's relevant for this post.
The reason for design b is twofold. First, I would save 300,000,000 rows of data a year from the more heavily used (agg) table, since only a handful of queries go down to the detail of page and time - that saves significant storage, which has its own benefits for recovery, re-indexing etc. Second, it would dramatically increase the performance of the 'user'-related queries, as these are always distinct counts on that key.
I know what you're saying about over-complicating the model - that's a data warehousing ethos I stick by too. But design is about trade-offs, in my opinion, so I'm not sure.