Pitfalls to Avoid when Designing OLAP Cubes
Creating even the simplest of cubes can have its pitfalls. I have found the biggest problem with building cubes is the quality of the data. If a person has very clean data, building cubes will be a breeze. The second biggest problem is simply poor cube design.
Here is a list of some of the pitfalls people fall into when designing cubes. This is just a partial a listing of the many problems you can encounter when designing OLAP Cubes.
- Building the world in one cube.
- Putting unrelated data in the same cube
- Different levels of granularity between the fact table and dimension table
- Fact tables have more foreign key members than exist in the dimension table
- Not taking advantage of Virtual Dimensions
Building the World in One Cube
A common problem with cube designs (especially people building their first cubes) is trying to put everything into one cube. This occurs when the cube developer attempts to put every possible dimension into one cube. A good cube design requires some homework before the developer opens the Analysis Services application. The analyst must gather requirements by interviewing the end user community. The results of these interviews should determine what business questions the end users hope to be able to answer from a multidimensional application. A perspective should be established that provides answers to a majority of the end users questions. It is not always necessary to be able to answer all the end users questions in one cube. A typical cube consists of 5 to 10 dimensions. You can have more than 10 dimensions. However, the average comprehension for most people is about 7 dimensions. I have read this factoid in several different sources (I don’t remember which ones though). I figure studies have been done, people answered surveys, someone made money, the article was published, people agreed with it and it became common practice. A good design may consist of several cubes with a virtual cube that joins two or more of the cubes together.
Putting Unrelated Data in the Same Cube
On one of my first cube projects, the customer (a transportation company) wanted to replicate a set of reports dealing with trucking and intermodal (railroads). In order to replicate the reports several dimensions were created that were only relevant to one part of the business and not the other. For instance, three dimensions pertained only to intermodal (railroad) and two dimensions pertained only to trucking. Each additional dimension makes the cube exponentially larger. The method to resolve the problem was to simply split the cube into two cubes: one for trucking and one for intermodal.
Different Levels of Granularity
I built a cube that consisted of monthly data. The Time dimension is a shared dimension based on the daily level. The fact table had a column for Year Month combination that was a foreign key to the YearMonth column in the dimension table. When I processed the cube I noticed that the measures were 31 times larger than I had expected. The problem was that the YearMonth was not unique in the Time dimension since there were 31 days in that particular month. This is a many-to-many relationship. The problem was resolved by creating a column in fact table called FirstDayOfMonth that is derived from the YearMonth column by repositioning the month and year and inserting ‘/01/’ between the month and year. Where the YearMonth column has the value ‘200203’; this is repositioned to create a value of ‘03/01/2002’ for the column FirstDayOfMonth. Using the FirstDayOfMonth column to be joined to the Time dimension creates a one-to-many relationship. Another solution would be to create another time dimension with granularity at the YearMonth level.
Fact Table has more Keys than the Dimension Table
This problem occurs where some of the rows of data in the fact table are orphaned because they cannot be joined with one or more dimensions. A couple of mistakes may have happened here that caused the problem. One could be the dimension tables were created from a different data source than the foreign keys in the fact table. To correct the problem, the foreign keys in the fact table should be derived from the same data source used to create the dimension table(s). Another potential cause of the problem could be invalid data in either the fact table or the dimension table. For instance, if you are joining the fact table with dimension table based on state codes in the U.S. and you have a state code with a value of ‘ZZ’ in the fact table you might have some problems joining this data with a dimension table that has only valid state codes. This problem should be tracked back to original source. Meanwhile, you have a choice of leaving these records out of the cube or substituting a valid value until the problem is fixed on the front end or wherever the bad data showed up between the original source and the insertion into the fact table. Be sure to advise the end user of the data anomaly. They may also have some input on how the bad data occurred and/or how to handle the bad data.
Not Taking Advantage of Virtual Dimensions
For each dimension defined in the cube the number of data cell intersections increases exponentially. For instance, if you have 4 dimensions (Time, Product, Districts and Salesmen) the total number of data cell intersections is calculated by multiplying the number of members in each dimension. Let’s say Time has 17 members, Product has 50 members, Districts have 10 members, and Salesmen have 25 members. The total number of data cell intersections is calculated as follows: 17 x 50 x 10 x 25 or 212,500 possible data intersections. Now let’s say we need to add another dimension that is an attribute of Product called Product Models, which consists of 100 members. If we add this to the cube as a regular dimension then the total number of possible data intersections increases to 21,250,000 members. By making this a virtual dimension you keep the number of possible data intersections down to 212,500. Virtual dimensions are created from the member properties of another dimension. They do not create additional data cell intersections and the end user sees it as a regular dimension. Use Virtual dimensions as often as possible.
These are just a few of the design problems you can encounter when building cubes with Analysis Services. I am sure there are many more pitfalls that were not covered in this article. Please feel free to share your experiences and resolutions in the Analysis Services Discussion Forum.