Sparse facts

  • I am working on a dw project for a bank where we are consolidating data from many source systems for products such as current accounts, mortgages, credit cards, savings etc.

    When I look at the design of the fact table I have a dilemma.

    Let's say that there are 30 different facts across all the products and that some (e.g. current balance) are common to all products types. However, there are other facts (such as credit/overdraft limits) which only relate to some of the product types and wouldn't apply to the savings product type. Other facts relate to a specific product type only.

    I have come up with a number of possible designs for storing the facts:

    - a single, wide fact table holding all the different fact but with sparse data (i.e. nulls where the fact is not relevant)

    - a normalised fact table with a single fact column and a fact dimension describing the fact (e.g. current balance, interest amount etc)

    - a separate fact table for each type of product with specific fact columns on each table

    Each has pros and cons.

    The single wide fact table is the simplest solution but there would be lots of nulls in each row (each product type might only have 10 facts and the remainder would be null). We are using SQL 2008 R2 so we could use SPARSE COLUMNS to reduce space but the business would have to be aware of the null values in their analysis.

    The normalised fact table avoids storing any nulls as we would only have a fact row where the fact is appropriate (i.e. no loan type facts for savings accounts). However, with tens of millions of accounts and 10 facts per account the fact table will contain a very large number of rows (even using techniques such as only recording changes in a value rather than the value everyday) and the business will want to see a number of facts at a time rather than one specific fact.

    The separate fact table per product type only stores the necessary facts but makes answering questions that span product types (e.g. what is the total credit balance?) more difficult to answer.

    How have other people dealt with this type of scenario? Is there another solution that I have not thought about?

    Comments please.

    Jez

  • Jez,

    A multi-fact table is a better approach. A fact table with sparse data will hamper the record retrieval process.

    Raunak J

  • Thanks.

    That was the decision I was coming to but great to have someone think the same way.

    Jez

  • You might want to do some research some stuff by Ralph Kimball. I remember one of the articles talking about the complexity of making a DW for a bank precisely because of the issues you have mentioned. As they didn't apply to me, I didn't ready the article beyond noting it for future reference, but it couldn't hurt to at least know his recommendation.

  • Kimball seems to mentions two different approaches. One involves creating a fact dimension to describe the single fact (e.g. balance, credit interest, debit interest etc) on the fact table so there is no sparsity in the fact table as there is only one fact per row. However, as there might be at least 5 facts per account (per day) and maybe more with tens of millions of accounts that means a very large number of fact rows. There are techniques for reducing the number of rows by only recording a row when the fact changes but with credit card and current account balances changing every few days there will still be a large number of rows.

    The other technique that Kimball mentions is 'outrigger' fact tables in a similar manner to out rigger dimension tables. There would be a core fact table for common facts (such as balance) with its own identity column linking to separate fact tables containing discrete facts for other products.

    This seems overly complicated compared with separate fact tables for separate product types.

    Thanks for the thought.

    Jez

  • For this situation I have successfully used what I called an "annex" fact table approach.

    The main fact table has its own surrogate key plus one column for each common measure and common dimension. Call this table FactMain, with surrogate key FactMainKey.

    If a sparse measure (i.e., a measure not common to all sources) is required, a table called Fact[MeasureName] is created with two columns: FactMainKey and [MeasureName] and is populated only with rows when the measure is used. For example, say the sparse measure is NotionalUSD. Then the annex table would be called FactNotionalUSD and its columns would be FactMainKey and NotionalUSD.

    If a sparse dimension (i.e., a dimension not common to all sources) is required, a table called Fact[DimensionName]Key is created with two columns: FactMainKey and [DimensionName]Key and is populated only with rows when the dimension is used. For example, say the sparse dimension is DimCounterparty, which has DimCounterpartyKey as its surrogate key. Then the annex table would be called FactDimCounterpartyKey and its columns would be FactMainKey and DimCounterpartyKey.

    To make population of the annex tables facile, you may create a new table (we called it DimAnnex) that lists all the annex tables in use. The DimAnnex columns for a simple install could be DimAnnexKey SMALLINT, IsMeasure BIT, AnnexName VARCHAR(255).

    This approach is a vertical partitioning approach. Because the annex tables are so narrow, no indexes are required for them.

    Of course this approach works best when reports and/or queries do not need every column. If your application usually pulls all columns you are better off with a single Fact table.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply