If you are new to dimensional modeling for business intelligence projects. Determining how to create a star schema against your company's data can be challenging. Which tables and columns should be included in dimensions and which ones should be in a fact table? This article covers how to create a dimensional model from a public data source the World Economic Data dataset (Link). I'm using an Excel file because business users often come to a BI team with an example of the data they need. BI Teams then have to turn these spreadsheets into useful data repositories that are easier to access across the company. For this exercise, the Excel file is readily available and has a good mix of dimensions and many years of metrics.
The WEO Entire Dataset file contains the following fields:
- WEO COUNTRY CODE
- ISO
- WEO SUBJECT CODE
- COUNTRY
- SUBJECT DESCRIPTOR
- SUBJECT NOTES
- UNITS
- SCALE
- COUNTRY/SERIES-SPECIFIC NOTES
- 1980 - 2013 (as column names, each column contains metrics for the year)
In order to create a star schema dimensional model from this file. Begin by reviewing the content of the file and examine each fields contents. You are looking for text that duplicates row after row and numbers that are metrics. To build the basic star schema the facts will need to be isolated from the dimensions. A Fact table stores the metrics or numbers for a specific record or event. The dimensional tables store the labels about the records in the Fact Table.
In the source data for World Economic Data the facts are stored in each years columns. For example, row 2 of the file shows Gross Domestic Product for Afghanistan in 2006 is 238.519. The number 238.519 is the metric for the fact table and Afghanistan, 2006 and Gross Domestic Product are the dimension labels. After reviewing all of the fields in the file the high level logical star schema would look like this:
The next step would be to create the physical data model and map the required columns to each dimension and fact table. When creating the fact and dimension tables leverage the use of surrogate keys. A surrogate key is an new identity column created in the star schema that is independent from the source systems original key otherwise known as a natural key. Create all surrogate keys as a numeric or integer data type as the primary key on the table. Having a primary key as a numeric or integer data type enables the SQL query processor to make the joins faster than using an alphanumeric key. Also create friendly names for the non key columns on the dimension tables to make the content more understandable for your users. After considering the content in all of the fields the completed physical model would be designed like this.
Each dimension table should be populated with distinct records to populate each row. This eliminates storing the data multiple times. For example there will only be 4 rows for the DIM_SCALE_T table: Billions, Millions, Units, and [Blank]. These 4 records represent the distinct rows from the Scale column in the spreadsheet. Once populated the DIM_SCALE_T table would contain these records:
DIM_SCALE_PK | SCALE_NAME |
---|---|
1 | Billions |
2 | Millions |
3 | Units |
4 | [Blank] |
The DIM_COUNTRY_T table would contain the distinct values for the columns along with the new (surrogate) primary key. Starting with Afghanistan the primary key would be 1 and incremented for each new row. In this data set there is 189 distinct countries. This reduces the amount of records that queries would need to search when running reports. Instead of needing to search through 8,316 rows in the source file, there are only 189 rows required in the star schema. Thereby making the queries a faster to execute. To illistrate how to populate this table the first 3 records are displayed below:
DIM_COUNTRY_PK | WEO_COUNTRY_ID | ISO_CODE | COUNTRY_NAME |
---|---|---|---|
1 | 512 | AFG | AFGHANISTAN |
2 | 614 | AGO | ANGOLA |
3 | 914 | ALB | ALBANIA |
After populating each dimension table with distinct rows you can then populate the fact table. The fact table represents the combination of unique rows for all dimensions from every year's metric value. The table below represents how the first 3 records for Afghanistan could look for 2006 and subjects NGDP_R, NGDP_RPCH, NGDP repsectively.
WEO_FACT_PK | DIM_SUBJECT_PK | DIM_UNIT_PK | DIM_SCALE_PK | DIM_COUNTRY_PK | DIM_YEAR_PK | METRIC_VALUE |
1 | 1 | 2 | 1 | 1 | 27 | 238.519 |
2 | 2 | 1 | 4 | 1 | 27 | 5.554 |
3 | 3 | 3 | 1 | 1 | 27 | 352.308 |
Afghanistan would contain a total of 44 rows in the WEO_FACT_T table once fully populated for the metric values in 2006. After loading the records for the remaining countries for 2006 there would be a total of 8,316 rows representing each data point for all countries. For each year that is added to the WEO_FACT_T table there will be an additional 8,316 records. If the entire dataset from 1980-2018 is loaded it will contain (8,316 rows * 39 years) or 324,324 rows of data.
Upon populating this star schema you will have a base dimensional model that can be used. There are two common methods to leverage the star schema. First, you can run reports directly against the schema since the data is de-normalized and optimized for reporting. Second you can use the model for an OLAP engine to consume such as SSAS.