March 9, 2018 at 10:08 am
Hello;
I'm new to Dimensional Modeling. I've been reading Ralph Kimball's book The Data Warehouse Toolkit. In it he recommends creating separate Date and Time dimensions. He explains that the Date dimension is the only dimension which could/should/can use a durable/natural key as the surrogate key since dates by, by definition, never change the "identity". He also explains that a record should be added to the dimension to represent null dates coming from the data. I've seen the date 9999-12-31 used for this date (the surrogate key would be 99991231). This all makes sense to me. However I do have a few of questions about a row to represent NULL values and am seeking some advice
I have similar questions relating to a Time dimension. If my time dimension has a granularity of seconds (86,400 rows), all of which represent valid times, what time value would I use in my time column to represent NULL time values? Also, I suppose that, as with dates, I could use the time as the surrogate key but is the any merit in that or should I just use an integer (e.g. 00:00:00 = 1 and 23:59:59 = 86400)? The other questions above also apply to the time dimension.
I've written a SQL script to build a date dimension. I've attached it for your viewing pleasure. Our business has information about things dating back to the mid 1800s and issues regulatory approvals with 100 year expiration dates hence the uncommonly large date dimension table. If you want to run the script, in order for it to work you'd have to have a table of dates representing business closure days (It just has two columns [Cosure_Date] AS SMALLDATETIME and [Closure_Name] AS VARCHAR(32)).
Thanks in advance for any advice you can provide.
March 9, 2018 at 10:24 am
Why exactly do you want a time dimension? And what dimension data are you actually trying to capture that would need to be granular down to the second but is generic enough that it wouldn't also need a date to be relevant?
March 9, 2018 at 10:46 am
We are a government agency regulating energy resource development. We are "open" (we need to be able to be contacted) 24 hours a day. The time component of a transaction can be important in lag/duration calculations. e.g. time to respond to reported upset condition.
Since I'm new to dimensional modeling all I have to go by is the direction I've read in Ralph Kimball's book where he recommends separating the date and time dimensions.
A robust time dimension could have columns which define "breakfast time", "lunch time", "business hours", "morning", "evening", "night" etc. and would enable queries such as; what events occurred "in the morning", "over lunch", "during business hours", "after business hours", "evening", "night" etc.
Most of our systems capture dates as DATETIME. I can imagine scenarios where either us or a stakeholder, could make/save changes seconds apart at any time of the day/night.
It could be that I'm off base here and this is unrealistic in the real world...
I should also point out that, at present, it is VERY early days for our dimensional data warehouse and we don't presently have a specific requirement for time dimension. I presume that like a date dimension, it is one of those "bellybutton" conformed dimensions that every data warehouse should have. Is that wrong?
March 9, 2018 at 12:46 pm
A day dimension is pretty standard in a datawarehouse, yes you could have a time dimension but the use cases would be significantly less especially if you want time dimensions that are useful without also looking at date. For example from the data points you gave let's take say during business hours, well doesn't that change if it's a holiday or a weekend? There's also nothing wrong with pulling the original datetime values into the data warehouse along with the date key.
March 9, 2018 at 2:58 pm
CanuckBuck - Friday, March 9, 2018 10:08 AM...He also explains that a record should be added to the dimension to represent null dates coming from the data. I've seen the date 9999-12-31 used for this date (the surrogate key would be 99991231). This all makes sense to me. However I do have a few of questions about a row to represent NULL values and am seeking some advice
- If the reason for the null can be determined i.e
- The date is not applicable in context of a particular fact e.g. 9999-12-30
- The date has not yet occurred in the context of a particular fact e.g. 9999-12-29
- A plan date which is yet to be determined e.g. 9999-12-28
- There was a failure to capture the date e.g. 9999-12-27
- The date is unknown for an unknown reason e.g. 9999-12-31
Is there merit in creating multiple "unknown date" records which more explicitly specify the reason the date is missing?- How do I "document" in the dimension that a "date" is not really a date but rather that is is a value which represents so that it shows up correctly in for BI products;
- A future date
- The date is not applicable
- The date was not captured
- A date that is truly unknown
I'll give you my take on these 2 questions.
1. Yes, it is very common to have multiple special rows in a date dimension. For example, I often have a row similar to yours with a year 9999 value to represent a "no end date" situation for a date range, and another row with a date before any of my real dates to represent a "not applicable" unknown value.
2. As part of your the date dimension, you can have a column where you put that descriptive text in.
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply