January 29, 2014 at 9:11 am
If I want to populate a ReasonForChange column in a dimension table for Type 2 changes, is it best practice to populate that column in the new row (the one with an EndDate of 9999-12-31 and IsCurrent = 'Y') or the old row (with EndDate = the date the attribute changed and IsCurrent = 'N')?
Or both perhaps?
I'm fine with writing the code to identify the ReasonForChange, just not sure where to write the value.
Thanks
Lempster
February 26, 2014 at 1:03 pm
I'm by no means an expert of any kind on data warehousing so take this as my opinion.
I would put it in the expired record since the change is the reason why the record expired. The current record has no reason for change since it hasn't changed.
February 26, 2014 at 1:32 pm
It depends.....mostly on the level of tracking data that you want to keep. If you are storing something minor, then it is probably not an issue to keep it in the table. Just make sure you document and stay consistent on which record has the change info. Normally, I would put it in the new record and try not to touch the old record.
We have systems where they want to search for which column changed, so we have a second table to store the changed data in column/value format. It may seem redundant, but it makes the searches run fast.
I have seen approaches where the keep the change data in XML format. Any design that would use blob data should use a separate table.
February 26, 2014 at 9:33 pm
Lempster (1/29/2014)
If I want to populate a ReasonForChange column in a dimension table for Type 2 changes, is it best practice to populate that column in the new row (the one with an EndDate of 9999-12-31 and IsCurrent = 'Y') or the old row (with EndDate = the date the attribute changed and IsCurrent = 'N')?Or both perhaps?
I'm fine with writing the code to identify the ReasonForChange, just not sure where to write the value.
Thanks
Lempster
You don't need the pain of maintaining an "IsCurrent" column for Type 2 SCDs. If you have both a start and end date per TYPE 2 SCDs, the dates are good enough to tell you what is current especially since you were smart enough to NOT use a NULL end date. I'll also recommend that you don't actually use 9999-12-31 as an end date. Instead, use 9999-12-30 or even just '9999' (which will auto-magically convert to 9999-01-01) so you have at least 1 day of "headroom" for certain range calculations.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 27, 2014 at 2:14 am
Thanks for the replies folks. As JustMarie and EricEyster made opposing arguments, I guess the most important point is to be consistent!
Jeff, I don't see it as much of a pain to maintain a IsCurrent flag; I agree that it could be seen as superfluous, but I'm just following Kimball best practice which states:
...the current-flag provides a rapid way to isolate exactly the set of dimension members that is in effect at the moment of the query.
So it's for ease of querying more than anything else. Thanls for the tip about the EndDate value though - I think I can afford to lose most of the year 9999 :-). (Although thinking in that vein lead to the Y2K problem didn't it? ;-))
February 27, 2014 at 5:41 am
Lempster (2/27/2014)
Thanks for the replies folks. As JustMarie and EricEyster made opposing arguments, I guess the most important point is to be consistent!Jeff, I don't see it as much of a pain to maintain a IsCurrent flag; I agree that it could be seen as superfluous, but I'm just following Kimball best practice which states:
...the current-flag provides a rapid way to isolate exactly the set of dimension members that is in effect at the moment of the query.
So it's for ease of querying more than anything else. Thanls for the tip about the EndDate value though - I think I can afford to lose most of the year 9999 :-). (Although thinking in that vein lead to the Y2K problem didn't it? ;-))
As long as you are consistent on assigning the end of time value, that becomes your isCurrent flag.
where endDate = '99991230'
February 27, 2014 at 6:31 am
EricEyster (2/27/2014)
As long as you are consistent on assigning the end of time value, that becomes your isCurrent flag.where endDate = '99991230'
Yeah, I get that.
February 28, 2014 at 7:43 am
I also maintain a pair of dates as well as a current flag. On the current records the end date is null however. On my agent tables, which are SCD2 because agent numbers are reused, I expose the flag as an active attribute. THere may be others where that is also usefully exposed as an attribute.
February 28, 2014 at 10:10 am
After reflection I'm going to change my answer to agree with EricEyster.
People aren't going to look in the previously expired record to see why it expired. The current record is the one that's getting all the attention so it should have the necessary info as to why it's the current record.
February 28, 2014 at 2:00 pm
It is common practice to include both RowStartDate, RowEndDate, and RowIsCurrent. Here is Kimball Group's (Warren Thornwaite) thinking regarding "RowChangedReason" along with the code to do it:
http://www.kimballgroup.com/2006/06/01/design-tip-80-adding-a-row-change-reason-attribute/
February 28, 2014 at 8:55 pm
Lempster (2/27/2014)
Thanks for the replies folks. As JustMarie and EricEyster made opposing arguments, I guess the most important point is to be consistent!Jeff, I don't see it as much of a pain to maintain a IsCurrent flag; I agree that it could be seen as superfluous, but I'm just following Kimball best practice which states:
...the current-flag provides a rapid way to isolate exactly the set of dimension members that is in effect at the moment of the query.
So it's for ease of querying more than anything else. Thanls for the tip about the EndDate value though - I think I can afford to lose most of the year 9999 :-). (Although thinking in that vein lead to the Y2K problem didn't it? ;-))
Put an index on the "Current-Flag" and watch your GUI's timeout when they try to do an INSERT because of the massive extent splits that will occur. 😉 I recommend just doing the dates correctly.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 3, 2014 at 3:08 am
Jeff Moden (2/28/2014)
Put an index on the "Current-Flag" and watch your GUI's timeout when they try to do an INSERT because of the massive extent splits that will occur. 😉 I recommend just doing the dates correctly.
I'm talking about a Data Warehouse so there aren't going to be any GUIs to timeout, certainly not any doing inserts. There will of course be inserts on a daily basis due to the ETL process. The relational tables in the Data Warehouse will have multidimensional cubes built on them and it will be the cubes that are queried by end users, not the relational tables directly.
I will of course undertake extensive testing, but at this point I'm inclined to follow Kimball best practice.
Regards
Lempster
March 3, 2014 at 6:32 am
Lempster (3/3/2014)
Jeff Moden (2/28/2014)
Put an index on the "Current-Flag" and watch your GUI's timeout when they try to do an INSERT because of the massive extent splits that will occur. 😉 I recommend just doing the dates correctly.I'm talking about a Data Warehouse so there aren't going to be any GUIs to timeout, certainly not any doing inserts. There will of course be inserts on a daily basis due to the ETL process. The relational tables in the Data Warehouse will have multidimensional cubes built on them and it will be the cubes that are queried by end users, not the relational tables directly.
I will of course undertake extensive testing, but at this point I'm inclined to follow Kimball best practice.
Regards
Lempster
Things change a little if you are going to use SSAS. The DW becomes little more than a data store to facilitate the ETL processes. Sure, you need enough to also support your debugging when things go bump in the night, but the Kimball design assumes your users are getting data from the relational engine.
If you want to display the isCurrent flag for testing or for ease of loading to SSAS, create a view to calculate the isCurrent flag using a case statement on the endDate.
March 3, 2014 at 6:40 am
Kimball design assumes your users are getting data from the relational engine
Would you mind explaining further what you mean by this? It's possible that I'm not understanding something, but Kimball is geared to SSAS and SSAS is not pulling the data from the relational engine. What am I missing?
March 3, 2014 at 6:56 am
RonKyle (3/3/2014)
Kimball design assumes your users are getting data from the relational engine
Would you mind explaining further what you mean by this? It's possible that I'm not understanding something, but Kimball is geared to SSAS and SSAS is not pulling the data from the relational engine. What am I missing?
Yes, follow the Kimball design in the SSAS database. Assuming you are using MOLAP, SSAS pulls the data from the relational engine during dimension/partition processing and does not touch it again. We have systems that rebuild a single partition each week, letting most the of data in the DW untouched until it is purged. no need for heavy indexing, etc, on the relational DW side to support end user queries. Instead, focus on optimizing the ETL processes for fast load, select for the partition, and purge.
Viewing 15 posts - 1 through 15 (of 21 total)
You must be logged in to reply to this topic. Login to reply