April 7, 2009 at 1:27 pm
Goal: track current AND historical state of a dimension.
I've looked at all the different type of SCD (type1, type2, type3, type4, type6/hybrid) and none of them have the capability of easily tracking current and historical data at the same time. Storing current AND historical values in the same dimension table with a current-key pointer in each record can fix this.
When a new dimension record is initially added, create two records in the dimension table. These records will initially be identical except for the surrogate key. One record will be the current version and the other record will be the historical version. Each dimension record will need to have an attribute that contains a pointer to the current version. In the fact table we will always store the historical version of the record. When a SCD change needs to be made to the record, you need to both UPDATE (type1) the current version of the record and also split the historical record (type2). With this approach, you effectively have a type2 SCD, but each historical record has a reference to the current version and another join to the dimension table can get the current values. Note: the current-key never needs to be updated in the historical records as all versions of the record will always have the same (orginal) curent-key. In my view, this is better than updating all the historical records to point to a newly created current-key.
One downside of this approach is that you need to create two records initially to establish the current and historical records separately, but this is no different than having two separate dimensions for current vs. historical.
Another downside is the need to join the dimension table twice to get the current values from the Fact table, but this is no different than what you have to do to get current values from a type2 dimension, and it is more effecient because in the type2 case you need to join on the natural key and filter by effective date, whereas in this design you will already have the dimension key for the current record.
Another downside is that this shows a preference for historical values since they are stored in the fact table, but some people would prefer to either bias towards the current values or not have a preference.
Rationale: I've seen some people recommend creating two separate dimensions for current(type1) and historical(type2), but I think this may be overkill in some cases.
Optional optimization: you can configure you ETL so that the current records (type1) are created with negative keys, whereas the historical records (type2) have positive keys. Then a clustered index on the PK will easily be able to find the current records.
April 8, 2009 at 8:32 am
Actually, you don't need to duplicate records on the first version. You can create the current-version record at the time the first SCD split occurs. The first record would have version=0. When you need to do an SCD split, you would then duplicate the first record (with a new SK) to represent the current version. Then you would update the first record and set the version=1 and current-ref= the SK of the newly created version=0 record. Then you would create a new historical record (version=2) for the SCD split and set its current-ref= the SK of the newly created version=0 record. Then you would update the version=0 record with any updates. This way the version=0 record will always contain the same info as the maximum version record, but you can delay the creation of the version=1 record until the first SCD split occurs. The version=1 record is really an updated version=0 record (same SK) so the fact table will still reference historical dimension records. After the first SCD split, each subsequent split will just involve creating a new record with a refernce to the same version=0 record, then update the version=0 record. So it will be one insert and one update each time. No need to update all the previous versions since the version=0 SK never changes after the first split.
This is a little complicated on the ETL side, but should simplify client queries that no longer need to sort through the begin/end dates to find current records.
Does anyone see any major problems with this scheme? Do you think this is an improvement over standard Type2?
July 6, 2010 at 10:29 am
I've been using this technique for a while now and it works GREAT! Since each record has a reference to the current record it is extremely fast to the the current values from any historical record by joining again on the current PK rather than the joining on the Natural key of the historical record and then filtering by a current indicator flag. The version0 record always exists and always is current. No fact table records reference the version0 record unless it is the only version.
I don't know what to call this technique, but it is basically a combination of type1 and type2 in the same table but you get to the type1 key from the type2 record.
July 16, 2010 at 1:30 pm
KIM: Good morning, friend. How are you today?
EDGAR: Very well thanks. But what about you? If you don't mind me asking, why are you walking in that very strange and awkward way?
KIM: Ah ha! Ralph Thimble told me to. He's a clever chap. This is his Scoodle Doodle Walk Number 4086. Don't you like it?
EDGAR: Well it seems very inconvenient. What's the point?
KIM: I must admit it seems a bit strange at first. But you must understand that Mr Thimble is very well known for his silly walks. He gives them all names - and numbers too. He's made $millions out of telling people how to do them so there must be something in it.
EDGAR: Is that so? It seems like most people get on just fine without Ralph's advice. After all, people have been walking around for centuries and no-one else saw any need to change. Walking really isn't that difficult.
KIM: Well, maybe there is some room for improvement on what Ralph says. No reason why I couldn't modify the technique a little I suppose.
EDGAR: Like walking Normally? Just the Normal Form of a walk I mean.
KIM: Hmm... I don't know about that. Ralph advised against it. How about if I do this...?
EDGAR: Well that is a bit better I suppose ... but ...
KIM: Yes. That IS better! I must go tell Ralph. He could call it his Walk Number 4087. Must dash...
EDGAR: OK. Just be careful you don't trip though.
July 16, 2010 at 2:45 pm
You gotta love Monty Python's "Ministry of Silly Walks" 😀
You certainly made my day mate. Cheers.
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply