June 26, 2023 at 5:01 am
Hi there,
I've got a situation I'm trying to model and wanted to see how other folks have handled it. I've run into this before, but the data set I'm working with has a lot of user-error issues that I'm going to have to deal with.
My understanding of Kimball-method dimensional modeling is that you don't want to update fact records, which I've been able to avoid thus far. And what I'm working with has to be at the atomic grain, so periodic snapshots won't really make sense.
Basically, in the source system, users may need to correct a date field or other fields that are not additive. I'll be getting updates keyed not on order date but on the insert or update date of the source system.
I'd like to know what other people do when they run into a similar issue. For example, let's say (modeled below) a user originally entered the order date as 1/1/2020 (there are about 7 date fields in the source system that they work with, simplified to 1 here). The data gets loaded into the fact table that night, and all is good in the world.
The next day, a manager makes the employee correct the order date to 12/31/2019 because the employee failed to do the data entry on time. The measures may or may not have changed; I'm focused on the date for this example.
Would you logically delete? Delete? Add a new record to reverse it? Or something else?
I've written an extremely oversimplified example below. Forgive me for not normalizing the source data. Hopefully folks can see what I'm trying to point out.
Thanks in advance!
--load a date dimension with the 2 dates that will be in question
declare @dim_date table (dim_date_key int, cal_date datetime)
insert into @dim_date (dim_date_key, cal_date)
values
(20191231, '2019-12-31 00:00'),
(20200101, '2020-01-01 00:00')
--assume the fact staging table was loaded with source data
declare @stagetable table (order_id int, order_date datetime, amount float(2), insert_datetime datetime, update_datetime datetime)
insert into @stagetable (order_id, order_date, amount, insert_datetime, update_datetime)
values
(1, '2020-01-01', 15.35, '2020-01-01 15:30', null)
--check values of staging table
select *
from @stagetable
--set up simple fact table
declare @facttable table (dim_date_key int, amount float(2))
--insert fact records from staging that don't exist in the fact table
insert into @facttable (dim_date_key, amount)
select d.dim_date_key, s.amount
from @stagetable s
join @dim_date d on d.cal_date=s.order_date
except
select dim_date_key, amount
from @facttable
--check values of fact table
select *
from @facttable
--user fixed a date field in the source system. Consequently, there was a new version of the fact record
insert into @stagetable (order_id, order_date, amount, insert_datetime, update_datetime)
values
(1, '2019-12-31', 15.35, '2020-01-01 15:30', '2020-01-02 16:00')
--insert new records that are not already in the fact table
insert into @facttable (dim_date_key, amount)
select d.dim_date_key, s.amount
from @stagetable s
join @dim_date d on d.cal_date=s.order_date
except
select dim_date_key, amount
from @facttable
--what would you do with the original record? It's not a measure issue and not a late arriving dimension issue. The dimension was wrong.
--Do you logically delete the original? Do you throw it out then reload? Do you update it?
select *
from @facttable
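--To illustrate the "add a new record to reverse it" option I listed, this is
--roughly what I had pictured (continuing the oversimplified example; the
--corrected 12/31 row was already inserted by the step above):
--add a negating row so measures net to zero under the original (wrong) date
insert into @facttable (dim_date_key, amount)
values
(20200101, -15.35)
--a sum by date now shows 0 for 1/1/2020 and 15.35 for 12/31/2019
select dim_date_key, sum(amount) as amount
from @facttable
group by dim_date_key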
June 26, 2023 at 2:49 pm
The answer is "it depends." While Kimball's "rules" say not to update a fact table, sometimes you need to do things like that.
There are instances where entire fact tables have been rebuilt or updated, usually because of user error.
In this case, is the issue that these records should not be loaded into the fact table until they are "complete"? Is there a way to determine that?
From what you posted, this appears to be a regular occurrence. I'm leaning toward creating a new record and leaving the existing records as-is. The people who consume this data will need to know that the most recent is the only valid row.
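Air code only, but something like this view could be what the consumers query so the latest row wins (order_id and load_datetime stand in for whatever identifier and audit column your fact actually carries):
--each order's most recently loaded row wins
CREATE VIEW dbo.fact_orders_current
AS
SELECT dim_date_key, amount
FROM (
    SELECT dim_date_key, amount,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY load_datetime DESC) AS rn
    FROM dbo.fact_orders
) latest
WHERE rn = 1;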
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
June 26, 2023 at 3:13 pm
Hi Michael,
That makes sense. I've been vacillating between that method (plus adding a current_flag or valid_flag and having the BI tool access a view with WHERE valid_flag = 1) and the alternatives. These are actually vehicle work orders for public transportation. A work order can take 3-4 months depending on what it is. The director needs to see open work orders, not just closed ones, so we really can't wait until the work is complete.
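Concretely, the flag approach I've been picturing is just a thin view over the fact (valid_flag being a column I'd maintain during the load):
--sketch: the BI tool reads this view instead of the base fact table
create view dbo.v_fact_workorder
as
select dim_date_key, amount --plus the other keys and measures
from dbo.fact_workorder
where valid_flag = 1;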
I have a similar star I'm working on that has just as much fluctuation in attributes, but there the changes are actually intended, so I'm trying to settle on a method that might work for both. In the other data set (Incident Investigations), the incident classification changes throughout the life cycle of the process.
I appreciate your insights!
Thanks
June 26, 2023 at 4:02 pm
I detect facts that are not 'correct' and then update the fact in place while logging what was changed. I refer to the Kimball books and methods, but I'm probably not a purist since I'm new to dimensional modeling and come from more of a transactional background.
I tend to avoid the 'Current Flag' methodology. I've seen it cause performance problems, and I'd rather not give downstream consumers the job of filtering out bad data. Just my 2 cents.
June 26, 2023 at 4:25 pm
Thanks Chrissy,
When you say "update the fact in place while logging what was changed", can you explain what you mean? Are you updating the record with the new value?
June 26, 2023 at 4:35 pm
Yes, updating the existing record with the new value.
Air code to follow
UPDATE F
SET F.Attribute1 = S.Attribute1
OUTPUT deleted.Attribute1, inserted.Attribute1 --do some logging of old vs. new
FROM Fact F
INNER JOIN Staging S
ON ... --match on the key
WHERE ... --detect data that needs to be corrected
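A slightly fuller (still untested) shape, with the OUTPUT landing in a log table. All the table and column names here are made up:
--hypothetical audit table to capture what changed
CREATE TABLE dbo.FactChangeLog (
    fact_key      int,
    old_attribute int,
    new_attribute int,
    changed_at    datetime2 DEFAULT SYSDATETIME()
);

UPDATE F
SET F.Attribute1 = S.Attribute1
OUTPUT inserted.fact_key, deleted.Attribute1, inserted.Attribute1
    INTO dbo.FactChangeLog (fact_key, old_attribute, new_attribute)
FROM Fact F
INNER JOIN Staging S
    ON F.fact_key = S.fact_key       --match on whatever key you carry
WHERE F.Attribute1 <> S.Attribute1;  --only rows that actually changed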
June 26, 2023 at 4:43 pm
Gotcha, that's what I thought you meant, but I wanted to be sure. I'd probably do a MERGE and handle it there if I go that direction.
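Roughly this shape, reusing my simplified example from earlier (air code with real tables in place of the table variables; it assumes the fact carries order_id as a degenerate dimension so rows can be matched back):
--keep only the latest staged row per order, then upsert
MERGE facttable AS tgt
USING (SELECT order_id, dim_date_key, amount
       FROM (SELECT s.order_id, d.dim_date_key, s.amount,
                    ROW_NUMBER() OVER (PARTITION BY s.order_id
                        ORDER BY COALESCE(s.update_datetime, s.insert_datetime) DESC) AS rn
             FROM stagetable s
             JOIN dim_date d ON d.cal_date = s.order_date) x
       WHERE rn = 1) AS src
    ON tgt.order_id = src.order_id
WHEN MATCHED AND (tgt.dim_date_key <> src.dim_date_key
               OR tgt.amount <> src.amount) THEN
    UPDATE SET dim_date_key = src.dim_date_key,
               amount       = src.amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (order_id, dim_date_key, amount)
    VALUES (src.order_id, src.dim_date_key, src.amount);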
Very much appreciated! I've been playing ping pong with this all weekend 🙂
June 26, 2023 at 4:55 pm
I have scars from using MERGE in the past and have since used separate INSERT and UPDATE statements. I haven't revisited MERGE recently; maybe everything is fine now, but you might want to review this:
https://michaeljswart.com/2021/08/what-to-avoid-if-you-want-to-use-merge/
June 26, 2023 at 7:33 pm
Interesting. Yes, I saw Aaron's original article about this years ago, but this is good context. We do use MERGE a lot. Admittedly, I've never run into problems except during development, with a bone-headed mistake on my part. But I understand that doesn't mean we won't. We also don't have any columnstore indexes or partitioned tables, so perhaps that's why we haven't seen the issues.
Thanks for the advice.
June 26, 2023 at 7:44 pm
I have to agree with Kimball. Never update a fact table. Only update the source of the data that the fact table was generated from and do the ETL thing properly.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 26, 2023 at 7:57 pm
Uh-oh, that means I'm not following one of my best practices, which is 'Do what Jeff recommends'. Any additional color or reasoning on why no fact table updates, and what would be improper about doing so in the ETL?
June 27, 2023 at 12:20 am
Uh-oh, that means I'm not following one of my best practices, which is 'Do what Jeff recommends'. Any additional color or reasoning on why no fact table updates, and what would be improper about doing so in the ETL?
This isn't so much a "do what Jeff says" thing. The point is that once data is created in a data warehouse, changing it violates the basic premise of a DW. You want all of the history of the records. This is not auditing. You want to be able to produce the data for any point in time.
In the example from the OP, I would not modify these, nor would I add a “most recent” flag of any type. I would insert a new record when the data changes. This would allow you to generate the count of open orders for any point in time.
This does not seem like a good case for an update. In the event of a human error, sure, fix the data. But otherwise, I’m not sure I can think of a good use case to make this part of a design.
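As a rough illustration (the column names are placeholders; this assumes each fact row carries the order identifier, a status, and the datetime the row became effective):
--open work orders as of a given moment: take each order's latest row
--on or before @as_of, then count those still open
DECLARE @as_of datetime = '2020-06-30';

WITH latest AS (
    SELECT order_id, order_status,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY row_effective_datetime DESC) AS rn
    FROM fact_workorder
    WHERE row_effective_datetime <= @as_of
)
SELECT COUNT(*) AS open_orders
FROM latest
WHERE rn = 1
  AND order_status = 'Open';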
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
June 27, 2023 at 2:49 am
@jeff - thanks for your note. Yes, I'm aware of the tenet never to update a fact. The challenge with a lot of design theory is dealing with edge cases (or sometimes general use cases) that can't be engineered out because they're part of normal operations. But I agree; I'm very reluctant to update the fact records and haven't done so in the past. I've only considered it.
@michael-2 - Great points. The problem I really have is that there is no value in the original record once attributes have been changed. For example, if a date was entered incorrectly, or an incorrect code was added (these are heavy diesel, CNG, and battery-electric bus mechanics whose last priority is typically getting the data right), we need some grace to deal with the change. I posted on another forum and someone actually told me to use an SCD, which of course is a completely different idea. I wish there were some way to ensure the data was accurate, but until the shop foremen review the work orders (and they're sometimes not much more savvy), there's not really a way to avoid such changes to the original records. The foremen don't review the work orders for accuracy until the work order is closed. As I noted previously, a work order can be open for days, weeks, or months, and the director needs data on work orders that are still open, not just closed ones.
Michael, in your original reply, you suggested leaving the original record and bringing in the new one, but treating only the newest one as the valid row. If the fact table has an old row that is no longer valid and a new row that is, distinguishable only by the datetime stamp, how would you suggest the fact table be queried (through BI tools) by the business user?
Consider a simplified query:
select a.dimension1, b.dimension2, sum(f.measure)
from fact f
join dimension1 a on a.dimension1_key=f.dimension1_key
join dimension2 b on b.dimension2_key=f.dimension2_key
--What do you see being the predicate here to only bring in the latest row? A version id on the date in question?
group by a.dimension1, b.dimension2
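One option I've toyed with, for what it's worth (version_id and order_id would be columns I'd have to add to the fact):
--keep only each order's highest version
select a.dimension1, b.dimension2, sum(f.measure)
from fact f
join dimension1 a on a.dimension1_key=f.dimension1_key
join dimension2 b on b.dimension2_key=f.dimension2_key
where f.version_id = (select max(f2.version_id)
                      from fact f2
                      where f2.order_id = f.order_id)
group by a.dimension1, b.dimension2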
June 27, 2023 at 3:35 pm
I probably should have stated my use case. It's not a traditional data warehouse but rather a tabular model on which Power BI reports are built for analysis/business intelligence. My analysts don't want to see historical attributes, only the current ones. Perhaps one day they will require point-in-time capabilities, but not currently.
June 27, 2023 at 5:31 pm
I probably should have stated my use case. It's not a traditional data warehouse but rather a tabular model on which Power BI reports are built for analysis/business intelligence. My analysts don't want to see historical attributes, only the current ones. Perhaps one day they will require point-in-time capabilities, but not currently.
Just to clarify for me, please... Does this mean that the "current" data lives nowhere else? That it lives only in your "tabular model" and not in some table in a database somewhere else as the original source of the "current" data?
--Jeff Moden
Change is inevitable... Change for the better is not.