February 5, 2016 at 3:27 am
Is it normal to have differing row counts between a staging table and the fact table?
I have a situation where I have fewer rows being loaded to the fact table than there are in the staging table. This doesn't seem right. Logically to me, any filtering of what is to go into the fact table should be done at the staging level so when the load takes place it's just a question of inserting the data.
How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537
February 5, 2016 at 3:33 am
Without knowing more about your loading process, it's hard to really comment.
Generally, a staging table is where possibly dirty data is loaded; after that it is cleansed and the clean data goes into the final destination - in your case a fact table.
Perhaps some rows in the staging table have NULL in columns that are mandatory, so they are skipped.
Perhaps some rows in the staging table have duplicates in key columns, so they are aggregated or one of the rows is selected based on some criteria.
Perhaps some rows in the staging table are filtered because they are irrelevant in the fact table.
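All three of the reasons above can be illustrated in one small sketch. This uses SQLite (via Python's standard library) purely for illustration; the table and column names are invented, not taken from the original poster's system.

```python
# Hypothetical example: 5 staging rows shrink to 2 fact rows because of
# a NULL key, a duplicate key, and an irrelevant-row filter.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE staging (order_id INTEGER, amount REAL, region TEXT);
INSERT INTO staging VALUES
    (1, 10.0, 'EU'),
    (1, 10.0, 'EU'),      -- duplicate key: collapsed to one row
    (NULL, 5.0, 'EU'),    -- NULL in a mandatory column: skipped
    (2, 7.5, 'XX'),       -- irrelevant region: filtered out
    (3, 2.5, 'US');
CREATE TABLE fact (order_id INTEGER PRIMARY KEY, amount REAL, region TEXT);

INSERT INTO fact
SELECT order_id, MAX(amount), MAX(region)
FROM staging
WHERE order_id IS NOT NULL       -- skip rows with a NULL mandatory column
  AND region IN ('EU', 'US')     -- filter rows irrelevant to the fact table
GROUP BY order_id;               -- collapse duplicate keys to one row
""")

staged = con.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
loaded = con.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
print(staged, loaded)  # 5 2
```

Any one of those three clauses on its own is enough to make the fact table's row count come out lower than the staging table's.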
BWFC (2/5/2016)
Logically to me, any filtering of what is to go into the fact table should be done at the staging level so when the load takes place it's just a question of inserting the data.
Again, it depends on the loading process. It is possible that the staging table is populated by a straight import from an external source (perhaps a csv or xml file). Those load processes typically do no filtering. It is then possible to actually delete unwanted rows from the staging table, but it is also possible to simply exclude them in the where clause of the statement that loads the fact table from the staging table.
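The two options described above can be sketched like this, again using SQLite for illustration with invented table and column names:

```python
# Hypothetical example: a straight, unfiltered import into staging, after
# which unwanted rows are handled in one of two ways.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE staging (order_id INTEGER, region TEXT);
-- Straight import: every source row lands in staging, no filtering at all.
INSERT INTO staging VALUES (1, 'EU'), (2, 'XX'), (3, 'US');
CREATE TABLE fact (order_id INTEGER, region TEXT);
""")

# Option A: delete the unwanted rows from staging first, then load everything.
# con.execute("DELETE FROM staging WHERE region = 'XX'")
# con.execute("INSERT INTO fact SELECT order_id, region FROM staging")

# Option B: leave staging intact and exclude the unwanted rows in the
# WHERE clause of the fact load.
con.execute(
    "INSERT INTO fact SELECT order_id, region FROM staging WHERE region <> 'XX'"
)

staged = con.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
loaded = con.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
print(staged, loaded)  # 3 2
```

With option B the staging and fact row counts differ by design, which is exactly the kind of mismatch the original question describes.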
February 5, 2016 at 3:52 am
Thanks Hugo, I appreciate I didn't provide a lot of detail, but at this stage I only wanted to find out whether it was always a problem. I'm going to investigate further; I just wanted to get a handle on how serious it was first. There's nothing in the fact load that filters the data any further, so I think it's one of the first two things on your list. As a shot in the dark, I suspect there's a problem with duplicates somewhere, but for now that's just a guess.
February 5, 2016 at 6:53 am
Instead of guessing, I recommend you open the queries and stored procedures (or packages if you use SSIS) used in the ETL process. Looking at the code allows you to know for sure what's going on.
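As a starting point for that investigation, one useful diagnostic is to list the staging rows whose keys never reached the fact table. This is only a sketch (the table and key names are assumptions), shown here with SQLite:

```python
# Hypothetical example: find staging keys that are missing from the fact
# table using EXCEPT (an anti-join would work just as well).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE staging (order_id INTEGER);
CREATE TABLE fact (order_id INTEGER);
INSERT INTO staging VALUES (1), (2), (3);
INSERT INTO fact VALUES (1), (3);
""")

missing = con.execute("""
    SELECT order_id FROM staging
    EXCEPT
    SELECT order_id FROM fact
""").fetchall()
print(missing)  # [(2,)]
```

Once you can see exactly which keys went missing, inspecting those source rows usually points straight at the NULL, duplicate, or filter responsible.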
February 5, 2016 at 7:44 am
That was just a back-of-a-fag-packet guess, made as I typed. I'm in the process of breaking things open to have a look now, and I've already spotted one reason why I've lost some rows. There's still more work to do, though.