November 8, 2011 at 4:16 am
Hi,
In our DW we have a fact table with many millions of rows.
Every 20 minutes or so we want to copy this data to our DM. The old data must be replaced or updated. No conversion is needed, just a copy of the whole table.
Now what would be the most efficient way to do this?
Currently we copy the whole table, just dumping it into the DM table and setting an int attribute on the new rows to -1.
All previous data in our DM has the same attribute set to 1.
We then update the entire table, changing the sign on every record.
And then we delete all the -1 rows.
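In rough T-SQL, the current process looks something like this (the table and column names below are placeholders, not our real ones):

-- Load the new copy of the DW data, flagged -1; existing DM rows carry flag 1.
INSERT INTO dbo.DM_Fact (Col1, Col2, LoadFlag)
SELECT Col1, Col2, -1
FROM DW.dbo.Fact;

-- Flip the sign on every row: new rows become 1, old rows become -1.
UPDATE dbo.DM_Fact SET LoadFlag = LoadFlag * -1;

-- Remove the previous generation.
DELETE FROM dbo.DM_Fact WHERE LoadFlag = -1;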
This is not working at the speeds we want.
Any suggestions?
November 8, 2011 at 4:53 am
A few ideas:
You could consider creating a view with the same name and schema as your current DM table has, that just passes through the data from the DM table (which would now have a different name). Now when it's time to do a fresh DM load, you can build a new empty table, bulk insert into it, add your indexes and so forth, and then issue an ALTER VIEW to point the view at the new table. Then, either drop the old table or truncate it and use it for your next bulk insert, at which point you will point the view back to it. This gives you the speed of bulk loading into an empty table with no constraints enabled, and you don't have the overhead of updating your flag column and deleting old rows. It also results in no downtime for your end users (the ALTER VIEW should be almost instantaneous), and because the view has the same name that the table used to have, it should be transparent to any reports or querying tools that you already have in place.
More simply, two full DM refresh cycles might look something like this:
* Starting condition: View DMFacts selects data from DMFacts_1
* Bulk insert latest fact data into DMFacts_2
* Build indexes on DMFacts_2
* ALTER VIEW DMFacts to select from DMFacts_2
* ...20 minutes passes...
* Drop indexes on DMFacts_1
* Truncate DMFacts_1
* Bulk insert latest fact data into DMFacts_1
* Build indexes on DMFacts_1
* ALTER VIEW DMFacts to select from DMFacts_1
* Rinse and repeat
Of course, the specific details would depend on your implementation, and you'd want to test some variations on the theme to see what works best for you.
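As a minimal sketch of the view swap itself (the view, table, and column names here are just placeholders):

-- One-time setup: the view that end users and reports query.
CREATE VIEW dbo.DMFacts
AS
SELECT FactKey, DateKey, Amount
FROM dbo.DMFacts_1;
GO

-- At the end of a load cycle, repoint the view at the freshly loaded table.
-- The ALTER is nearly instantaneous, so users see no downtime.
ALTER VIEW dbo.DMFacts
AS
SELECT FactKey, DateKey, Amount
FROM dbo.DMFacts_2;
GO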
November 8, 2011 at 4:58 am
I do have to ask, though, just how much data changes every 20 minutes. Is it enough to warrant doing an entire fresh load each time? You may want to consider doing something with partitioning instead. For example, if your fact data includes a date, and if new facts never get generated for dates older than the current month, you could partition by month and then only reload the current month's worth of data each time; if that's a small fraction of the total number of fact table rows, it could be a much faster approach. Of course, this would be a bit more complex than the view solution I posted previously; you'd need to create logic to automatically create new partitions and so forth.
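A rough sketch of the monthly partitioning setup, with made-up names and boundary dates (real code would also need to add new boundaries automatically as months roll over):

-- Monthly partition function and scheme.
CREATE PARTITION FUNCTION pfFactMonth (date)
AS RANGE RIGHT FOR VALUES ('2011-09-01', '2011-10-01', '2011-11-01');

CREATE PARTITION SCHEME psFactMonth
AS PARTITION pfFactMonth ALL TO ([PRIMARY]);

-- With the fact table and a structurally identical staging table both built on
-- psFactMonth, a refresh of the current month could then be:
--   1. Bulk load the current month's rows into the staging table.
--   2. Switch the stale month out of the fact table into an empty holding table:
--      ALTER TABLE dbo.DMFacts SWITCH PARTITION 4 TO dbo.DMFacts_Old PARTITION 4;
--   3. Switch the freshly loaded partition in:
--      ALTER TABLE dbo.DMFacts_Stage SWITCH PARTITION 4 TO dbo.DMFacts PARTITION 4;
--   4. Truncate DMFacts_Old ready for the next cycle.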
November 8, 2011 at 5:24 am
Thanks very much for your input.
There are probably not many changes between reloads, but in theory there could be, even though in reality there will only be a few updates. Also, updates could happen anywhere in the data, so I can't really use your date-based alternative.
OK. So as I see it, I have the bulk copy option into a new table, as you suggested.
Or possibly some kind of conditional update/insert/delete, as my boss just suggested:
For the last option I was thinking about creating an SSIS package to do the following:
Does the row exist? -> (NO) -> Add the row
Does the row exist? -> (YES) -> Has it changed? -> (YES) -> Update the row
Does the row exist? -> (YES) -> Has it changed? -> (NO) -> Ignore it
And then delete all DM rows that no longer exist in the DW.
He believed this last option would be faster than the sign-switching method we currently use, since much of the data stays the same from one load to the next. But to me it looks like the bulk copy option would be faster than this.
What do you think?
November 8, 2011 at 5:32 am
erikpoker (11/8/2011)
For the last option I was thinking about creating an SSIS package to do the following:
Does the row exist? -> (NO) -> Add the row
Does the row exist? -> (YES) -> Has it changed? -> (YES) -> Update the row
Does the row exist? -> (YES) -> Has it changed? -> (NO) -> Ignore it
And then delete all DM rows that no longer exist in the DW.
What you've described is a classic "upsert" scenario, and SSIS is quite capable of it (even comes with a "slowly changing dimension" transformation, although that can be a bit of a hassle to work with). However, when dealing with a lot of data, doing it a row at a time like that is going to be really slow. You'd probably be better off bulk inserting all rows that don't exist yet in the DM, then loading the rest into a staging table in the same database and issuing a single UPDATE statement to update the DM from the staging table; two set based operations instead of a ton of row-based ones. You can also just load all the data into the staging table and then issue a MERGE statement: http://technet.microsoft.com/en-us/library/bb510625.aspx
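As a rough sketch of that set-based approach, assuming the incoming DW rows have already been loaded into a staging table (all table and column names here are placeholders):

-- 1) Insert rows that don't exist in the DM yet.
INSERT INTO dbo.DM_Fact (FactKey, Amount, Quantity)
SELECT s.FactKey, s.Amount, s.Quantity
FROM dbo.Fact_Stage AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.DM_Fact AS d WHERE d.FactKey = s.FactKey);

-- 2) Update existing rows from the staging table in a single set-based statement.
UPDATE d
SET d.Amount = s.Amount,
    d.Quantity = s.Quantity
FROM dbo.DM_Fact AS d
JOIN dbo.Fact_Stage AS s
    ON s.FactKey = d.FactKey
WHERE d.Amount <> s.Amount
   OR d.Quantity <> s.Quantity;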
November 8, 2011 at 5:46 am
erikpoker (11/8/2011)
Any suggestions?
Any particular reason not to use the actual FACT table as the center point of your datamart? After all, the related DIM tables are related to it, aren't they?
November 8, 2011 at 5:56 am
PaulB-TheOneAndOnly (11/8/2011)
erikpoker (11/8/2011)
Any suggestions?
Any particular reason not to use the actual FACT table as the center point of your datamart? After all, the related DIM tables are related to it, aren't they?
I was assuming they're using an Inmon-style relational DW to feed dimensional datamarts... If not, though, I agree; I'm not sure what the need to spin off separate tables would be. Although, if the DW facts are actually changing frequently enough that the DM is being reloaded every 20 minutes, there could be some good reasons not to let reporting and analysis tools run directly against a fact table that is apparently very far from being static.
November 8, 2011 at 6:05 am
JonFox (11/8/2011)
PaulB-TheOneAndOnly (11/8/2011)
erikpoker (11/8/2011)
Any suggestions?
Any particular reason not to use the actual FACT table as the center point of your datamart? After all, the related DIM tables are related to it, aren't they?
I was assuming they're using an Inmon-style relational DW to feed dimensional datamarts... If not, though, I agree; I'm not sure what the need to spin off separate tables would be. Although, if the DW facts are actually changing frequently enough that the DM is being reloaded every 20 minutes, there could be some good reasons not to let reporting and analysis tools run directly against a fact table that is apparently very far from being static.
Very good questions indeed.
Well, to tell you the truth, I'm a SQL developer being retrained as a DW guy. As this is my first week at my new job, I don't have all the details yet. But I'll keep your question in the back of my mind until I can wrap my head around this stuff.
For now I will just focus my brainpower on testing different approaches for updating our DM facts. 🙂
November 8, 2011 at 6:23 am
OK. This was a bit more complex than I first realized.
The job that will update our DM from the DW is not a full update.
It will select only a subset of the DW fact table, and only that subset needs to be replaced/updated with the new data.
So creating a new table and dropping the old one is no longer an option.
Maybe creating a new table with the new data and merging it with the corresponding subset of the DM table is the way to go?
November 8, 2011 at 6:27 am
erikpoker (11/8/2011)
OK. This was a bit more complex than I first realized.
The job that will update our DM from the DW is not a full update.
It will select only a subset of the DW fact table, and only that subset needs to be replaced/updated with the new data.
So creating a new table and dropping the old one is no longer an option.
Maybe creating a new table with the new data and merging it with the corresponding subset of the DM table is the way to go?
Approximately how many rows will be arriving from the DW with each refresh? Is it safe to assume that minimal downtime for the end-users during reloads is the number one driving factor for the method you choose? Or are there other more important considerations?
November 8, 2011 at 6:39 am
JonFox (11/8/2011)
Approximately how many rows will be arriving from the DW with each refresh? Is it safe to assume that minimal downtime for the end-users during reloads is the number one driving factor for the method you choose? Or are there other more important considerations?
Yes, that is absolutely correct.
Rows will vary, but it could be anywhere from zero to millions of rows. An average of about half a million to a million would be a safe guess.
November 8, 2011 at 6:48 am
erikpoker (11/8/2011)
JonFox (11/8/2011)
Approximately how many rows will be arriving from the DW with each refresh? Is it safe to assume that minimal downtime for the end-users during reloads is the number one driving factor for the method you choose? Or are there other more important considerations?
Yes, that is absolutely correct.
Rows will vary, but it could be anywhere from zero to millions of rows. An average of about half a million to a million would be a safe guess.
In that case, I suspect the best option will involve some sort of sleight-of-hand along the lines of the view trick I mentioned earlier, or rebuilding a "hot" partition or similar. Even so, just staging the incoming rows to a separate table and issuing a single MERGE statement to upsert the live fact table might still provide better performance than what you're getting right now. The best thing is probably to put together a few different scenarios in your test environment and play with them to get a sense of what solution will hit your sweet spot in terms of acceptable performance and minimal downtime...
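For the staging-plus-MERGE variant, a sketch along these lines might be a starting point (the names, columns, and the SubsetId column marking which slice a row belongs to are all hypothetical):

DECLARE @SubsetId int = 1;  -- hypothetical identifier for the slice being refreshed

MERGE dbo.DM_Fact AS tgt
USING dbo.Fact_Stage AS src
    ON tgt.FactKey = src.FactKey
WHEN MATCHED AND (tgt.Amount <> src.Amount OR tgt.Quantity <> src.Quantity)
    THEN UPDATE SET tgt.Amount = src.Amount, tgt.Quantity = src.Quantity
WHEN NOT MATCHED BY TARGET
    THEN INSERT (FactKey, SubsetId, Amount, Quantity)
         VALUES (src.FactKey, src.SubsetId, src.Amount, src.Quantity)
WHEN NOT MATCHED BY SOURCE AND tgt.SubsetId = @SubsetId
    THEN DELETE;  -- only rows inside the refreshed subset are removed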
November 8, 2011 at 6:49 am
erikpoker (11/8/2011)
OK. This was a bit more complex than I first realized.
The job that will update our DM from the DW is not a full update.
It will select only a subset of the DW fact table, and only that subset needs to be replaced/updated with the new data.
So creating a new table and dropping the old one is no longer an option.
Maybe creating a new table with the new data and merging it with the corresponding subset of the DM table is the way to go?
Well... if we replace the word FACT with STAGING and DM-FACT with FACT, then we have a classic ETL process moving data from a staging area to a datamart.
This should follow the rules of ETL, meaning: extract data from the "source" table; apparently there is no "Transformation", just a simple merge; then load the selected source rows into the target table.
Any chance of leveraging an index to avoid a full table scan on the source FACT table?
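For example (a sketch only, assuming a hypothetical ModifiedDate column on the DW fact table that records when a row last changed):

-- Lets the extract seek on recently changed rows instead of scanning the whole table.
CREATE NONCLUSTERED INDEX IX_Fact_ModifiedDate
ON dbo.Fact (ModifiedDate)
INCLUDE (FactKey, Amount, Quantity);

SELECT FactKey, Amount, Quantity
FROM dbo.Fact
WHERE ModifiedDate >= DATEADD(MINUTE, -20, SYSDATETIME());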