December 1, 2016 at 7:56 am
I am refreshing a set of warehouse tables nightly. The refresh is designed so that it is considered complete only when all of the tables have been refreshed successfully. This allows the subsequent refresh run to recover the changes a failed refresh missed as well as pick up that night's changes.
So here is the case: the refresh is successful Monday night but fails on Tuesday night, so on Wednesday night the refresh has to grab not only Tuesday's changes but Wednesday's as well. This means that some of the records selected on Wednesday may already have been applied to the warehouse before Tuesday's run failed.
Throwing some numbers around for perspective: say a nightly refresh of table A averages 100,000 records. That means on Wednesday night there might be 200,000 records (Tuesday's plus Wednesday's). Depending on whether the refresh failed during this particular table or on another table later in the process, that means anywhere from 0 to 100,000 potentially redundant upserts for table A.
What is going to cost me more nanoseconds per refresh, filtering out the redundant records, or just re-upserting records with the same data they contain? This is especially concerning since, as the business grows, the number of nightly records will grow also.
What's your perspective on this? Or do you have a favorite book reference about this type of concern you can share?
thanks,
Luther
December 1, 2016 at 11:09 pm
Redesign the refresh to finish every night, no matter what.
Instead of blast-refreshing all of the rows, do it in smaller, fully-COMMITted batches, walking each source table in some key order. Set the batch size so each batch completes in about a minute, or some other reasonable value. 'Reasonable' means long enough per batch that you're not hammering the log with tiny transactions, but not so long that a rollback eats too much of your refresh window.
Then set up some simple logging that lets the process pick up where it left off: which table it was working on at the failure, and which batch (key range) it last successfully completed. Include the warehouse table update and the logging table update in the same transaction, so if one fails, so does the other.
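Here's a rough sketch of that loop in T-SQL. The RefreshLog table, the dbo.SourceA / dbo.WarehouseA names, the SourceKey column, and the 5000-row batch size are all placeholders; swap in whatever fits your schema:

    -- One table's batched refresh: walk the source in key order, one committed batch at a time.
    DECLARE @LastKey   bigint = 0,
            @NextKey   bigint,
            @BatchSize int    = 5000,
            @Tonight   date   = CAST(GETDATE() AS date);

    -- pick up where the last committed batch left off
    SELECT @LastKey = LastKeyDone
    FROM dbo.RefreshLog
    WHERE RunDate = @Tonight
      AND TableName = N'WarehouseA';

    WHILE 1 = 1
    BEGIN
        -- upper key of the next batch, walking the source in key order
        SELECT @NextKey = MAX(k.SourceKey)
        FROM (SELECT TOP (@BatchSize) SourceKey
              FROM dbo.SourceA
              WHERE SourceKey > @LastKey
              ORDER BY SourceKey) AS k;

        IF @NextKey IS NULL BREAK;   -- nothing left for this table

        BEGIN TRANSACTION;

            MERGE dbo.WarehouseA AS tgt
            USING (SELECT SourceKey, SomeColumn
                   FROM dbo.SourceA
                   WHERE SourceKey > @LastKey
                     AND SourceKey <= @NextKey) AS src
               ON tgt.SourceKey = src.SourceKey
            WHEN MATCHED THEN
                UPDATE SET SomeColumn = src.SomeColumn
            WHEN NOT MATCHED THEN
                INSERT (SourceKey, SomeColumn)
                VALUES (src.SourceKey, src.SomeColumn);

            -- the progress marker commits or rolls back with the data change
            UPDATE dbo.RefreshLog
            SET LastKeyDone = @NextKey
            WHERE RunDate = @Tonight
              AND TableName = N'WarehouseA';

        COMMIT TRANSACTION;

        SET @LastKey = @NextKey;
    END;

    -- mark this table done so the restart logic knows to skip it
    UPDATE dbo.RefreshLog
    SET CompletedTime = GETDATE()
    WHERE RunDate = @Tonight
      AND TableName = N'WarehouseA';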
In this setup, the worst-case scenario of re-merging already-applied data is eliminated. Each batch either commits or rolls back; on restart, the process picks up at the batch that failed and rolled back (that's the part you have to write). You gain a lot of flexibility once you get this pattern going. You can even kill a running refresh and restart it later.
If this is executed as an Agent job, set the job to start every few minutes during the window you want the refresh to run. A scheduled start won't spin up a second instance while an earlier run is still executing, but it will restart a failed or killed one. Have the first step of the job check whether there's any work left to do that night, with logic to go back to zero and start over when the previous night's run completed and this is the start of a new night's run.
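The first job step can be a quick check against the same logging table, something along these lines (dbo.RefreshLog and dbo.RefreshTableList are again made-up names):

    -- Agent job step 1 (sketch): start a new night's run if one hasn't started yet,
    -- otherwise stop quietly when tonight's run is already complete.
    DECLARE @Tonight date = CAST(GETDATE() AS date);

    IF NOT EXISTS (SELECT 1 FROM dbo.RefreshLog WHERE RunDate = @Tonight)
    BEGIN
        -- beginning of a new night's run: reset the progress markers to zero
        INSERT dbo.RefreshLog (RunDate, TableName, LastKeyDone, CompletedTime)
        SELECT @Tonight, TableName, 0, NULL
        FROM dbo.RefreshTableList;          -- one row per table to refresh
    END;

    IF NOT EXISTS (SELECT 1
                   FROM dbo.RefreshLog
                   WHERE RunDate = @Tonight
                     AND CompletedTime IS NULL)
    BEGIN
        PRINT 'Tonight''s refresh is already complete; nothing to do.';
        RETURN;   -- later job steps only run when there is still work to do
    END;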
-Eddie
Eddie Wuerch
MCM: SQL
December 2, 2016 at 6:47 am
Thanks! It's an idea, and there is already some room to do something similar to what you described in order to avoid re-running a successful refresh in the event of a failure and restart.
Something to explore in more detail, at any rate.
December 2, 2016 at 8:37 am
latkinson (12/1/2016)
I am refreshing a set of warehouse tables nightly. The refresh is designed so that it is considered complete only when all of the tables have been refreshed successfully. This allows the subsequent refresh run to recover the changes a failed refresh missed as well as pick up that night's changes....
Throwing some numbers around for perspective: say a nightly refresh of table A averages 100,000 records. That means on Wednesday night there might be 200,000 records (Tuesday's plus Wednesday's). Depending on whether the refresh failed during this particular table or on another table later in the process, that means anywhere from 0 to 100,000 potentially redundant upserts for table A.
What is going to cost me more nanoseconds per refresh, filtering out the redundant records, or just re-upserting records with the same data they contain?
What method are you using to determine what rows changed from your source? That will probably play a large part in determining the most efficient way to deal with this.
December 2, 2016 at 8:57 am
To answer the question: there is a changelog table associated with each source table, holding the changedate and a calculated modification number. The initial extract is all records in the changelog since the last successful refresh, and then a quick filter on the modification number throws out records that don't require updating. I'd like to get rid of the modification number and just filter on the changedate, but since I inherited the process, I'm moving slowly with any modifications.
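For what it's worth, the changedate-only extract I'd like to move toward would look roughly like this (the table and column names here are stand-ins for the real ones):

    -- Pull everything that changed since the last successful refresh, filtering on changedate only.
    -- dbo.ChangeLog_A, dbo.SourceA, dbo.RefreshControl, and the column names are illustrative.
    DECLARE @LastGoodRefresh datetime;

    SELECT @LastGoodRefresh = LastSuccessfulRefresh
    FROM dbo.RefreshControl
    WHERE TableName = N'TableA';

    SELECT s.*
    FROM dbo.SourceA AS s
    WHERE EXISTS (SELECT 1
                  FROM dbo.ChangeLog_A AS c
                  WHERE c.SourceKey = s.SourceKey
                    AND c.ChangeDate > @LastGoodRefresh);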
This all started because some tables are moving to another server, which also changes the modification numbers in the changelogs (the changelogs did not move; they were recreated on the new server). That's causing some heartburn while I decide whether I need to refresh the new changelog tables completely from the previous server to sync them up (the old server is going away), or just sync the last X months and go from there.
Anyway, since I originally posted I've done some analysis of the times involved in the extracts, transforms, and delivery of new or updated data, and quite frankly the extract phase, where the changelog information comes into play, is not my worst nightmare. That lives further into the process, where some subqueries of subqueries lurk.
December 27, 2016 at 9:59 am
I use a column check to clear the records before going forward. It does mean that I have to know the things that can go wrong, and occasionally there is something new. How it works: once the data is in the landing tables, which have the loosest possible constraints (e.g., no primary keys) so there are no issues landing the data, the data is checked for all the things that can go wrong downstream. Rows that don't pass the checks are not flagged to go forward, and there is feedback to report which rows didn't go forward and why.
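In outline, the check step looks something like this (the table, columns, and specific rules are invented for the example):

    -- Flag landing rows that would break something downstream; only unflagged rows move forward.
    UPDATE l
    SET GoForward = 0,
        RejectReason = CASE
                           WHEN l.CustomerId IS NULL THEN 'Missing CustomerId'
                           WHEN d.CustomerId IS NULL THEN 'Unknown CustomerId'
                           WHEN l.Amount < 0 THEN 'Negative amount'
                       END
    FROM dbo.Landing_A AS l
    LEFT JOIN dbo.DimCustomer AS d
           ON d.CustomerId = l.CustomerId
    WHERE l.CustomerId IS NULL
       OR d.CustomerId IS NULL
       OR l.Amount < 0;

    -- Feedback on what did not go forward and why
    SELECT RejectReason, COUNT(*) AS RowsRejected
    FROM dbo.Landing_A
    WHERE GoForward = 0
    GROUP BY RejectReason;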
January 3, 2017 at 5:00 pm
Hi Luther,
Some good replies on here so far. I'm curious what causes the load to fail in the first place. Is it worth exploring it from that angle, or is the failure something out of your control?
Tim
Tim Mitchell
TimMitchell.net | @tmitch.net | Tyleris.com
ETL Best Practices
January 4, 2017 at 6:58 am
Hi Tim,
The failures are a bit out of my control. It's a third-party Upsert Data Flow task. I suspect that every now and then it hiccups, forgets to move on to the next record in its rowset, and tries to re-insert the same record.
I've left a message on the vendor's forum, but have not had a chance with holidays, etc. to go back and look for any response.
I'm thinking strongly about modifying the process to use a landing table and then a T-SQL MERGE statement to perform the upsert to the target table in the control flow rather than the data flow (much faster performance). I'm just not sure I can sell the effort to the boss, as it is not a frequent occurrence.
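The MERGE I have in mind would be along these lines, with dbo.Landing_A, dbo.TableA, and the column names standing in for the real ones:

    -- Set-based upsert from the landing table into the target, run once in the control flow.
    -- The extra predicate on WHEN MATCHED also skips rows whose values have not actually changed
    -- (nullable columns would need ISNULL/NULLIF handling).
    MERGE dbo.TableA AS tgt
    USING dbo.Landing_A AS src
       ON tgt.BusinessKey = src.BusinessKey
    WHEN MATCHED AND (tgt.Col1 <> src.Col1 OR tgt.Col2 <> src.Col2) THEN
        UPDATE SET Col1 = src.Col1,
                   Col2 = src.Col2,
                   UpdatedDate = GETDATE()
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, Col1, Col2, UpdatedDate)
        VALUES (src.BusinessKey, src.Col1, src.Col2, GETDATE());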
However, I have modified the process to look at the changedates in the log tables rather than the fictional modification number, and overall, the speed has not increased significantly.
January 5, 2017 at 2:31 pm
SSIS has a "checkpoint" feature that allows a package to restart at the task that failed. That doesn't help with actual row failures within a data flow task, however.
But I seem to recall you can combine this with some sort of raw file destination so that, if there is a failure there, the transfer pipeline can pick up where it left off. I will look into this.
Though in theory I prefer a process that redoes every step in the lineage even if it is redundant, since that is the safest approach (reload Tuesday's data, then reload Wednesday's data, and so on).