May 15, 2012 at 1:41 am
We've come to add a new data source to our warehouse, which presents us with a challenge.
The transactional system the data are coming from can't provide us an indicator of which data has been added, updated or deleted. I appreciate this is fairly common but my concern is that the data will run to millions of rows, which feels like a lot to process each day.
The first thing we are required to do is stage the data before modeling it into the data warehouse. We also need to keep as light a touch as possible on the source system. Luckily, the source system maintains PKs on all tables.
My thoughts so far are:
1. Truncate the staging tables and reload all data each night.
2. Generate a hash value to compare against the one previously generated, replacing each record where it's different.
Both of these methods feel like they require processing an awful lot of data.
I'd like to hear people's own experiences of this, as I'm sure it's quite common.
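For concreteness, here is roughly what I'm picturing for option 2. Table and column names are hypothetical, and it assumes SQL Server with the hash already computed into a RowHash column when staging is loaded:

-- Hypothetical names throughout (stg.Customer, dw.Customer, RowHash).
-- Staging is truncated and fully reloaded each night, then compared to the
-- warehouse copy on the primary key using the stored hash.

-- Update rows whose hash has changed.
UPDATE dw
SET    dw.ColA    = stg.ColA,
       dw.ColB    = stg.ColB,
       dw.RowHash = stg.RowHash
FROM   dw.Customer  AS dw
JOIN   stg.Customer AS stg ON stg.CustomerID = dw.CustomerID
WHERE  stg.RowHash <> dw.RowHash;

-- Insert rows that are new to the warehouse.
INSERT INTO dw.Customer (CustomerID, ColA, ColB, RowHash)
SELECT stg.CustomerID, stg.ColA, stg.ColB, stg.RowHash
FROM   stg.Customer AS stg
WHERE  NOT EXISTS (SELECT 1
                   FROM   dw.Customer AS dw
                   WHERE  dw.CustomerID = stg.CustomerID);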
MartyRocco
May 15, 2012 at 7:48 am
You are right, there is no light way to really address this problem. Upgrading to SQL Server 2008 R2 and getting the Change Data Capture functionality would probably be the best incremental update solution.
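For reference, enabling CDC looks roughly like this. The database and table names are placeholders, and it assumes SQL Server 2008 or later (Enterprise/Developer edition at that time) with SQL Server Agent running:

-- Placeholder names; assumes the source tables have primary keys (they do here),
-- which is what allows net-changes queries.
USE SourceDB;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema        = N'dbo',
     @source_name          = N'Orders',
     @role_name            = NULL,
     @supports_net_changes = 1;

-- Extraction window; in practice you persist the last LSN you processed
-- instead of starting from the minimum every time.
DECLARE @from binary(10), @to binary(10);
SET @from = sys.fn_cdc_get_min_lsn('dbo_Orders');
SET @to   = sys.fn_cdc_get_max_lsn();

SELECT *
FROM   cdc.fn_cdc_get_net_changes_dbo_Orders(@from, @to, N'all');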
May 15, 2012 at 2:34 pm
We're a small company, and we don't do incremental loads; the prototype for that was waaaaaay too long-running to be a daily deal (8-10 hours on a good day). We process millions of rows daily, and that data is gathered from over a dozen servers, only one of which is remote.
Once records are written to staging in a package, we have at least one validation step to ensure the pkg doesn't blow away good prod data with a bad run. We look at record counts, fields changed between prod and staging and whatever else is appropriate for those facts.
We generally have a stored procedure that contains a validation flag so it is easy to turn on and off without modifying the package or changing a package variable in the config. While validation will always be conducted, it is our choice whether or not to allow it to fail the package.
I create emails in the packages to provide validation results and those are sent regardless of pass/fail, validation enabled or not.
If validation is enabled to fail the package, prod is untouched and staging is there for review. After review of staging + validation results maybe we flip staging to prod anyway (manually, like I did this morning after an inconsequential issue in validation results). But at least we were notified that there could be issues in the staging data.
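To give a feel for the shape of it (object names invented for illustration), the validation procedure is basically a set of checks plus a flag deciding whether a failed check fails the package:

-- Sketch only; table and procedure names are invented for illustration.
CREATE PROCEDURE etl.ValidateStagingLoad
    @FailOnError bit = 1   -- set to 0 to report issues without failing the package
AS
BEGIN
    DECLARE @StgCount int, @PrdCount int, @Msg nvarchar(400);

    SELECT @StgCount = COUNT(*) FROM stg.FactSales;
    SELECT @PrdCount = COUNT(*) FROM dw.FactSales;

    -- Example rule: refuse a staging load that shrank by more than 10%.
    IF @StgCount < @PrdCount * 0.9
    BEGIN
        SET @Msg = 'Staging count ' + CAST(@StgCount AS varchar(20))
                 + ' is more than 10% below prod count '
                 + CAST(@PrdCount AS varchar(20)) + '.';

        IF @FailOnError = 1
            RAISERROR(@Msg, 16, 1);  -- severity 16 fails the calling SSIS task
        ELSE
            PRINT @Msg;              -- results still get picked up for the email
    END
END;

In this sketch the package would call it from an Execute SQL Task after the staging load and before the flip to prod.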
May 15, 2012 at 3:02 pm
MartyRocco (5/15/2012)
The transactional system the data are coming from can't provide us an indicator of which data has been added, updated or deleted. I appreciate this is fairly common but my concern is that the data will run to millions of rows, which feels like a lot to process each day.
It is, and that's a huge deal. Even just reading all the records off the source system will be a 'heavy touch'.
The first thing we are required to do is stage the data before modeling it into the data warehouse. We also need to keep as light a touch as possible on the source system. Luckily, the source system maintains PKs on all tables.
If I might recommend, use a rowversion/timestamp column (a binary value) on the source table. Then anything updated or changed will carry a higher value than the last one loaded into the warehouse. I would seriously look at a very lightweight change to the most heavily used source tables to get out from under the 'lack of delta' problem.
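A minimal sketch of that, with hypothetical names: add a rowversion column to the source table and pull only the rows past the last value you loaded. It covers inserts and updates for free, though not deletes:

-- Hypothetical names. rowversion is bumped automatically on every
-- insert/update, so anything changed since the last load has a higher value.
ALTER TABLE dbo.Orders ADD RowVer rowversion;

-- At extract time:
DECLARE @LastLoaded binary(8);
SELECT @LastLoaded = MaxRowVer
FROM   etl.LoadWatermark
WHERE  TableName = 'dbo.Orders';

SELECT o.*
FROM   dbo.Orders AS o
WHERE  o.RowVer > @LastLoaded;

-- Afterwards, write MAX(RowVer) from the extracted rows back to etl.LoadWatermark.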
However, sometimes you just can't: vendor system, code lockdown, intransigent data architects, whatever. In that case, you're looking at brute force.
1) Make sure you definitely truncate/load a staging table to do your OWN deltas from. This allows for minimal logging and the like and will make your life much less painful on the data delivery.
2) The hash is your 99% best bet; assign it in the staging table. Use SSIS and include the hash as a calculated column during delivery to the target staging table, and store it in the warehouse with your detail level. The best method is a Script Component configured as a synchronous transformation; anything else gets unwieldy quickly, and there's nothing built into SSIS to do this via a Derived Column Component (a T-SQL sketch of the hash expression follows this list).
2a) Be aware hashing isn't foolproof. If you concatenate columns without a delimiter, 'ab' + 'cabc' hashes the same as 'abc' + 'abc', because both produce the string 'abcabc'. Every now and again you'll need to 'reset' the data to get rid of the incredibly rare glitch. My average was roughly 1 in 10,000,000 rows that missed an update.
3) Try to avoid multi-passing the data in the staging table for updates; do as much of the delta determination and validation as you can in a single pass.
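The recommendation above is to compute the hash inside the SSIS data flow via a synchronous Script Component; purely to illustrate the idea (and the delimiter point from 2a), the equivalent expression in T-SQL, with hypothetical column names, is HASHBYTES over a delimited, NULL-handled concatenation:

-- Illustration only, not the Script Component approach recommended above.
-- The '|' delimiter stops 'ab' + 'cabc' and 'abc' + 'abc' from producing the
-- same concatenated string; ISNULL keeps NULLs from nulling the whole hash.
-- Note HASHBYTES input is capped at 8,000 bytes before SQL Server 2016.
SELECT o.OrderID,
       HASHBYTES('SHA1',
           ISNULL(o.CustomerName, '') + '|' +
           ISNULL(o.ShipCity, '')     + '|' +
           ISNULL(CONVERT(varchar(30), o.OrderAmount), '')) AS RowHash
FROM   stg.Orders AS o;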
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
Twitter: @AnyWayDBA
May 16, 2012 at 5:59 pm
2a) Be aware hashing isn't foolproof. If you concatenate columns without a delimiter, 'ab' + 'cabc' hashes the same as 'abc' + 'abc', because both produce the string 'abcabc'. Every now and again you'll need to 'reset' the data to get rid of the incredibly rare glitch. My average was roughly 1 in 10,000,000 rows that missed an update.
Out of interest, Craig, this problem has been on my mind too. I have not opted to do resets unless someone raises a query related to data quality and I can determine there is a problem. What do you do here? Do you do them on a schedule?
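For concreteness, what I'd picture for a 'reset' (hypothetical names) is an occasional pass that ignores the stored hash and compares the columns directly, patching anything the incremental comparison missed:

-- Hypothetical names; run occasionally (e.g. monthly) rather than nightly.
-- EXISTS/EXCEPT gives a NULL-safe column-by-column comparison.
UPDATE dw
SET    dw.ColA    = stg.ColA,
       dw.ColB    = stg.ColB,
       dw.RowHash = stg.RowHash
FROM   dw.Orders  AS dw
JOIN   stg.Orders AS stg ON stg.OrderID = dw.OrderID
WHERE  EXISTS (SELECT dw.ColA, dw.ColB
               EXCEPT
               SELECT stg.ColA, stg.ColB);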