March 9, 2009 at 5:04 am
I am building a DW gathering data from all our sites around the world, but I do not want to pull down all the data for each table (dim/fact) every day, just the changes.
Each table in question has a timestamp column, so I believe I should be able to make use of that by creating an interim table on the source (WAN) server containing just the timestamp and pkey of the table, so a comparison can be made locally to reduce the amount of data I transfer.
Is this a sound idea? Is there a better approach for this 'typical' scenario?
If not, what datatype do I create to hold the timestamp value for comparison?
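Roughly what I have in mind for the interim table is something like this (names are just placeholders; I'm assuming binary(8) is the right type to hold a copy of the timestamp value, since an actual timestamp column generates its own value):

-- Interim table on the source (WAN) server, one per table being tracked
CREATE TABLE dbo.Item_ChangeTracking
(
    ItemID     int       NOT NULL PRIMARY KEY,  -- copy of the source table's pkey
    LastRowVer binary(8) NOT NULL               -- timestamp value as of the last extract
);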
TIA
March 9, 2009 at 6:36 am
Timestamp is a really poor data type to use for any sort of date comparison. It isn't really a date, just an incrementing number that loosely mimics a date/time. It also isn't terribly convertible.
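Just to illustrate (a throwaway example, table name made up), the value is nothing more than an 8-byte counter:

CREATE TABLE dbo.TimestampDemo (ID int IDENTITY PRIMARY KEY, RowVer timestamp);
INSERT INTO dbo.TimestampDemo DEFAULT VALUES;
SELECT ID, RowVer FROM dbo.TimestampDemo;  -- RowVer comes back as binary, e.g. 0x00000000000007D1
SELECT @@DBTS;                             -- the database's current timestamp counter, also just binary
-- Converting RowVer to datetime does not give you a meaningful date.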
You can create an actual ModifiedOn column as a datetime and use this instead. However, I'm compelled to ask: if you're looking for changes only, why not use Transactional Replication to move the changes to a staging database and do your DW work from there?
Transactional Replication is relatively easy to set up, rarely breaks anymore, and has the ability to "intelligently" track changes without you having to code or force anything.
March 9, 2009 at 8:10 am
Hi Brandie,
Thanks for the reply, sorry, maybe I was not clear in my description.
I wasn't suggesting using the timestamp for a date comparison, just a value comparison. I would do a join based on the primary key field (not the timestamp column); if a match is found, I check whether the timestamp is different, in which case I know the record has been modified. Therefore I only send those records, plus the ones that do not exist (based on the pkey field) in my intermediate table, across the WAN to the main DW server.
As for replication, there are several reasons why I believe, unfortunately, I cannot use it in my 'inherited' scenario.
As our schema changes frequently wouldn't that require a full snapshot being sent (100+GB) every time?
What is the trigger to say when a full snapshot is required? Is it changes to the definition of just the fields configured for replication, or any field in a 'replicated' table, etc.?
Given our application, we have to rebuild all the indexes almost daily, causing a TLOG the same size as the DB, which causes disk space issues. So we change the DB to simple recovery mode before reindexing, then back to full. As I understand it, transactional replication would fail in that situation, is that right?
The upshot, I guess, is that I want to populate a staging table on the remote server consisting only of changes, and then bring that table down to the central DW server. As the tables all have a timestamp column, I thought I could use that in conjunction with the primary key field to easily detect which records have changed.
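Something along these lines is what I'm picturing for the change detection, run entirely on the remote server (purely a sketch, table and column names made up):

-- Find new or changed rows by comparing the live table to the interim table locally
SELECT s.ItemID, s.Col1, s.Col2, s.RowVer
FROM dbo.Item AS s                          -- live application table
LEFT JOIN dbo.Item_ChangeTracking AS t      -- interim table: pkey + timestamp from last extract
       ON t.ItemID = s.ItemID
WHERE t.ItemID IS NULL                      -- new record
   OR t.LastRowVer <> s.RowVer;             -- timestamp differs, so the record has changed
-- Only these rows get written to the staging table and sent across the WAN;
-- afterwards the interim table is updated with the new timestamp values.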
March 9, 2009 at 8:22 am
sotn (3/9/2009)
I wasn't suggesting using the timestamp for a date comparison, just a value comparison. I would do a join based on the primary key field (not the timestamp column); if a match is found, I check whether the timestamp is different, in which case I know the record has been modified
Ah, that's a little different from what I thought you were saying. And I don't see any reason why it shouldn't work. However, comparing between databases still takes more effort & processing power than just checking one database based on a datetime column. Especially since you're adding lots of network traffic to the burden. So, I do stand by my prior suggestion. But it is just a suggestion and since the databases are all yours, you are free to follow or ignore it as you choose. @=)
As our schema changes frequently wouldn't that require a full snapshot being sent (100+GB) every time?
Depends on your definition of "frequently". Transactional replication sends records every time there's a new transaction; sometimes it runs every minute, sometimes longer. If your schema is changing every minute of every day, then you have a problem with your development process, IMHO. However, you have a valid point that, depending on the data required and the schema changes made, you might be updating your articles more often than you want to.
What is the trigger to say when a full snapshot is required? Is it changes to the definition of just the fields configured for replication, or any field in a 'replicated' table, etc.?
Only changes to the tables involved in replication require changes in your articles. You don't need to worry about anything else. However, you can find lots more information on replication in BOL.
Given our application, we have to rebuild all the indexes almost daily, causing a TLOG the same size as the DB, which causes disk space issues. So we change the DB to simple recovery mode before reindexing, then back to full. As I understand it, transactional replication would fail in that situation, is that right?
It's been a while since I worked with replication, but I believe you can schedule it to occur during certain times of the day. Since I don't believe your index rebuilding occurs during business hours, I don't think this would be a major issue. But again, I haven't worked with Replication since SQL 2000 so you'd want to check BOL to be sure.
I hope these answers help.
March 9, 2009 at 8:39 am
Yes the index rebuild is an overnight job.
I'll experiment with replication, because if changing the recovery model breaks transactional replication then I am stuck with trying to identify changes purely on the remote source server.
I may then have to build staging tables on the remote server and have a query compare, field for field, the fields I am interested in, so that only this staging table of changes is transferred across the WAN.
Rather than comparing field for field, I was hoping to use the timestamp column, as that would mean my staging table is only 2 fields wide. It does mean that if a field I am not interested in changes, I would still detect a change and replace the row with the same field values. I don't know if that makes sense, but it now makes sense to me what I need to do.
Thanks for your help.
March 9, 2009 at 8:47 am
When I talk about processing resources, I'm not talking about field-by-field comparisons. I'm talking about having to compare the timestamp in your DW to the timestamp in either your staging DB or the original DB. You're crossing the network to do this and comparing two different databases.
And whether you have the complete production database going down to your staging DB or you send the comparison directly to the production DB, you might still end up reading a full dataset from two databases. It depends on how your linked servers and queries are set up.
If replication doesn't work for you, I'd really suggest adding ModifiedOn & CreatedOn datetime columns in your source database. That way, you only have to read one database for the pull of your data each time. And you don't have to worry about database compares.
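For what it's worth, the pull then only needs something like this (hypothetical names, assuming the application or a trigger maintains the columns):

DECLARE @LastExtract datetime;
SET @LastExtract = '20090308';  -- in practice, read this from a control table in the DW

SELECT ItemID, Col1, Col2, CreatedOn, ModifiedOn
FROM dbo.Item
WHERE ModifiedOn > @LastExtract
   OR CreatedOn  > @LastExtract;  -- one database, one pass, no cross-database compare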
But again, if you're more comfortable doing it the way you've got it planned, by all means stick with that. There is no true "best practice" as situations differ between databases & companies.
BTW, have you checked out the slowly changing dimension transformations in SSIS?
March 9, 2009 at 9:05 am
I've had a look at the SCD transformation in SSIS, and while it worked very well for a dimension table of 100 records or so, when I tried it against our Item table (400,000+ records) it was much slower than direct SQL.
I don't think I am explaining the timestamp thing very well.
What I'm thinking of is this.
The DW server is in the UK
We have source servers in Germany, UK and USA (and a dozen other countries)
On the German Server I would create a staging table.
This staging table would contain the data to send across the network to the UK DW server.
This table would be populated by comparing its data to the live application data (both tables local on the German server). Therefore no network traffic is required for the identification itself, just for downloading the 'changed/new' data.
The same approach (staging tables) would be true on all the other remote servers.
March 9, 2009 at 9:18 am
We're doing something similar to your timestamp-column approach, although it's not quite as easy for us, since there isn't one in the source system DBs, which we don't have control over, so we're a little more limited. We generate a checksum instead, but after that the approach could be similar:
We build a custom DB on the source systems and populate a series of tables, one for each big table we're ETLing. Each of those tables has only 2 columns: the business key (called BK from now on) and the checksum (in your case the timestamp). We then build a view that LEFT JOINs the source table to the checksum table on the BK and the checksum (timestamp for you), with WHERE checksum IS NULL. These are all the new/changed rows. Then we use an SCD Wizard-like process in the ETL to split the new and changed rows. The last step is to populate the checksum table in the custom DB with the new values.
We like this approach because it's repeatable, and because it's very resilient.
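Stripped down, the source-side pieces look roughly like this (names are made up, and CHECKSUM over the columns is just one way to generate the value):

-- In the custom DB on the source system: one narrow table per big source table
CREATE TABLE dbo.Item_Checksum
(
    ItemBK   int NOT NULL PRIMARY KEY,  -- business key (BK)
    RowCheck int NOT NULL               -- checksum of the row as last loaded
);
GO
-- View exposing only the new/changed rows from the source table
CREATE VIEW dbo.vw_Item_Changes
AS
SELECT s.*
FROM SourceDB.dbo.Item AS s
LEFT JOIN dbo.Item_Checksum AS c
       ON c.ItemBK = s.ItemBK
      AND c.RowCheck = CHECKSUM(s.Col1, s.Col2, s.Col3)
WHERE c.RowCheck IS NULL;  -- no match on BK + checksum = new or changed row
GO

After the load, the last step repopulates Item_Checksum with the new BK/checksum values.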
Let me know if you need more explanation,
Rick Todd
March 9, 2009 at 10:31 am
I understand what you're saying about the timestamp perfectly. However, you're still sending a ton of records one way or the other over the network. Let me put it this way: in my opinion (and my opinion only), it is a Bad Idea to compare a staging table to a production table, because you should be comparing the records in the data warehouse to the production DB. The DW is the ultimate "truth" of all your servers, and if you're not comparing it properly, then you open yourself up to all sorts of problems.
Also, staging tables should, by definition, be temporary tables you can delete at any time.
If you want to keep a history table on the source side of everything you've sent, however, that's quite another thing. But you need to be aware of the differences between staging & history tables, especially as far as storage and longevity are concerned.
Ignoring the history table idea, though, let me throw a wrench at your monkey for a second. @=)
What happens when you have two identical records in two different servers / databases which have just been updated in two different ways but just happen to have the same timestamp info?
This is what MERGE replication is used for. It seems to me you're trying to manually rebuild a wheel that SQL Server has already provided.
A history table has the ability to do what you're looking for. However, maintenance might be a nightmare. Especially if you forget to document what's going on. The next DBA who comes along might not understand what you've done and completely hose what you're setting up now if you don't document it well.
July 21, 2009 at 5:21 pm
I understand your timestamp column logic (timestamp is now called rowversion in SQL Server 2008).
Makes sense to have a staging table at the OLTP data centers. SSIS is still in its infancy; I struggle with its poor performance weekly/daily. (I am an original Informatica DataMart ETL man.)
Can't use checksum, as there is no guarantee of uniqueness, but rowversion is unique and will be updated when any column in the row is updated. Too bad the EXCEPT operator returns distinct values and is a poor performer for identifying all columns in a row between 2 tables.
In Oracle 10g+ the MINUS set operator is by far the best table-by-row-by-column comparison available today. I am able to compare all columns in staging tables with OLTP production tables (over 100 GB a day), transform, and load in a couple of hours. Of course the use of Oracle external tables helps the initial staging-table load time (similar to OPENROWSET(BULK '...') in SQL Server); this eliminates the DB overhead of loading data into a database table just for preliminary change identification. The identified changed rows per table are written to a permanent DB table, and the MERGE command is used to load into the warehouse.
I have implemented the Oracle-style ETL process in MSSQL:
BCP out the data where the rowversion differs per PK (per OLTP table) from the last extract, tracked in staging control table(s).
In the first step of the transform, use a join with OPENROWSET(BULK '...') and a format file from the OLTP table BCP.
Then T-SQL SPs to load the rows / facts and dimensions.
SSIS cannot compete with this strategy.
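For the transform step, the join looks roughly like this (paths, format file, and table/column names are placeholders only):

-- Join the BCP'd extract of changed rows straight into the transform,
-- without loading it into a work table first
SELECT x.ItemID, x.ItemName, d.ItemKey
FROM OPENROWSET(BULK 'D:\Extract\Item_changes.dat',
                FORMATFILE = 'D:\Extract\Item_changes.fmt') AS x
LEFT JOIN DW.dbo.DimItem AS d
       ON d.ItemID = x.ItemID;  -- NULL ItemKey means a new dimension member, otherwise a change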