"poor man's replication" ETL process design

  • If you had access to the transaction logs, then you could do log shipping replication. If you had an actual login account, then you could leverage SSIS.

    Having to operate through that linked server really limits your options. That linked server connection is actually less secure and less auditable than if they simply provided you with a dedicated ETL login account. As it stands, you can apparently select from any table you need without restriction, but the difference is that you are forced to pull the data in a technically awkward and inefficient way.
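    For illustration, a pull through the linked server ends up looking something like the sketch below (the linked server, database, and table names are hypothetical). OPENQUERY at least sends the filter to the remote server, whereas four-part naming can drag far more rows across the link than needed, depending on what the optimizer is able to remote:

    -- Hypothetical linked server [SourceLS] and remote table SourceDB.dbo.Orders.

    -- Four-part naming: simple to write, but the filter may be applied locally,
    -- so many more rows than necessary can travel across the link.
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM   [SourceLS].[SourceDB].dbo.Orders
    WHERE  OrderDate >= DATEADD(DAY, -1, CAST(GETDATE() AS date));

    -- OPENQUERY: the query text is executed on the remote server, so only
    -- the qualifying rows come back over the linked server connection.
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM   OPENQUERY([SourceLS],
           'SELECT OrderID, CustomerID, OrderDate, Amount
            FROM   SourceDB.dbo.Orders
            WHERE  OrderDate >= DATEADD(DAY, -1, CAST(GETDATE() AS date))');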

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • TheSQLGuru (1/12/2015)


    So you believe that adding a timestamp and non-clustered index to a table is more likely to break production than adding a trigger that inserts into an audit table? Yes, it will add additional I/O to maintain the non-clustered index, but so would the trigger and audit table, and the audit table itself would need an index. What specifically is your concern?

    Yes, I do believe the trigger is less likely to cause issues (and especially break an app) than adding a rowversion column. MANY applications out there will work improperly when a column is added to a table. Here is a short list of issues:

    1) SELECT * queries. This one alone is a complete deal-breaker in many apps I see in the wild.

    2) Every additional index on a table increases the work done by each data change, and with it the potential for deadlocks on said table.

    3) Views on the table could be affected.

    Other issues off the top of my head:

    1) Complete non-supportability by the vendor. Yes, a trigger is likely unsupported as well, but it is a LOT easier to disable or drop a trigger if necessary before applying a patch, calling the support hotline, etc. (a minimal sketch of such a trigger follows this list).

    2) Fragmentation caused by the additional column being added to the table. This could require an index rebuild to address, with a potential outage if it cannot be done online.

    3) Increased index maintenance and statistics time/cost.

    4) Increased backup size and duration. And yes, those 8 bytes (plus the SAME 8 bytes AND the clustering key in the NC index) for every row WILL add up, especially on larger tables and/or ones with unfortunate clustering keys.

    5) Bloated memory usage from the same extra bytes in 4) above --> less useful data in RAM --> more physical IO --> worse performance.
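    For reference, a minimal sketch of the trigger-plus-audit-table approach weighed above (table, column, and trigger names are hypothetical). It records the keys of changed rows so the extract only has to revisit those rows, and it can be switched off with a single statement if the vendor requires it:

    -- Hypothetical audit table keyed on the source table's primary key.
    CREATE TABLE dbo.Orders_ChangeLog
    (
        OrderID   int      NOT NULL,
        ChangedAt datetime NOT NULL CONSTRAINT DF_Orders_ChangeLog_ChangedAt DEFAULT (GETDATE())
    );

    CREATE NONCLUSTERED INDEX IX_Orders_ChangeLog_ChangedAt
        ON dbo.Orders_ChangeLog (ChangedAt);
    GO

    CREATE TRIGGER dbo.trg_Orders_ChangeLog
    ON dbo.Orders
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- inserted and deleted cover all three actions; UNION removes duplicates.
        INSERT INTO dbo.Orders_ChangeLog (OrderID)
        SELECT OrderID FROM inserted
        UNION
        SELECT OrderID FROM deleted;
    END;
    GO

    -- Easy to switch off before a vendor patch or support call:
    DISABLE TRIGGER dbo.trg_Orders_ChangeLog ON dbo.Orders;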

    Maybe the solution proposed earlier, performing a full scan on each table to get IDs and hash values, would prove workable. However, considering the size of the tables, I'd be surprised if the scan/hash solution ran in less than an hour even for the daily incremental extracts.
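    Roughly, the scan/hash approach being referred to would look like the sketch below (object names are hypothetical, and etl.OrderHashes stands in for the hashes saved on the previous run). Every row still has to be read and hashed on every run, which is why it is unlikely to be quick on large tables:

    -- Hash the non-key columns of every row; a delimiter keeps adjacent
    -- values from colliding ('ab' + 'c' vs. 'a' + 'bc').
    ;WITH CurrentHashes AS
    (
        SELECT  OrderID,
                HASHBYTES('SHA2_256',
                    CONCAT(CustomerID, '|', OrderDate, '|', Amount)) AS RowHash
        FROM    dbo.Orders
    )
    SELECT  c.OrderID
    FROM    CurrentHashes AS c
    LEFT JOIN etl.OrderHashes AS p          -- hashes captured on the previous run
           ON p.OrderID = c.OrderID
    WHERE   p.OrderID IS NULL               -- new row
       OR   p.RowHash <> c.RowHash;         -- changed row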

    However, regarding my suggestion of adding TIMESTAMP columns, it seems to me the only compelling argument against it is the loss of vendor support in the event that the source database is provided by a 3rd-party vendor. Maybe I missed it, but reading back through prior posts, I don't see it stated that that's the case.

    If the application or reports perform SELECT * in such a way that they would break when a new column is added, then obviously that's a non-starter for adding timestamps. But this can be confirmed with comprehensive testing in development or QA. Really, if the application crashes simply because a new column was added, then they're going to hit this road block eventually anyhow, so it may be time to plan for it and make the proper refactoring changes to the SQL.

    The downtime required to add the TIMESTAMP column, the index fragmentation, and the potential blocking while the ALTER TABLE operation completes are one-off concerns that would apply equally when adding any new column for any reason. They can easily be mitigated by performing the operation during a regular maintenance window, which typically includes an index defragmentation task anyhow.

    If the TIMESTAMP column is nullable, then adding it to the table is a metadata-only change rather than a size-of-data operation, so it won't create fragmentation or extended locking. It should require only a few seconds for the table's schema-modification lock. The timestamps on existing records can be left as NULL, with an assumed value of 0x0 for the purpose of change data logging.
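    Concretely, assuming the behaviour described above (object names are hypothetical), the change and the incremental extract would look something like this; rows untouched since the column was added keep a NULL rowversion and are covered by the initial full load rather than the incremental pulls:

    -- Nullable rowversion: a metadata-only add, so existing rows are left NULL
    -- until they are next modified.
    ALTER TABLE dbo.Orders ADD RowVer rowversion NULL;

    -- Non-clustered index so the incremental extract can seek rather than scan.
    CREATE NONCLUSTERED INDEX IX_Orders_RowVer ON dbo.Orders (RowVer);

    -- @LastRowVer is the high-water mark saved by the ETL job on the previous
    -- run (treat the initial state as 0x0, per the assumption above).
    DECLARE @LastRowVer binary(8) = 0x0;

    -- Incremental extract: only rows modified since the last run.
    SELECT  OrderID, CustomerID, OrderDate, Amount, RowVer
    FROM    dbo.Orders
    WHERE   RowVer > @LastRowVer;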

    All this can be confirmed by deploying to a dev environment and smoke testing the applications.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
