Disaster Recovery Strategy for VLDB

Question

Disaster Recovery Strategy for VLDB

DBADave

SSChampion

Points: 10962
More actions
May 4, 2009 at 1:41 pm

#137229

We have 19 databases that total 915GB in data file size and 140GB in log file size. We anticipate about 250GB of growth each year. Our disaster recovery process uses Double-Take replication software to replicate data and log files to our DR site, located < 100 miles from our office. We've found that when reindexing runs, Double-Take cannot keep up with the changes and crashes. This necessitates a remirror to the DR site, which is costly in terms of time and system resources. I have the following questions. (1) How do you replicate data of similar or larger size to your DR site? Log Shipping, Database Mirroring, SAN Copy, etc...
(2) How do you handle large amounts of data changes that occur during reindexing operations?
Here's our configuration.
SQL Server 2005 EE SP2 - 64bit
Gig NIC
EMC Clarion SAN (low-end SAN)
RAID 10 for Data and Log files (3 RAID 10 arrays for data files and 2 RAID 10 arrays for log files)
Default Filegroups with One Data File and One Log File Per Database
450MBit link to DR site

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Pam Brisjar SSChampion Points: 12094 More actions · Answer 1

I used to maintaine a 3TB DB. We had mirroring implemented on it for recovery and reporting purposes. When I did the big ops like index rebuilds, I'd have to trun off mirroring, run the maintenance stuff, bring the mirror DB up to the level of the primary, then re-start mirroring. Quite a pain, but I figured better than the pain of not having a fail-safe.

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 2

As a side bar, considering things like hurricane Katrina and other major disasters, I'm thinking that <100 miles isn't enough especially in coastal regions and earth quake prone areas.

Shifting back to the subject at hand, we had an EMC Clarion (don't know which model) and the DBA's did what they called a "SAN Snapshot" once every 4 hours. It took something less than 11 minutes for almost a tera-byte. Of course, that was server to server in the same building but a good dedicated line should be able to handle the "long haul" with only slight degredation of performance.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

David Benoit SSC-Dedicated Points: 34562 More actions · Answer 3

Agreed, SAN "snapshotting" is the way to go, especially considering the growth that you are looking at. Interesting struggle that you are running into with your vendor, somewhat disappointing to hear though.

The biggest thing with the SAN replication is to ensure that you perform testing on what is going to be required to get the other site up in the case of disaster (which is true with any type of DR solution but a bit more when you are using technology that doesn't involve SQL Server like mirroring and log shipping).

I have worked with some SE's in the past to build something similar to what you would need on an higher end EMC SAN for some Oracle DR replication in the past and found it to be quite reliable. At that time (2005) it was a bit more difficult than what it appears to be today.

David

@SQLTentmaker

“He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

dba-don SSC Enthusiast Points: 120 More actions · Answer 4

Hi

You should consider defragmenting the indexes and only rebuilding indexes as needed as rebuilding indexes has a large penalty on the transaction logs you may also consider setting your database to bulk-logged before performing the reindexing. This will definately minimize the bit level replication that is occuring via double take.