March 30, 2010 at 8:54 am
I'm in the process of putting together a DR solution and was wondering who has used 3rd party software for replication of their SQL system. I just watched a webcast about CA's XOSoft, aka ARCserve Replication and High Availability, which is replication technology for the whole network. CA touts it as a High Availability solution for ANY application or ANY server (virtual, MS, Linux, etc.).
It sounded great, but I don't know what kind of impact some of these applications have on SQL, or what benefits they do or don't have in comparison to SQL's native replication processes. They said it provides real-time replication and failover for a "seamless" transition in case of disaster.
Any thoughts?
March 30, 2010 at 9:24 am
We use XOSoft at the company I currently work for. It's quite good at failover, and horrible at failback.
If we want to start using the DR servers, it's very fast. If we need to move back to using our regular servers, XOSoft takes over 24 hours to recover back to them, and ties up both sets of servers and all available bandwidth in our network while doing so.
With SQL failover using XOSoft, it can have problems with the mdf and ldf files ending up out of synch with each other, since it operates on blocks on the hard drive. If a transaction is interrupted by a drive failure, the blocks that have been copied to the DR site can be out of synch, making the database unrecoverable. This is rarely a problem, but we have seen it once in one test, and once is too often.
We're in the process of moving to SQL native DR at this time, because of these factors.
(Note that XOSoft was a considerable improvement over prior DR plans/methods for this company, when they didn't have a DBA and were going from "no DR" to "any DR". It's just that we can do better now.)
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 30, 2010 at 9:46 am
GSquared,
Wow, thanks for the reply. It's funny because the webcast really emphasized how horrible MS Exchange is at failback and how XOSoft is a good replacement for native Exchange technology. I wonder if they've improved it. This was for the new release (r12, I believe).
You touched on a few of my concerns with an "all-in-one" solution. If it's not really built specifically for SQL will it deliver?
Do you plan on keeping XOSoft or are you guys moving to another solution for the rest of your DR plan?
March 31, 2010 at 6:22 am
I think we're keeping XOSoft for non-SQL servers.
April 13, 2010 at 6:13 am
I've seen CA XOsoft protect a large number of MS SQL servers in a variety of configurations. I've never seen issues with data consistency as described here.
Let's first review MS SQL Server and how transactional updates are applied. MS SQL is a write-ahead transactional database: when updates are applied to a SQL database, they are first written to the transaction log, and then applied asynchronously to the database file. If an update were being applied to the transaction log and the server went down hard, then when SQL is brought back online it executes a recovery in order to bring the database online. SQL compares the transaction log to the database file, reading through the checkpoints in the log. Completed transactions without a checkpoint represent updates that have not yet been applied to the database. During recovery, SQL applies those completed transactions to the database file and sets a checkpoint in the log file (roll forward). If an update is incomplete, SQL removes the partial update from the transaction log, rolling back to the last consistent checkpoint.
If your server went down hard with transactional updates partially completed, those updates would be lost when the server came back online. This is a function of how MS SQL maintains data consistency.
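The roll-forward/roll-back behavior described above can be sketched with a toy model. This is purely illustrative (Python): the record format, function name, and the dict standing in for the data file are all invented for the sketch, not SQL Server internals.

```python
# Toy model of write-ahead-log recovery: committed transactions recorded
# after the last checkpoint are rolled forward into the data file; an
# uncommitted (interrupted) transaction is rolled back, i.e. discarded.

def recover(log, db):
    """log: ordered records of the form ('begin', txid), ('write', txid,
    key, value), ('commit', txid), or ('checkpoint',). db: a dict playing
    the role of the data file. Returns the recovered data file."""
    applied_through = -1                    # index of the last checkpoint
    for i, rec in enumerate(log):
        if rec[0] == 'checkpoint':
            applied_through = i             # everything before it is in db

    committed, pending = set(), {}
    for rec in log[applied_through + 1:]:
        if rec[0] == 'begin':
            pending[rec[1]] = []
        elif rec[0] == 'write':
            pending[rec[1]].append((rec[2], rec[3]))
        elif rec[0] == 'commit':
            committed.add(rec[1])

    # Roll forward committed transactions; anything uncommitted is
    # implicitly rolled back because it never touches db.
    for txid in committed:
        for key, value in pending[txid]:
            db[key] = value
    return db

log = [('begin', 1), ('write', 1, 'a', 10), ('commit', 1),
       ('begin', 2), ('write', 2, 'a', 99)]   # tx 2 interrupted mid-flight
print(recover(log, {}))                       # tx 1 rolled forward, tx 2 lost
```

Note how the interrupted transaction 2 simply disappears on recovery, which matches the point above: a hard crash loses partially completed updates by design, and that is how consistency is preserved.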
CA XOsoft both synchronizes and replicates the SQL transaction logs (.LDF files) as well as the SQL DB files (.MDF and .NDF files), and preserves write order in its replication process. An update made to SQL is first written to the transaction log. CA XOsoft acts as a file-system filter driver that captures changes as they pass through the kernel on their way to the file system. It creates a replication file corresponding directly to the update being applied and sends that replication file to the disaster recovery server (the Replica) in real time.
When an update is applied to the SQL transaction log, it is sent to and applied on the Replica. When SQL then asynchronously updates the DB file, CA XOsoft creates a replication file for that update and sends it to the Replica server as well.
If the Production server (Master) goes offline, all the updates applied to the file system have already been sent to the Replica server. When a CA XOsoft switchover occurs, CA XOsoft brings the SQL instance on the Replica online. MS SQL executes a recovery and either rolls forward the completed transactions in the logs, setting a checkpoint in the log file, or rolls back the uncompleted transactions to the last consistent checkpoint.
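The write-order-preserving behavior described in these posts amounts to shipping captured file-system writes through a FIFO journal and applying them in the same sequence on the replica. A minimal sketch of the idea (Python; the class name, fields, and file names are illustrative assumptions, not XOsoft's actual interfaces):

```python
from collections import deque

class OrderedReplicator:
    """Toy model: each file write captured by the 'filter driver' is
    queued in capture order, and the replica drains the queue in exactly
    that order (FIFO), so a log-file (.ldf) write and its subsequent
    data-file (.mdf) write can never be reordered in transit."""

    def __init__(self):
        self.journal = deque()     # replication 'files', in capture order
        self.replica = {}          # replica's file system (path -> bytes)

    def capture(self, path, data):
        """Called as each write passes through the filter on the Master."""
        self.journal.append((path, data))

    def apply_pending(self):
        """Replica applies queued updates strictly first-in, first-out."""
        while self.journal:
            path, data = self.journal.popleft()
            self.replica[path] = data

rep = OrderedReplicator()
rep.capture('db.ldf', b'log: tx1 commit')   # log write is captured first
rep.capture('db.mdf', b'page update tx1')   # data-file write comes second
rep.apply_pending()                         # replica sees them in order
```

Under this model, the replica at any instant looks like the master did at some past instant, which is why SQL recovery on the replica can always roll forward or roll back cleanly.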
There is no way for the transaction logs and DB files to get "out of sync," because CA XOsoft replicates the updates made to the DB files in the exact order they are applied. Claiming that the .ldf and .mdf files were "out of sync," as a previous poster did, implies that transactional updates are being sent and applied out of order. This is absolutely not true.
What would cause corruption is failing to replicate a particular DB or log file. Mounting the DBs on the Replica while a CA XOsoft scenario is running, or stopping and then restarting replication while skipping synchronization, are about the only ways to corrupt a DB. These are unfortunate events, but all are the result of administrative oversights, not faults in the product itself.
In regard to the comments about forward vs. backward performance:
When you run a CA XOsoft scenario, CA XOsoft needs to synchronize the data. Synchronization compares the data on the active server with the data on the inactive server in order to identify the differences; once identified, only the differences are sent from the active server to the inactive server. (I use the terms active and inactive because a switchover may have occurred, and in a backward scenario the Replica is the active server and the Master is the inactive server.)
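The synchronization step described above, comparing the two sides and shipping only what differs, can be sketched as a block-level diff. This is a toy illustration (Python); the block size and the use of SHA-256 hashes are assumptions for the sketch, not the product's actual algorithm:

```python
import hashlib

BLOCK = 4096  # illustrative block size

def blocks(data):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def sync(active, inactive):
    """Return the (index, block) pairs the active side must send so the
    inactive side matches it. Blocks are compared by hash rather than by
    shipping raw data; truncating extra trailing blocks on the inactive
    side is omitted for brevity."""
    a, b = blocks(active), blocks(inactive)
    diffs = []
    for i, blk in enumerate(a):
        peer = b[i] if i < len(b) else None
        if peer is None or hashlib.sha256(blk).digest() != hashlib.sha256(peer).digest():
            diffs.append((i, blk))
    return diffs

active   = b'A' * 8192 + b'B' * 4096   # three blocks on the active side
inactive = b'A' * 8192 + b'C' * 4096   # only the last block differs
changed = sync(active, inactive)
print(len(changed))                    # only one block needs to be sent
```

The practical consequence, as the post goes on to argue, is that the time and bandwidth cost of a resync is driven by how much data actually changed, not by the total size of the databases.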
It may take some time to complete the comparison and send the data. If the backward synchronization takes a long time, as reported, AND all of the available bandwidth is consumed during the process, then I would be looking at the amount of available throughput. This, in itself, is a bit of a paradox: if we use less bandwidth, it will take longer; if we try to complete synchronization faster, we need even more bandwidth. The amount of data that has changed dictates how much data needs to be sent, which in turn dictates how much bandwidth is used and, more importantly, how long that bandwidth will be tied up to complete the process.
No product can suspend the laws of physics. A T-1 will only run at 1.54 Mb/s, which means you will never push more than 16.6 GB of data through that circuit in a 24-hour period, even if you completely saturate the circuit. That is raw throughput; TCP overhead and network latency will reduce the realistic throughput further. A properly sized network will accommodate the synchronization process without any issues.
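The T-1 arithmetic above checks out directly (raw line rate only, ignoring the TCP overhead and latency the post mentions):

```python
# Upper bound on data moved over a saturated T-1 in one day.
T1_BITS_PER_SEC = 1.54e6         # nominal T-1 rate as quoted in the post
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = T1_BITS_PER_SEC / 8 * SECONDS_PER_DAY
gb_per_day = bytes_per_day / 1e9
print(round(gb_per_day, 1))      # roughly 16.6 GB/day at full saturation
```

So if a backward sync needs to move substantially more changed data than that, a multi-hour (or multi-day) resync over a T-1 is a circuit-sizing problem, not a product problem.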
We should also try to keep in mind that during the reverse synchronization and subsequent replication, the users are still up and running on an active server. This is really the key we should be focusing on here.
Some of the benefits of CA XOsoft over native SQL replication are:
Auto-detection of databases and subsequent auto-configuration of the replication scenario. With SQL replication you would need to manually modify the SQL job to add new or modified databases; CA XOsoft lets you run Auto-Detect Databases, which automatically adds all new or modified DBs to the replication scenario for you.
Assured Recovery, a tool included in the product that lets you verify database consistency on the Replica server.
Data Rewind, which lets you rewind the DB to a point in time in the past, such as prior to a corruption event. This significantly reduces data loss and improves both the recovery point objective and the recovery time objective compared to traditional backups/restores.
A centralized management utility that lets you manage DR replication and high availability for a variety of servers (MS SQL, MS Exchange, file servers, Oracle, IIS, BES) as well as a variety of operating systems (Windows, Solaris, Red Hat Linux, AIX). This unifies your DR solution, as opposed to maintaining skill sets in a variety of replication solutions that vary by application and operating system.
April 14, 2010 at 9:22 am
On a similar functionality, has anyone here used InMage and can comment on it?
I had a visit from an InMage sales rep yesterday, and it looks like it does the same things as XOSoft as described here.
April 14, 2010 at 10:19 am
The CA rep that edited his post hit on some very valid points. However, I'm not a SQL expert, nor have I ever set up Active/Passive solutions with SQL, so I'm not able to argue the other side of things. Maybe one of the SQL elders on here will chime in...
I see the advantages of using one app to serve your whole infrastructure, but I don't think it'd be cost-effective if you were only trying to use it in a SQL HA/DR scenario.
Also, these third-party tools say they provide HA, but do they provide load balancing? Not that I need it, but I wish I did!
There is an HA/DR vConference next week with a session dedicated specifically to SQL HA. Several 3rd-party vendors, like CA and Symantec, are presenting sessions. If you're interested, it's free and online.
Here is a link to the sessions: http://www.vconferenceonline.com/shows/spring10/highavailability/sessions.asp
April 14, 2010 at 11:05 am
I hate to answer a question with a question, however:
1) Are we talking about MS SQL 2005 or 2008?
2) Could we clarify what you're referring to when you say "Hot/Cold" as opposed to "HA"?
I think of Hot/Cold as an Active/Passive type of configuration, like you'd see in an MSCS cluster. HA is a more generic term for High Availability, which could be Active/Passive.
April 14, 2010 at 11:16 am
If you have it available, SAN replication is often VERY, VERY good for DR/HA solutions. I would absolutely recommend AGAINST DoubleTake; I'd rather take a hot poker in the eye, TWICE, than ever deal with that again. However, SAN replication is often not available to smaller organizations. I have had a fair amount of experience with MS clustering, both active/passive and active/active, and have largely enjoyed it. I just wish it allowed load balancing on top of the same database files, which I believe Oracle does. That would probably be my one complaint.
I would ask what scenarios you are trying to protect against and what requirements you need to meet. And they need to be specific, not just "the database needs to be available." How long can it be down? Do you need off-site replication? Do you have a BCP site? What other tech do you have available, like SAN replication? DR/HA solutions are rarely designed solely by the DBA staff; they're designed by subject-matter experts in servers, storage, networking, AND databases in your organization. Sometimes a couple of these are the same person; in bigger orgs, not so much.
CEWII
April 14, 2010 at 11:32 am
Steve Lavoie:
I understand that you have a vested interest in proving that XOSoft cannot create the situation I have described. Nevertheless, the physical universe disagrees with your theories, in that this situation did occur. Your theories sound solid, but whenever experience differs from theory, it isn't experience that needs to be modified.
Even if the failure of XOSoft to keep the DR mdf and ldf synchronized was somehow caused by some outside factor, the end result is that the two were out of synch at the DR site, when we deliberately caused a "crash" during a test. Had the primary site actually been down, the DR site would have been useless to us, because SQL Server was unable to bring those databases online (they showed as Suspect). No matter how many theories you throw at that, the reality is that it didn't do what you're describing.
That point may be arguable, but what isn't arguable is that using XOSoft for failover on our databases is fast, while failback takes the servers offline for most of a day. That's not a viable situation for our DR. This is consistent in every test of every failure situation we can come up with that requires activating the DR site.
If you can show me a way to overcome that problem, I'll happily accept it for testing and evaluation. Currently, however, we are planning to migrate away from XOSoft in June, because of that situation. That's your deadline for proving me wrong.
If I'm wrong, that would be cool, because it means less work for me. If I'm right, I have to set up a full DR solution pretty much from scratch. XOSoft is already up and running, so fixing it will definitely be better. I'd thus prefer to be wrong about the failback on it.
April 14, 2010 at 11:57 am
Well said.
CEWII
April 14, 2010 at 3:00 pm
GSquared,
You said that it took your servers off-line most of the day. Was that both servers ("failed to" and "fail back")?
April 14, 2010 at 3:14 pm
AVB (4/14/2010)
GSquared,You said that it took your servers off-line most of the day. Was that both servers ("failed to" and "fail back")?
Yes, both sites go offline (effectively) while failing back.
Steve Lavoie contacted me separately, and he thinks he knows why, and might be able to do something about it, but isn't sure. Discussion continues....
January 10, 2011 at 5:03 am
Hello,
I'm very interested in this topic. I have similar issues (with another replication software package). Have the corruption issues been solved?
Thank you
GSquared (4/14/2010)
AVB (4/14/2010)
GSquared,You said that it took your servers off-line most of the day. Was that both servers ("failed to" and "fail back")?
Yes, both sites go offline (effectively) while failing back.
Steve Lavoie contacted me separately, and he thinks he knows why, and might be able to do something about it, but isn't sure. Discussion continues....
October 25, 2012 at 9:19 am
GSquared, what was the ultimate outcome of your discussion with CA about this issue? We have had similar problems with ARCserve during DR tests (some databases being suspect on the replica).
I've been tasked with researching alternative solutions because management is understandably nervous about the reliability of the ARCserve solution in light of this issue. Obviously, if there is something which can be done to address the issue with ARCserve then I would prefer to do that rather than architect a new HA/DR solution from scratch.