October 14, 2013 at 5:04 pm
As part of a re-design of our disaster recovery and high availability environment we are considering several designs.
One design would mean trying to set up a server in one location that is part of a local two-node cluster, so that a SQL instance would remain up if there was a hardware failure (the instance would fail over to the other node). Let's say the location is New York.
Let's say we also have a datacenter in Los Angeles. It also has a SQL instance clustered on its own two nodes located in Los Angeles.
So the instance SqlLA\InstanceOne would be a clustered instance on the nodes LA01 and LA02 in LA. The instance SqlNY\InstanceTwo would be a clustered instance on nodes NY01 and NY02 in NY.
Then I would want to set up an HA group between SqlLA\InstanceOne and SqlNY\InstanceTwo. That would mean, though, that the nodes would need to be resources in two different Microsoft clusters: I would have to add NY01 and NY02 as nodes to the LA cluster even though they are already part of the NY cluster.
The idea is that if there was a real disaster event for, say, LA, then the nodes in NY would take over the availability group. Then if there was a hardware failure on one of the nodes in NY, the SQL instance in NY would still be up.
My gut reaction is that this seems like a really bad idea technically, as the two clusters would not coordinate well. Although I'm not sure at all; maybe this would work.
Does anybody know?
October 15, 2013 at 6:41 am
No, you can't do that. You will need to geographically stretch your cluster across the four nodes in both sites. This, of course, brings with it the usual infrastructure design requirements, such as domain controller deployments, IP addressing, name resolution, etc.
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
October 15, 2013 at 8:16 am
Thanks for the information, I appreciate the suggestions.
We do have the ability to reference the IP addresses in one location from another, and we can resolve DNS names to those IP addresses across the locations. We have domain controllers that handle authentication, and we have SANs in each location. We also have a dedicated circuit between the two locations that can be used just for replication traffic between the two sites. I have been able to set up mirroring between a clustered instance in one location and a standalone instance in the other, with that traffic flowing over the dedicated circuit.
We have a different circuit that handles application or file share traffic between the two sites.
It sounds like what I could do then is have one geographically "stretched" cluster that includes all four nodes. I could create a clustered SQL instance in LA (on just the two LA nodes) and a clustered SQL instance in NY (on just the two NY nodes). The instances would not be able to fail over from one location to the other (say from LA to NY).
Is that possible? Maybe it's not recommended? We do not have 5-10 ms latency between the two sites, so the mirroring I've done so far is asynchronous.
The flaw that has been pointed out in our current design is that in one location we have clustering for local hardware HA, but in the other location we have standalone instances, so those instances would be "down" if there was a hardware failure.
We also have VMware available in both locations. So another option I have available to me is to use virtual servers in the "secondary" location. I'm not very comfortable with that but maybe that is an option also.
October 15, 2013 at 9:20 am
Ted Zatopek (10/15/2013)
It sounds like what I could do then is have one geographically "stretched" cluster that includes all 4 nodes.
Correct, that's what I said above 😉
Ted Zatopek (10/15/2013)
I could create a clustered SQL instance in the one location in LA (on just the two LA nodes) and a clustered SQL instance in the other location in NY (on just the two nodes in NY). The instances would not be able to fail over from one location to the other (say from LA to NY).
Not sure what you mean here; you would not be able to use automatic failover within the AO group, as this is not supported when one of the AO replicas is a clustered instance of SQL Server.
Ted Zatopek (10/15/2013)
Is that possible? Maybe that's not recommended?
You may use clustered instances of SQL Server within your AO group; just be sure to check the prerequisites before deploying. Check my article at this link for more on AO groups and FCIs.
Ted Zatopek (10/15/2013)
The flaw that has pointed out in our current design is that in one location we have clustering for local hardware HA. But in the other location we have standalone instances, so those instances would be "down" if there was a hardware failure.
Basic AlwaysOn uses a Windows cluster with standalone instances installed 😉
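As a rough illustration of that layout (all server, group, and database names below are placeholders invented for this sketch, not taken from the thread), creating a basic AlwaysOn availability group across two standalone instances that share a WSFC might look like this:

```sql
-- Sketch only: assumes a WSFC already spans the two servers, AlwaysOn is
-- enabled on both instances, and a database mirroring endpoint (port 5022)
-- exists on each. AgDemo, SalesDb, NY01, LA01 are placeholder names.
CREATE AVAILABILITY GROUP AgDemo
    FOR DATABASE SalesDb
    REPLICA ON
        N'NY01' WITH (
            ENDPOINT_URL      = N'TCP://NY01.corp.local:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE     = AUTOMATIC),
        N'LA01' WITH (
            ENDPOINT_URL      = N'TCP://LA01.corp.local:5022',
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,  -- cross-site, so async
            FAILOVER_MODE     = MANUAL);

-- On the secondary, after restoring SalesDb WITH NORECOVERY:
-- ALTER AVAILABILITY GROUP AgDemo JOIN;
```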
October 15, 2013 at 10:26 am
Thanks again for your comments and your article really clears that up for me.
I can't seem to be as word efficient in my questions as you are in your responses. 🙂
So what I now understand is that in order to use an AO group, I can have the primary replica be an FCI, although the secondary replica must be a standalone instance. We are actually currently set up that way, minus the fact that the server the standalone instance is on is not part of the WSFC that the FCI is on. Once that standalone server is added to the WSFC, I'd be able to set up an AO availability group with a listener. The advantage of using an AO group with a listener is that, for applications and users, it's the same SQL instance name and port responding to requests no matter which SQL instance behind the scenes is actually processing them.
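For reference, adding a listener to an existing availability group looks roughly like this (the group name, listener name, and IP addresses are invented for illustration):

```sql
-- Sketch only: a multi-subnet listener with one static IP per site's
-- subnet. All values are placeholders.
ALTER AVAILABILITY GROUP AgDemo
    ADD LISTENER N'AgDemoListener' (
        WITH IP (
            (N'10.1.0.50', N'255.255.255.0'),   -- LA subnet
            (N'10.2.0.50', N'255.255.255.0')),  -- NY subnet
        PORT = 1433);
```

Clients then connect to `AgDemoListener,1433` regardless of which replica is primary; with a multi-subnet listener, setting `MultiSubnetFailover=True` in the connection string speeds up reconnection after a failover.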
The "flaw", if you will, that has been pointed out to me in that setup is that if we had a real disaster event we might be live on the secondary replica for an extended amount of time. If the disaster was bad enough, it could take months to rebuild the primary location from scratch when you are talking about ordering, shipping, installing, and configuring all the pieces (SAN, servers, network, SQL installs, VMware builds, application rebuilds, etc.), in addition to having to re-seed data back to that rebuilt location. During that time, if the server that runs the secondary replica (which is now the primary and only replica) had a hardware failure, the SQL instance would be down.
Although I guess I could achieve some local hardware redundancy by having another secondary replica in the same location as the first secondary replica. For example, if the primary replica went down in LA, then one of the secondary replicas in NY would take over and become the primary replica; the other secondary replica in NY would remain a secondary. If we were in that state for an extended period and were unlucky enough to have a hardware failure on the "new" primary replica in NY, I would still have that other secondary replica in NY to protect against the hardware failure. Do you see any flaw in that setup?
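Adding that second NY secondary to an existing group is a single statement per replica, along these lines (server and group names are placeholders, not from the thread):

```sql
-- Sketch only: run on the primary. NY02 must already have AlwaysOn
-- enabled and be reachable on its database mirroring endpoint.
ALTER AVAILABILITY GROUP AgDemo
    ADD REPLICA ON N'NY02' WITH (
        ENDPOINT_URL      = N'TCP://NY02.corp.local:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = MANUAL);
```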
If I wanted to use an FCI at both locations (LA and NY), then I would not be able to use AO groups. I would instead be using mirroring to keep the two FCIs in sync. The reason I have to consider that is we have a mixed SQL 2008 R2 and 2012 environment, with some mission-critical apps that cannot currently run on a SQL 2012 instance, so unfortunately AO is not an option for those apps.
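For the SQL 2008 R2 apps, classic database mirroring between the two FCIs would look roughly like this (the endpoint port, domain suffix, and database name are assumptions; SqlLA and SqlNY are the FCI network names from earlier in the thread):

```sql
-- Sketch only. On each FCI, create a database mirroring endpoint:
CREATE ENDPOINT Mirroring
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (ROLE = PARTNER);

-- On the mirror (NY FCI), after restoring the database WITH NORECOVERY,
-- point it at the principal:
ALTER DATABASE SalesDb SET PARTNER = N'TCP://SqlLA.corp.local:5022';

-- On the principal (LA FCI), point it at the mirror:
ALTER DATABASE SalesDb SET PARTNER = N'TCP://SqlNY.corp.local:5022';

-- Given the inter-site latency, run asynchronously (high-performance mode):
ALTER DATABASE SalesDb SET PARTNER SAFETY OFF;
```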
October 15, 2013 at 12:44 pm
Ted Zatopek (10/15/2013)
So what I now understand is that in order to use an AO group, I can have the primary replica be a FCI, although the secondary replica must be a standalone instance.
Not strictly a "must be". Even if you set up the FCI as the primary replica, as soon as a failover occurs within the AO group the FCI effectively becomes a secondary replica.
Ted Zatopek (10/15/2013)
We are actually currently set up that way, minus the fact that the server the standalone instance is on is not part of the WSFC that the FCI is on. Once that standalone server is added to the WSFC, I'd be able to set up an AO availability group with a listener. The advantage of using an AO group with a listener is that, for applications and users, it's the same SQL instance name and port responding to requests no matter which SQL instance behind the scenes is actually processing them.
Correct, and if you enable read-only routing you can have the listener direct read-only intent requests to a specified secondary.
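Configuring that routing is a per-replica setting; a hedged sketch (all names are invented placeholders):

```sql
-- Sketch only: allow read-intent connections on the NY replica when it is
-- a secondary, give it a routing URL, then tell the primary-role replica
-- where to route read-only intent requests.
ALTER AVAILABILITY GROUP AgDemo
    MODIFY REPLICA ON N'NY01' WITH (
        SECONDARY_ROLE (
            ALLOW_CONNECTIONS     = READ_ONLY,
            READ_ONLY_ROUTING_URL = N'TCP://NY01.corp.local:1433'));

ALTER AVAILABILITY GROUP AgDemo
    MODIFY REPLICA ON N'LA01' WITH (
        PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'NY01')));
```

Clients opt in by adding `ApplicationIntent=ReadOnly` to their connection string while connecting to the listener.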
Ted Zatopek (10/15/2013)
The "flaw" if you will that has been pointed out to me in that setup would be if we had a real disaster event we might be live on the secondary replica for an extended amount of time. If the disaster was bad enough it could take months to rebuild the primary location from scratch when you are talking about ordering, shipping, installing, and configuring all the pieces (SAN, Servers, Network, SQL installs, VMWARE builds, application rebuilds, etc.) in addition to having to re-seed data back to that re-built location. During that time if the server that runs the secondary replica (which is now the primary and only replica) had a hardware failure the SQL instance would be down.
Flaw? How do you work that out?
You mitigate any immediate server hardware failure by clustering the primary replica. If all else fails and you have a true catastrophic site failure, the secondary site will do exactly what it's supposed to do.
In theory you have server backups; you should be able to ship, rebuild, and restore from tape all your primary site servers. Some companies perform this exercise periodically to ensure their backups are good.
Ted Zatopek (10/15/2013)
If I wanted to use an FCI instance at both locations (LA and NY), then I would not be able to use AO groups.
Incorrect, an FCI may act as a secondary, so the config above is valid, although quite why you would want to mitigate server failure at the DR site is not clear.
October 15, 2013 at 1:35 pm
Gotcha. I think I went from "automatic failover within the AO group is not supported when one of the AO replicas is a clustered instance of SQL Server" to "the secondary replica cannot be an FCI" in my mind. I see the inference and logic error I made. Thanks for clarifying; this has been very helpful. It would be a manual failover, not an automatic failover.
Flaw may not be the best word. The reason why we would want to mitigate server hardware failure at the DR site is that we could be operating at the DR site for an extended period of time. If the event at the primary site was catastrophic enough this could be weeks or months. I am asked to provide a design that would then mitigate a SQL server hardware failure while operating at the DR site for say months. While we may not implement that design fully right away I need to have a plan and a ballpark cost (hardware, software licenses, SAN space, etc) associated with it.
So one way to have hardware failure mitigation at the DR site would be to have the DR instance be an FCI.
I'm guessing the way I would do that would be to first add the DR site servers as nodes to the geographically stretched WSFC from the primary site, and then install a new FCI on those two DR servers.
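The cluster side of that step could be sketched in PowerShell (cluster and node names are placeholders); the FCI itself is then installed through SQL Server setup rather than T-SQL:

```powershell
# Sketch only: run on an existing cluster node with the Failover
# Clustering tools installed. "StretchCluster", NY01, NY02 are placeholders.
Import-Module FailoverClusters
Add-ClusterNode -Cluster "StretchCluster" -Name "NY01"
Add-ClusterNode -Cluster "StretchCluster" -Name "NY02"
# Then run SQL Server setup on NY01 ("New SQL Server failover cluster
# installation") and on NY02 ("Add node to a SQL Server failover cluster").
```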
October 15, 2013 at 3:25 pm
Ted Zatopek (10/15/2013)
we could be operating at the DR site for an extended period of time.
Rare, but possible, I suppose. Failover to DR should be considered a short-term action; you should have extensive plans for recovery of your primary site, such as emergency hardware procurement, retrieval of backup tapes, etc. Basically, short of a bomb blast, your primary site should be up and running again as quickly as possible.
Just out of interest, how far from your primary site is your DR site?
January 14, 2014 at 2:48 pm
Sorry there was such a delay since the last post; I lost track of this thread.
The distance between the two datacenters is about 1000 miles.
We are now going through a whole redesign of our datacenters, including locating them in dedicated facilities that manage the basic infrastructure like power, cooling, etc.