March 21, 2024 at 9:46 am
Hello,
we are currently running a WSFC cluster two node setup (nodes are on two different locations) and we have a few instances (roles) on this cluster. We are using SAN as a storage solution. Because we are getting new hardware we explored a few options for HA and DR. We need to provide 0 data loss, and in case of a downed site, failover must be completed under 1 minute. Because of those requirements we are leaning more towards Availability groups. However after seting up a test system there are a few unanswered questions we have, which I would like to ask if anyone have some experience with:
I know we are 12 years late to the party and probably (I hope) the questions I asked here were answered before so I can find them somewhere, however so far I didn't find any difinitive answer. So that is why I am posting my questions here. Any help is really apriciated.
March 21, 2024 at 10:41 am
1 - So you're shared storage isn't actually shared in your current setup then, you'll have 1 SAN in the Primary site, and 1 SAN in the DR site and you'll be doing some block level type replication to keep the database updated on each SAN.
How is your SAN ensuring that the blocks are copied correctly and not out of order? Have you ever failed over the FCI from the primary to DR? If so have you ran DBCC CHECKDB on the block replicated copy of the DB?
I've seen cases in the past where doing block level replication has caused huge corruption in the copied blocks as things where copied out of sequence.
You may have gotten lucky and had no problems, and kudos if you have, but for me personally I wouldn't be block level copying if I can avoid it.
With an AG, the data is copied in sequence and applied in sequence, and it has built in corruption checking and can ask any replicas for the correct uncorrupted page should they have a clean copy, so you get some added peace of mind that corruption for the replication pieces is minimised, it's not totally gone away as certain things can still happen, but the risk is minimised.
2 - If you are going to be using SQL 2022, then look at contained availability groups and see if they will fit your use case, with this master and msdb are part of the AG, so you don't need to automate the copying of logins/jobs etc.
However if you can't use 2022 or need to use regular AG's then yes automation with things like DBATools to Sync-DbaAvailabilityGroup can help you tremendously in automating and keeping system wide objects in sync between replicas.
3 - As always with anything licensing, speak to your reseller, you may have terms and conditions imposed on you that other people do/don't have, so what you need for licensing can be different to others.
But if you are using the other side as a readable copy, then yes you will need to license ALL the CPU's in EVERY replica, just how you do that, you're reseller will be able to advise.
4 - Not exactly, what you should do is instead create CNAMEs in DNS, as I am going to hazard a guess you won't be shutting down SQLA / SQLB, then building your AGs, then bringing them into service, you're going to be doing some form of side by side migration.
So what you would want to do is you either bite the bullet now, create a CNAME for each SQLA/SQLB/SQLC etc etc that you have, and have your developers change the connection strings now. CNAMEA, CNAMEB, CNAMEC etc, which point to the DNS records of SQLA, SQLB, SQLC etc.
Or as part of your cut over plan from FCI to AG, once you shut down the FCIs, you create the CNAMEs with the old FCI names. Point these CNAMES to the listeners, so a new CNAME called SQLA points to ListenerA, SQLB points to ListenerB etc.
The issue with doing it as part of cut over is, if you ever need to shutdown the AG's and go back to FCI's as then you'll have conflicting DNS entries in the form of CNAME and A records, so it's an additional thing to remember to remove if you need to rollback.
So for me, go through the pain now, create CNAMES now, change your applications now, then on cutover repoint the CNAMES to the listeners.
Beauty of CNAMES is, then next time you do a migration, you don't need to change them again, you just point the CNAME to the new host/listener.
The 1 minute failover time has to be tested also. What is the connection like between Primary and DR? What bandwidth do you have? What latency do you have? How saturated is the link?
As for that low a time to failover you'll probably want the AG's to be in SYNC mode, which is not generally is advised when going between sites, you would generally have it in ASYNC mode.
With it being in SYNC mode, the transactions are going to need to harden at both sides before the transaction can move on, so if you have a stupidly high latency between the two sites, you're going to introduce bottlenecks to your processing, as transactions wait for the replicas to confirm they have both received the TXN and it has been hardened to the log.
March 21, 2024 at 3:18 pm
Hello,
thanks for the reply, it helps a lot.
1 - You are correct. We do have a replication between the two SAN storages and they are in two different locations. However to further anwser to your question I would have to contact our SAN admins about it to be honest, since they were the ones who did the initial setup.
2 - Nice, I think contained availability groups will do the trick, otherwise I will stick with automated tasks.
3 - About licensing I will talk with our license guy and then probably with our reseller. Thanks.
4 - We actually thought we could use our downtime hours to migrate the data between our FCI cluster and AG cluster. And maybe do this in a couple of weeks one instance at a time if it would take too long of a time to migrate all of them at once. When one of the instance would be migrated, we would shut down the instance completely. And then we would somehow rename the listeners to our previous names. However now that i look at your reply, the closest thing is to do the steps you wrote here:
So in this case we would be safe from refractoring connection strings in our applications, but it would be a pain to go back if anything goes wrong right?... We are kind of pushed into a corner at this connection strings problem, because from our side this can be done as you suggested. And if it was my call I would do actually exactly as you said since this is a once in a decade thing to do, and it would be great to do it right, however this will probably be the last thing management would sign off on. I also forgot to add this link at my OP: https://www.sqlservercentral.com/forums/topic/migrating-from-failover-cluster-to-alwayson (they are using Aliases on DNS to achieve this). Is this something we could pursue, it does sound similar to your plan?
And as for the latency, the distance between two datacenters is few miles nothing too big. So latency was never an issue. As for Availability groups, we would need Synchronous mode with automatic failover mode, yes.
Last thing to note is, everywhere we looked we were under the impression Availability groups are the way to go forward for our specific case (we were in a few Microsoft courses about AG and HR, and basically the main focus was on AGs aswell). We don't have any HUGE problems running FCI at this time it provides HA and we do have DR with our setup. However we did have a lot of small instances of annoyance with server performance on this kind of setup. We have some instances who have bigger traffic than others, that means whenever there were some extra problems with one instance it would slow down the others. So because of that we thought we would try the AG aproach.
March 21, 2024 at 3:47 pm
And as for the latency, the distance between two datacenters is few miles nothing too big. So latency was never an issue. As for Availability groups, we would need Synchronous mode with automatic failover mode, yes.
OK, best thing I would suggest is test it work a reasonable workload, your SAN to SAN replication is going to be more of an ASYNC process rather than SYNC, so the traffic isn't noticeable to the applications, but putting your AG into SYNC, it will be noticeable to your applications so it may be unacceptable for it to be in SYNC mode if it is delaying transaction times.
I also forgot to add this link at my OP: https://www.sqlservercentral.com/forums/topic/migrating-from-failover-cluster-to-alwayson (they are using Aliases on DNS to achieve this). Is this something we could pursue, it does sound similar to your plan?
Yeah a DNS Alias or CNAME the same thing, the problem you will have though is the instance names, as listeners you wont generally use an instance name against them and just use a typical NetBIOS name.
Yeah they push AG's hard now as it's the easiest way to do HADR in Azure/Cloud based setups, you can do FCI in the cloud but it's a hell of a lot more complex than AG's and MSFT is always wanting to push you into Azure for those extra cloud bucks.
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply