Always on availability group vs Always on Failover cluster instance question

Question

Always on availability group vs Always on Failover cluster instance question

GreatPancake

Old Hand

Points: 337
More actions
March 21, 2024 at 9:46 am

#4399376
Hello,
we are currently running a WSFC cluster two node setup (nodes are on two different locations) and we have a few instances (roles) on this cluster. We are using SAN as a storage solution. Because we are getting new hardware we explored a few options for HA and DR. We need to provide 0 data loss, and in case of a downed site, failover must be completed under 1 minute. Because of those requirements we are leaning more towards Availability groups. However after seting up a test system there are a few unanswered questions we have, which I would like to ask if anyone have some experience with:
1. In the official documentation it is stated that one of the main PROs for Availability groups over Always on FCI is the default feature of DR. However we can achive DR on our FCI at this time with a SAN. So are we missing some other benefits in Availability groups around the topic of DR?
2. We have a lot of small databases on our instances which ocasionally need to have new SQLAgent jobs added and there are a few requests per month to add logins and users to them. I found a solution that automatically sync those instance based changes, however do you find it seamless to use such automation in a production system? Are there any drawbacks regarding this? Because at this time it is pretty much straight forward doing this on our FCI cluster.
3. We are not looking into using more than 2 nodes (primary and secondary) for each of our instances. We are however looking into using our secondary node as a read-only node. With that I have a main question about licensing. In that regard will it still be core based licensing (we do know that with AGs we can only have Enterprise licenses)? So lets say we will have all 5 instances running on virtualized SQL Servers (that is each instance on their own two clustered virtualized servers. One server on a physicall server on location A and the other on location B). And lets say that each physical server which run those virtualized servers has 16 cores. Will we only be paying Enterprise licence for those 32 cores (so 16 per physical server) or does Availability groups introduce new license requirements.
4. Last question is about listeners. We have our connection string setup as: SQLA\instanceA, SQLB\instanceB and so on... Will we be able to recreate this with a listener so that we won't need to ask our programmers to change connection strings in all of their applications? What I read about it is that listener name can be max 15 characters long and it can contain dashes(-) and underscores (_). I did hear and read on the forums that a few managed to configure listener name as their previous connection string, however I guess it depends from case to case (name to name).
I know we are 12 years late to the party and probably (I hope) the questions I asked here were answered before so I can find them somewhere, however so far I didn't find any difinitive answer. So that is why I am posting my questions here. Any help is really apriciated.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

Ant-Green SSC Guru Points: 113507 More actions · Answer 1

1 - So you're shared storage isn't actually shared in your current setup then, you'll have 1 SAN in the Primary site, and 1 SAN in the DR site and you'll be doing some block level type replication to keep the database updated on each SAN.

How is your SAN ensuring that the blocks are copied correctly and not out of order? Have you ever failed over the FCI from the primary to DR? If so have you ran DBCC CHECKDB on the block replicated copy of the DB?

I've seen cases in the past where doing block level replication has caused huge corruption in the copied blocks as things where copied out of sequence.

You may have gotten lucky and had no problems, and kudos if you have, but for me personally I wouldn't be block level copying if I can avoid it.

With an AG, the data is copied in sequence and applied in sequence, and it has built in corruption checking and can ask any replicas for the correct uncorrupted page should they have a clean copy, so you get some added peace of mind that corruption for the replication pieces is minimised, it's not totally gone away as certain things can still happen, but the risk is minimised.

2 - If you are going to be using SQL 2022, then look at contained availability groups and see if they will fit your use case, with this master and msdb are part of the AG, so you don't need to automate the copying of logins/jobs etc.

However if you can't use 2022 or need to use regular AG's then yes automation with things like DBATools to Sync-DbaAvailabilityGroup can help you tremendously in automating and keeping system wide objects in sync between replicas.

3 - As always with anything licensing, speak to your reseller, you may have terms and conditions imposed on you that other people do/don't have, so what you need for licensing can be different to others.

But if you are using the other side as a readable copy, then yes you will need to license ALL the CPU's in EVERY replica, just how you do that, you're reseller will be able to advise.

4 - Not exactly, what you should do is instead create CNAMEs in DNS, as I am going to hazard a guess you won't be shutting down SQLA / SQLB, then building your AGs, then bringing them into service, you're going to be doing some form of side by side migration.

So what you would want to do is you either bite the bullet now, create a CNAME for each SQLA/SQLB/SQLC etc etc that you have, and have your developers change the connection strings now. CNAMEA, CNAMEB, CNAMEC etc, which point to the DNS records of SQLA, SQLB, SQLC etc.

Or as part of your cut over plan from FCI to AG, once you shut down the FCIs, you create the CNAMEs with the old FCI names. Point these CNAMES to the listeners, so a new CNAME called SQLA points to ListenerA, SQLB points to ListenerB etc.

The issue with doing it as part of cut over is, if you ever need to shutdown the AG's and go back to FCI's as then you'll have conflicting DNS entries in the form of CNAME and A records, so it's an additional thing to remember to remove if you need to rollback.

So for me, go through the pain now, create CNAMES now, change your applications now, then on cutover repoint the CNAMES to the listeners.

Beauty of CNAMES is, then next time you do a migration, you don't need to change them again, you just point the CNAME to the new host/listener.

The 1 minute failover time has to be tested also. What is the connection like between Primary and DR? What bandwidth do you have? What latency do you have? How saturated is the link?

As for that low a time to failover you'll probably want the AG's to be in SYNC mode, which is not generally is advised when going between sites, you would generally have it in ASYNC mode.

With it being in SYNC mode, the transactions are going to need to harden at both sides before the transaction can move on, so if you have a stupidly high latency between the two sites, you're going to introduce bottlenecks to your processing, as transactions wait for the replicas to confirm they have both received the TXN and it has been hardened to the log.

GreatPancake Old Hand Points: 337 More actions · Answer 2

Hello,

thanks for the reply, it helps a lot.

1 - You are correct. We do have a replication between the two SAN storages and they are in two different locations. However to further anwser to your question I would have to contact our SAN admins about it to be honest, since they were the ones who did the initial setup.

2 - Nice, I think contained availability groups will do the trick, otherwise I will stick with automated tasks.

3 - About licensing I will talk with our license guy and then probably with our reseller. Thanks.

4 - We actually thought we could use our downtime hours to migrate the data between our FCI cluster and AG cluster. And maybe do this in a couple of weeks one instance at a time if it would take too long of a time to migrate all of them at once. When one of the instance would be migrated, we would shut down the instance completely. And then we would somehow rename the listeners to our previous names. However now that i look at your reply, the closest thing is to do the steps you wrote here:

"Or as part of your cut over plan from FCI to AG, once you shut down the FCIs, you create the CNAMEs with the old FCI names. Point these CNAMES to the listeners, so a new CNAME called SQLA points to ListenerA, SQLB points to ListenerB etc."

So in this case we would be safe from refractoring connection strings in our applications, but it would be a pain to go back if anything goes wrong right?... We are kind of pushed into a corner at this connection strings problem, because from our side this can be done as you suggested. And if it was my call I would do actually exactly as you said since this is a once in a decade thing to do, and it would be great to do it right, however this will probably be the last thing management would sign off on. I also forgot to add this link at my OP: https://www.sqlservercentral.com/forums/topic/migrating-from-failover-cluster-to-alwayson (they are using Aliases on DNS to achieve this). Is this something we could pursue, it does sound similar to your plan?

And as for the latency, the distance between two datacenters is few miles nothing too big. So latency was never an issue. As for Availability groups, we would need Synchronous mode with automatic failover mode, yes.

Last thing to note is, everywhere we looked we were under the impression Availability groups are the way to go forward for our specific case (we were in a few Microsoft courses about AG and HR, and basically the main focus was on AGs aswell). We don't have any HUGE problems running FCI at this time it provides HA and we do have DR with our setup. However we did have a lot of small instances of annoyance with server performance on this kind of setup. We have some instances who have bigger traffic than others, that means whenever there were some extra problems with one instance it would slow down the others. So because of that we thought we would try the AG aproach.

Ant-Green SSC Guru Points: 113507 More actions · Answer 3

And as for the latency, the distance between two datacenters is few miles nothing too big. So latency was never an issue. As for Availability groups, we would need Synchronous mode with automatic failover mode, yes.

OK, best thing I would suggest is test it work a reasonable workload, your SAN to SAN replication is going to be more of an ASYNC process rather than SYNC, so the traffic isn't noticeable to the applications, but putting your AG into SYNC, it will be noticeable to your applications so it may be unacceptable for it to be in SYNC mode if it is delaying transaction times.

I also forgot to add this link at my OP: https://www.sqlservercentral.com/forums/topic/migrating-from-failover-cluster-to-alwayson (they are using Aliases on DNS to achieve this). Is this something we could pursue, it does sound similar to your plan?

Yeah a DNS Alias or CNAME the same thing, the problem you will have though is the instance names, as listeners you wont generally use an instance name against them and just use a typical NetBIOS name.

Yeah they push AG's hard now as it's the easiest way to do HADR in Azure/Cloud based setups, you can do FCI in the cloud but it's a hell of a lot more complex than AG's and MSFT is always wanting to push you into Azure for those extra cloud bucks.