October 16, 2023 at 4:04 pm
Hi All,
I have a two-node AG cluster with a primary and a read-only secondary. About two months ago this issue started happening: at random, the primary and secondary stop syncing. The secondary still shows as synchronizing (async mode), but the AG dashboard shows it as delayed and the delay just keeps accumulating. I am aware that during index maintenance and other heavy operations the secondary falls behind the primary, but normally the sync starts to catch up once those operations end. This is different: the sync just gets stuck and does not recover even when there is almost no activity on the primary.
I have let it run almost the whole night, when all the counters are very low, but the sync does not resume. Even when I restart the SQL Server services on the secondary, the system comes back up and, once the DB recovery process has run, the sync shows no delay, but then the delay starts accumulating again from scratch. The only resolution I have found so far is to restart the server, after which the issue goes away.
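While it is in this state, I have also been checking the send and redo queues directly rather than relying on the dashboard, with roughly this query against the standard AG DMVs (the column list is just what I find useful):

-- Synchronization state and queue sizes per database replica (queue sizes in KB)
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)        AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.log_send_queue_size,        -- log not yet sent to the secondary
       drs.redo_queue_size,            -- log received but not yet redone
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id;

The two queue columns make it easy to see whether the log is piling up on the primary side (send queue) or on the secondary side (redo queue).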
I have never come across this issue before, nor do I see any mention of it in my Google searches. Any help or direction any of you can provide would be really appreciated.
Thanks in advance.
October 16, 2023 at 8:13 pm
Have you attempted suspending and resuming data movement? That doesn't resolve the underlying problem, but it may be less of a hassle than rebooting until the issue is properly diagnosed.
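Something along these lines, run against the affected database on the secondary (the database name below is just a placeholder):

-- Suspend data movement for the availability database, then resume it
ALTER DATABASE [YourAGDatabase] SET HADR SUSPEND;
GO
ALTER DATABASE [YourAGDatabase] SET HADR RESUME;
GO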
Is your organization using snapshot backups for VMs?
Are there any errors in the failover clustering logs that correspond with the times the syncing stops?
October 17, 2023 at 2:55 pm
I have tried suspending and resuming, but it does not fix the issue. We do have Veeam backups running on the secondary node that use snapshot technology, but the issue also happens at other times when no backups are running (they only run once a day, late at night).
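To rule the snapshots in or out, I have been comparing the stall times against the backup history recorded in msdb, roughly like this (just a quick sketch; Veeam only shows up here when it runs application-aware/VSS backups):

-- Recent backups, flagging any taken through a snapshot/VSS provider
SELECT TOP (50)
       bs.database_name,
       bs.backup_start_date,
       bs.backup_finish_date,
       bs.type        AS backup_type,  -- D = full, I = differential, L = log
       bs.is_snapshot                  -- 1 = taken via a snapshot provider
FROM msdb.dbo.backupset AS bs
ORDER BY bs.backup_start_date DESC;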
There are no errors in the failover cluster logs. The only resolution so far is a reboot; even restarting the SQL Server services does not resolve the issue.
October 17, 2023 at 3:33 pm
I would look closer at Veeam. I have succeeded in getting it off of, and keeping it off of, my SQL Servers for the last few years, but it tended to create a lot of problems when inadequately supported. Orphaned snapshots, at least, used to be a regular problem that could cause something like this.
With something like this, I'd also start throwing patches, drivers, and firmware at the problem if the environment is behind and nothing else obvious is the issue.
October 20, 2023 at 11:13 pm
Thank you,
I have been thinking about moving the Veeam snapshot backup back to the primary, but then again I don't want to mess up the primary node either.
Patching is definitely next in line to be applied; we are not that far behind the latest. It is a virtual machine.
Any other ideas, guys?
October 23, 2023 at 4:11 pm
Patching involves everything: hardware, virtualization, OS, and the database instance. All of these should be on a similar cadence; they are written to work with each other at the same level.
If you are doing snapshots, this is particularly important: every snapshot-based backup product I have dealt with has been intolerant of mismatched patching levels.
I would question the snapshot backups. Are you using snapshots for your SQL backups, or only for the OS?
If it is only for the OS, I would see whether the data disks can be excluded; trying to quiesce the SQL databases is pointless if you are already backing them up natively. If SQL is included in the snapshot backups, are application-consistent backups implemented? Are they using VSS? Are the SAN, servers, virtualization platform, backup software, Windows, and SQL all updated to similar levels of patches, firmware, and drivers?
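One quick way to check whether a VSS provider is actually freezing the databases is to search the SQL Server error log for the freeze/thaw messages around the times the sync stalls, for example (xp_readerrorlog is undocumented, so treat this as a convenience check rather than a supported interface):

-- Search the current SQL Server error log for VSS freeze and thaw messages
EXEC sys.xp_readerrorlog 0, 1, N'I/O is frozen';
EXEC sys.xp_readerrorlog 0, 1, N'I/O was resumed';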