June 29, 2020 at 6:25 pm
I am using a replication tool that has been having issues during the last couple of patch cycles. I am doing some troubleshooting on certain issues that we have noticed lately.
The servers are SQL Server 2017. The AGs were set to manual failover. A failover shouldn't have happened, and the log chain shouldn't have been broken. So I am trying to determine what happened before the next patching cycle.
Thanks for any and all input.
Things will work out. Get back up, change some parameters and recode.
June 30, 2020 at 1:13 pm
June 30, 2020 at 2:41 pm
Great information.
During patching, both servers are in the same group to be rebooted. We never thought about separating the servers that are involved in an AG and then ensuring the secondaries are rebooted first.
The odd thing about this issue was that the AG was set to manual. There shouldn't have been a failover when the servers are set to manual.
When the failover occurred, the backup ran (since the replica was now the primary). A backup is now on the secondary replica (now primary).
Then another failover must have occurred although I am not seeing that in the logs. But the original primary is now back to being primary. We start the replication tool and bam. There is a replication error due to a missing log file.
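A rough way to confirm that kind of gap is to compare the LSN ranges of the log backups recorded on each node. The sketch below is only an illustration: it assumes the `first_lsn` and `last_lsn` values have already been pulled from `msdb.dbo.backupset` on each replica (backup history is kept per instance), and the `LogBackup` type, the `find_log_chain_gaps` helper, and the node names are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class LogBackup(NamedTuple):
    server_name: str  # replica that took the backup (hypothetical names below)
    first_lsn: int    # first_lsn as recorded in msdb.dbo.backupset
    last_lsn: int     # last_lsn as recorded in msdb.dbo.backupset


def find_log_chain_gaps(
    backups: Iterable[LogBackup],
) -> List[Tuple[LogBackup, LogBackup]]:
    """Return adjacent backup pairs where the next one starts after the previous one ends."""
    ordered = sorted(backups, key=lambda b: b.first_lsn)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        # In an unbroken chain each log backup starts where the previous one ended.
        if curr.first_lsn > prev.last_lsn:
            gaps.append((prev, curr))
    return gaps


# Made-up LSN values: NODE2 took one log backup while it was briefly primary.
node1_history = [LogBackup("NODE1", 100, 200), LogBackup("NODE1", 300, 400)]
node2_history = [LogBackup("NODE2", 200, 300)]

print(find_log_chain_gaps(node1_history))                  # gap between 200 and 300
print(find_log_chain_gaps(node1_history + node2_history))  # no gaps once combined
```

If the gap disappears once both nodes' histories are combined, the missing log backup was most likely taken on the other replica and written to its local storage.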
June 30, 2020 at 8:19 pm
These kinds of issues are one reason I do not recommend auto-patching and restarting clusters of any kind. These should always be manually patched and restarted to ensure that quorum is always maintained during the process and each node is patched, restarted and ready before it can be utilized to support any services.
One of the problems with maintenance in an AG is the non-shared storage between nodes. Ideally, in this type of environment you would have network storage available and set up your maintenance to back up to that network storage using the UNC path. This would alleviate any issues/concerns with which node is performing the backups.
If possible - when patching a cluster I recommend failing over only a single time, that way you run for a month on node 1 - the next month on node 2 - and the following month you are back on node 1. The order of operations would be:
This reduces overall downtime to a single fail over event - and allows for patching and restarting to occur prior to the scheduled downtime. The patching and restarting does not impact the active system and therefore does not impact the users or the application - the only impact is the fail over.
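One possible reading of that single-failover approach, sketched as a small plan generator. The exact steps are an assumption based on the description above, not the poster's own list, and `rolling_patch_plan` and the node names are hypothetical.

```python
from typing import List


def rolling_patch_plan(primary: str, secondaries: List[str]) -> List[str]:
    """Return an ordered list of steps that fails over exactly once."""
    if not secondaries:
        raise ValueError("need at least one secondary to fail over to")
    steps = []
    # Patch every secondary first; the active primary is untouched so far.
    for node in secondaries:
        steps.append(f"patch and restart {node}")
        steps.append(f"verify {node} is back in the AG and synchronized")
    # The single failover, done in the scheduled downtime window.
    new_primary = secondaries[0]
    steps.append(f"fail over from {primary} to {new_primary}")
    # The old primary is now a secondary and can be patched without user impact.
    steps.append(f"patch and restart {primary}")
    steps.append(f"verify {primary} is back in the AG and synchronized")
    return steps


for step in rolling_patch_plan("NODE1", ["NODE2"]):
    print(step)
```

The new primary stays primary until the next patch cycle, when the roles are swapped in the call.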
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
July 18, 2020 at 11:40 pm
For AlwaysOn AGs, there is an Extended Events health session (AlwaysOn_health) that can be examined to find information about failovers. I would also be curious to know what the SQL Server Error Log shows happened during the time you're describing.
I would agree with you that if both nodes in the AG are set to manual failover, then an unexpected failover should not occur. Be sure to check the AG settings to ensure that both nodes are set to manual failover.
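A minimal sketch of that check, assuming the rows come from the `sys.availability_replicas` and `sys.availability_groups` catalog views and have already been fetched as dictionaries by whatever driver you use. The `replicas_not_manual` helper and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List

# Query to run on the primary; fetching the rows is left to your driver of choice.
FAILOVER_MODE_QUERY = """
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ar.failover_mode_desc
FROM sys.availability_replicas AS ar
JOIN sys.availability_groups AS ag
  ON ag.group_id = ar.group_id;
"""


def replicas_not_manual(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the replicas whose failover mode is anything other than MANUAL."""
    return [r for r in rows if r["failover_mode_desc"].upper() != "MANUAL"]


# Made-up result set for illustration only.
sample_rows = [
    {"ag_name": "AG1", "replica_server_name": "NODE1", "failover_mode_desc": "MANUAL"},
    {"ag_name": "AG1", "replica_server_name": "NODE2", "failover_mode_desc": "AUTOMATIC"},
]

for row in replicas_not_manual(sample_rows):
    print(row["ag_name"], row["replica_server_name"], row["failover_mode_desc"])
```

An empty result would support the claim that every replica is set to manual failover.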
The behavior you're describing with the Ola Hallengren scripts sounds like what happens when you take a backup of a database when it actually isn't part of an AG. If a database is part of an AG, then Ola places all the backups in a single directory.