January 19, 2020 at 10:03 pm
Hi all
Bit if an unusual one here, and you’ll have to forgive me for the lack of examples, it was found while recovering from a big failure in our production environment.
We use SQL Server 2016 with Availability Groups, and we’re configured with a readable secondary which is where we take our backups from. I’m aware of certain nuances around this, but we have ~15TB of data which takes a while to backup and ship into online storage.
Our cluster configuration was altered while introducing a third node to the cluster. Basically, the disks were reconfigured as cluster resources, which took every node in the cluster down.
After a certain amount of poking, we were about to bring both nodes backup, albeit as separate instances. Primary databases came back up in a restoring state, and came online easily. These were added into an availability group. Secondary databases however, came back up with a “Recovery pending” state.
One small and fairly static database rejoined the AG easily, but the other more transactional ones all errored when we attempted the same, giving messages that the LSNs were out of sync. We attempted to restore transaction logs onto the secondary databases, but all were telling us that even our latest transaction log backup was too early to be applied.
In the end we were able to recover the secondary databases by taking a fresh full backup from the primary and restoring it across, but I’m sure you can imagine that added considerable time to our recovery. Given the errors we’d received, I had no faith in the log chain we already had, and didn’t want to spend 5 hours bringing the secondary up if it was just going to give us the same error that we’d already been receiving. But this has given me doubts that our secondary backups are good enough in this case. I’ll need to try to replicate what we saw in a development environment.
But I have a few questions:
1) Does the recovery pending state actually allow transaction log backups to be restored on top? If no, I suppose the remaining questions are somewhat moot.
2) Since SQL Server was telling us our latest log backup was too early to be applied, could we have been victim of the failure happening at the exact time a log backup was being taken? If every disk became unavailable at the point of failure, it seems unlikely SQL Server would be able to rollback the transaction log backup, although I’ll admit to not knowing exactly what happens under the covers in situations like these.
3) Is this just another nuance of taking backups from a readable secondary that is in asynchronous commit?
We need to decide what to do with our backups, but I have no faith that they will help us should we lose only a secondary node in the future. They’re currently being taken from the Primary as a temporary measure.
Many thanks for any responses, apologies for the vague nature of the post.
January 20, 2020 at 10:10 pm
Thanks for posting your issue and hopefully someone will answer soon.
This is an automated bump to increase visibility of your question.
January 22, 2020 at 6:33 am
1) Does the recovery pending state actually allow transaction log backups to be restored on top? If no, I suppose the remaining questions are somewhat moot.
RECOVERY PENDING is not the same as RESTORING, which does allow logs to be restored. RECOVERY PENDING indicates the server is unable to complete the recovery process. In an AG, this also means it cannot apply log blocks to synchronise the database. this doesn't necessarily mean you need to restore from backup, it depends what the error is that is preventing recovery from continuing.
In some cases similar to yours, I've seen the problem fixed by restarting the SQL Server secondary instance because the problem came about because the disks with the log files were unavailable when SQL Server started. Restarting caused the recovery process to restart and with accessible log files, it completed successfully. I'm not saying this was your root cause or would've fixed your issue, I'm simply stating that there is a root cause for RECOVERY PENDING and without the logs to determine that, its difficult to say if restoring from backup was your only option.
2) Since SQL Server was telling us our latest log backup was too early to be applied, could we have been victim of the failure happening at the exact time a log backup was being taken? If every disk became unavailable at the point of failure, it seems unlikely SQL Server would be able to rollback the transaction log backup, although I’ll admit to not knowing exactly what happens under the covers in situations like these.
The "log backup is too early" was probably caused by the interrupted/incomplete recovery process on these databases. The database hadn't rolled forward to an LSN that you could then apply these backups too, hence the error. Again, fixing the stalled recovery process would've likely corrected this issue as well.
3) Is this just another nuance of taking backups from a readable secondary that is in asynchronous commit?
Not really. Offloaded backups are designed to maintain valid log chains just like backups on a standalone instance. Your issue doesn't sound like a problem with your backup regime, but rather the method in which you recovered from the major outage, i.e. not identifying and resolving the RECOVERY PENDING state before moving to DB restores.
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply