Hi All,
I am looking for some inputs on how to avoid or prevent this particular scenario which happened yesterday. there was downtime of around 2 hours.
Below is the scenario
We have a Multi-subnet Alwayon AG configured with 10 databases. AG has 3 nodes. All are azure VMs.
2 nodes in one region and 1 node is in a different region.We are using SQL Server 2017 Enterprise edition CU23 version.
Issue description:
On Primary replica one of the user db's transaction log file of a database which is part of AG got full (9002 log full error).
The mdf is 20gb and log file has grown ~ 400GB. The data and log files are located in separate drives. every hour we take log backups.
The drive capacity of log file is 500GB.
So, to minimize the downtime , we tried adding space on secondary synchronous replica and wanted to fail over.
After adding space and when we rebooted the server didn't come online. it took like 20 mins to come up.
Secondly, the cluster service was not coming online due to which , we aren't able to do a fail over to the secondary node.
After systems escalation team was involved , they finally was able to bring the cluster service online which took another 30-45 mins to analyze the logs. We eventually failoved over to the secondary and made it primary. We followed same steps to add it on remaining nodes.
We are suspecting all this could have happened due to a long running open txn and it is waiting for AVAILABILITY_REPLICA log_reuse_wait_desc. We asked application team if they have run an transaction, they said they haven't.
Observations:
1. We have disk space monitoring in place, but the incident was with system's team and they haven't notified.
2. We got an sql server alert only for log full 9002 error.
I am looking for inputs on how can avoid some disk space issues in future.
My thoughts
- > I am thinking about configuring another sql alert to send email when the log file is reached 50% .
- > restrict auto growth of the log file
Please share more thoughts on this on how to minimize of handle such situations more efficiently.
Also, is there a way we can tell/send a list of spids/txns which caused the log file to get full?
Thank you.
Sam
August 9, 2022 at 3:59 pm
What type of log backups are you taking? Normal log backups or do you specify copy_only on your log backups?
Alerting sooner is always a great idea.
Restricting auto growth probably not so much.
You need to find the cause of the large log ideally and resolve that.
I have seen people do copy only log backups on their AGs and this doesn’t mark the log as reusable and then get into this exact issue you have mentioned.
verify the backups are actually running still and are not failing.
you won’t be able to see what caused the log to bloat any more but if you do have your log backups you could use something like fn_dump_dblog to see what is in the log and try and see who was doing what and when they where doing it etc.
August 10, 2022 at 6:15 pm
A couple more points:
The choice to try and solve this problem by failing over just increased the issues. In this type of situation - the better option would have been to remove that database from the AG. Cleanup the transaction log and resize back to normal and then add the database back to the AG (using automatic seeding - a 20GB database will take just a couple of minutes).
The recovery time for the system was extended because of the large transaction log.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
August 10, 2022 at 11:25 pm
The choice to try and solve this problem by failing over just increased the issues. In this type of situation - the better option would have been to remove that database from the AG. Cleanup the transaction log and resize back to normal and then add the database back to the AG (using automatic seeding - a 20GB database will take just a couple of minutes).
The recovery time for the system was extended because of the large transaction log.
What is meant by Cleanup the transaction log after removing the db from AG?
August 11, 2022 at 7:33 am
Jeffrey Williams wrote:The choice to try and solve this problem by failing over just increased the issues. In this type of situation - the better option would have been to remove that database from the AG. Cleanup the transaction log and resize back to normal and then add the database back to the AG (using automatic seeding - a 20GB database will take just a couple of minutes).
The recovery time for the system was extended because of the large transaction log.
What is meant by Cleanup the transaction log after removing the db from AG?
This means removing the database from the AG, and shrinking the log by whatever means necessary, backup and shrink, set simple and shrink etc. Then set back to full so another full and log backup and add it back to the AG.
But you need to find the source of the log growth otherwise it will happen again and again.
August 11, 2022 at 12:16 pm
Got it. Thanks everyone for all your inputs.
What is meant by Cleanup the transaction log after removing the db from AG?
This means removing the database from the AG, and shrinking the log by whatever means necessary, backup and shrink, set simple and shrink etc. Then set back to full so another full and log backup and add it back to the AG.[/quote]
Correct - but... There is no reason to change recovery model to shrink a log file - shrink the log, backup log, shrink - repeat until it is down to the size needed for normal operations.
With automatic seeding enabled - there is no need to backup the database and log. Automatic seeding performs the backup/restore for you automagically.
Even if the database was a multi-terabyte database, removing the database - fixing the transaction log - adding it back would have still been a better option. Using automatic seeding and enabling the trace flag to compress the data stream and that process could have been completed in less time (depending on network connectivity - of course).
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply