AG SQL 2016 - Large Redo Que

Question

AG SQL 2016 - Large Redo Que

curious_sqldba

SSC-Dedicated

Points: 36494
More actions
October 17, 2018 at 10:52 pm

#367352

I understand there are some well documented Microsoft articles to troubleshoot regarding large redo ques. I have two sql server in 2016 with AG set up. At night redo que size of secondary gets pretty like close to 500 GB and the redone time in AG dashboard shows it is about an hour behind.I kinda know the reason, we have some ETL processes going on primary during that time, however those processes are in batches, also i have t-logs are executing every 5 mins. I have trace flag 3459 enabled on secondary replica, from my observation serial redo seems to be better than parallel. Does any one have a better idea to shorten this redo que?

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Sue_H SSC Guru Points: 90860 More actions · Answer 1

curious_sqldba - Wednesday, October 17, 2018 10:52 PM
I understand there are some well documented Microsoft articles to troubleshoot regarding large redo ques. I have two sql server in 2016 with AG set up. At night redo que size of secondary gets pretty like close to 500 GB and the redone time in AG dashboard shows it is about an hour behind.I kinda know the reason, we have some ETL processes going on primary during that time, however those processes are in batches, also i have t-logs are executing every 5 mins. I have trace flag 3459 enabled on secondary replica, from my observation serial redo seems to be better than parallel. Does any one have a better idea to shorten this redo que?

Hi Curious -
A couple of thoughts - Are you on the latest SP/CU? It seems like there have been redo fixes on most of the service packs, CUs on 2016.
One other thought - did you check if the network is what's being hammered during this time? I've seen that be an issue on the "over the night" kind of slow down, especially if you are doing backups across the network.

Sue

curious_sqldba SSC-Dedicated Points: 36494 More actions · Answer 2

Sue_H - Thursday, October 18, 2018 12:49 PM
curious_sqldba - Wednesday, October 17, 2018 10:52 PM
I understand there are some well documented Microsoft articles to troubleshoot regarding large redo ques. I have two sql server in 2016 with AG set up. At night redo que size of secondary gets pretty like close to 500 GB and the redone time in AG dashboard shows it is about an hour behind.I kinda know the reason, we have some ETL processes going on primary during that time, however those processes are in batches, also i have t-logs are executing every 5 mins. I have trace flag 3459 enabled on secondary replica, from my observation serial redo seems to be better than parallel. Does any one have a better idea to shorten this redo que?
Hi Curious -
A couple of thoughts - Are you on the latest SP/CU? It seems like there have been redo fixes on most of the service packs, CUs on 2016.
One other thought - did you check if the network is what's being hammered during this time? I've seen that be an issue on the "over the night" kind of slow down, especially if you are doing backups across the network.
Sue

I am on SP2+CU2, i am behind a CU and i think they came with SP3 also. I haven't gotten my hands dirty ,ucj on AG but here is my thought. I see the " Hardened Time" on AG dashboard on the replica's shows it is not behind, that tells me the replica has received the log. However " Commit Time", "Redone Time" shows me that i am 30 - 40 mins behind and obviously "Redo Que" size is pretty big like 50 GB. Does that make sense.

Sue_H SSC Guru Points: 90860 More actions · Answer 3

curious_sqldba - Thursday, October 18, 2018 9:32 PM
Sue_H - Thursday, October 18, 2018 12:49 PM
curious_sqldba - Wednesday, October 17, 2018 10:52 PM
I understand there are some well documented Microsoft articles to troubleshoot regarding large redo ques. I have two sql server in 2016 with AG set up. At night redo que size of secondary gets pretty like close to 500 GB and the redone time in AG dashboard shows it is about an hour behind.I kinda know the reason, we have some ETL processes going on primary during that time, however those processes are in batches, also i have t-logs are executing every 5 mins. I have trace flag 3459 enabled on secondary replica, from my observation serial redo seems to be better than parallel. Does any one have a better idea to shorten this redo que?
Hi Curious -
A couple of thoughts - Are you on the latest SP/CU? It seems like there have been redo fixes on most of the service packs, CUs on 2016.
One other thought - did you check if the network is what's being hammered during this time? I've seen that be an issue on the "over the night" kind of slow down, especially if you are doing backups across the network.
Sue
I am on SP2+CU2, i am behind a CU and i think they came with SP3 also. I haven't gotten my hands dirty ,ucj on AG but here is my thought. I see the " Hardened Time" on AG dashboard on the replica's shows it is not behind, that tells me the replica has received the log. However " Commit Time", "Redone Time" shows me that i am 30 - 40 mins behind and obviously "Redo Que" size is pretty big like 50 GB. Does that make sense.

I think you're only the one CU behind which does have one fix.
What you are saying makes sense - Are you running in asynchronous commit mode? You always have some lag with that.
And getting used to the log sizes seems to be part of AOAG.....

Sue

ImYourDBA SSCertifiable Points: 5654 More actions · Answer 4

One thing that happened to a client of mine is that their "Power Mode" in the host BIOS (was a Hyper-V VM) was set to balanced. If the server has periods of time where there aren't a lot of resources being consumed, the host can reduce the power to those resources (NIC, CPU, HD Controller, etc.) and then has a hard time distinguishing that the server is now in a period of need and never adjusts the resources back up. Make sure that the BIOS settings of the VM host, or the physical machine that you are running SQL on are set to Maximum Performance (and you might as well make sure it's set in the OS as well, just to be safe).

Other than that, what is your Redo Rate on the secondary? If the send queue and send rate empty quickly but the redo rate is low, then that is usually an indication of disk contention. Make sure that your infrastructure guys aren't running a defrag or something crazy like that during that "late night" maintenance cycle. I've seen crazier things happen. Maybe check to see what your highest wait types on the secondary are during that time (sys.dm_os_wait_stats).

Owner & Principal SQL Server Consultant
Im Your DBA, Inc.
https://www.imyourdba.com/