July 6, 2014 at 12:32 am
I have a sql server where alwayson is configured between the instances. Recently we migrated a database here which is very small of 500mb but it started blotting waits. The top most is redo_thread_pending_work.
I thought this might be because of sync alwayson which I changed to async but I dont see much benefit.
I googled and googled but not finding any relevant link to see the details of this thread. I understand its waiting for some redo work, but want to understand what work is that and how to troubleshoot and resolve.
Please help.
Here are the result I got from Paul's wait_Stats query Paul's http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/%5B/url%5D
WaitType Wait_S Resource_S Signal_S WaitCountPercentage AvgWait_SAvgRes_SAvgSig_S
REDO_THREAD_PENDING_WORK186502.00186300.73 201.26 191456995.84 0.09740.09730.0001
Thanks
----------
Ashish
July 6, 2014 at 9:51 am
crazy4sql (7/6/2014)
I have a sql server where alwayson is configured between the instances. Recently we migrated a database here which is very small of 500mb but it started blotting waits. The top most is redo_thread_pending_work.I thought this might be because of sync alwayson which I changed to async but I dont see much benefit.
I googled and googled but not finding any relevant link to see the details of this thread. I understand its waiting for some redo work, but want to understand what work is that and how to troubleshoot and resolve.
Please help.
Here are the result I got from Paul's wait_Stats query Paul's http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/%5B/url%5D
WaitType Wait_S Resource_S Signal_S WaitCountPercentage AvgWait_SAvgRes_SAvgSig_S
REDO_THREAD_PENDING_WORK186502.00186300.73 201.26 191456995.84 0.09740.09730.0001
Thanks
The size of the database doesn't matter at all for mirroring (or other transaction-consuming stuff like replication). It is the volume and size of DML that matters. I could generate a terabyte of tlog activity per day on a 10MB database very easily.
1) I would use things like dbcc sqlperf(logspace) to see the size and percent full of each database's tlog.
2) Did this just start showing up after you added in the new database? If so, after doing all the forensics you can consider moving it out of the AG and see if the problem goes away.
3) Profile activity on the database to see what is happening to it?
4) Use fn_dblog to see actual tlog guts.
5) Check sys.databases for log_reuse_wait information
6) Are you looking at RAW waits or doing a differential waits analysis?
7) What does the (rather simple and limited) AlwaysOn monitor GUI tell you? I would go to the source and start using the associated DMVs to see if they show anything.
8) Have you tested the network between the servers for issues? Perhaps someone changed a route table or some such and now you are sending IP packets out to Pluto and back to get to the secondary server (yes, I have seen that more than a few times at clients)!!
There are LOTS of things to evaluate that aren't up there. Good luck!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
July 7, 2014 at 5:33 am
Thanks Kevin, for taking time to look into this.
1) I would use things like dbcc sqlperf(logspace) to see the size and percent full of each database's tlog.
All dbs log files are >500mb and spaceused <10% except tempdb (which has 41%).
2) Did this just start showing up after you added in the new database? If so, after doing all the forensics you can consider moving it out of the AG and see if the problem goes away.
I already tried it. First I removed this specific database from AG and saw no improvement. Then I tried changing the commit to async instead of sync but this also not give me visible benefits.
3) Profile activity on the database to see what is happening to it?
I run the trace filtering on this specific db and found that it runs some set of select/inset queries. And after completion of each set, it runs sp_reset_connections. Is this something to investigate further from application side?
4) Use fn_dblog to see actual tlog guts.
I did count on operations and below is the detail :-
operation Count
---------------------------------
LOP_INSERT_ROWS 5784
LOP_FORMAT_PAGE 2029
LOP_LOCK_XACT 1682
5) Check sys.databases for log_reuse_wait information
This is bit surprising. This database is configured via logship to DR site and every 15 min log backups are running but still for this specific database, it always shows "log_backup" in log_reuse_ait_desc. I tried to see if there is any open transaction using "dbcc opentran" but there are no open transaction. Not sure why it still always shows "log_backup" pending. Anything suspicious you see here?
6) Are you looking at RAW waits or doing a differential waits analysis?
I am using Paul's query (which i mentioned in original post) and capturing the waits every 5 min. And on every occurrence the top wait is "REDO_THREAD_PENDING_WORk".
7) What does the (rather simple and limited) AlwaysOn monitor GUI tell you? I would go to the source and start using the associated DMVs to see if they show anything.
I checked the Alwayson monitor and dmvs (dm_hadr_database_replica_states and availability_replicas)to check status and it shows healthy and "log_send_queue_size" as well "redo_queue_size" is always 0.
8) Have you tested the network between the servers for issues? Perhaps someone changed a route table or some such and now you are sending IP packets out to Pluto and back to get to the secondary server (yes, I have seen that more than a few times at clients)!!
I did basic test by running some set of queries (select,update) from db server(local) and from app/web server but didn't saw any difference in completion. Both side the query completion is <1 second.
----------
Ashish
July 7, 2014 at 6:31 am
crazy4sql (7/7/2014)
Thanks Kevin, for taking time to look into this.1) I would use things like dbcc sqlperf(logspace) to see the size and percent full of each database's tlog.
All dbs log files are >500mb and spaceused <10% except tempdb (which has 41%).
2) Did this just start showing up after you added in the new database? If so, after doing all the forensics you can consider moving it out of the AG and see if the problem goes away.
I already tried it. First I removed this specific database from AG and saw no improvement. Then I tried changing the commit to async instead of sync but this also not give me visible benefits.
3) Profile activity on the database to see what is happening to it?
I run the trace filtering on this specific db and found that it runs some set of select/inset queries. And after completion of each set, it runs sp_reset_connections. Is this something to investigate further from application side?
4) Use fn_dblog to see actual tlog guts.
I did count on operations and below is the detail :-
operation Count
---------------------------------
LOP_INSERT_ROWS 5784
LOP_FORMAT_PAGE 2029
LOP_LOCK_XACT 1682
5) Check sys.databases for log_reuse_wait information
This is bit surprising. This database is configured via logship to DR site and every 15 min log backups are running but still for this specific database, it always shows "log_backup" in log_reuse_ait_desc. I tried to see if there is any open transaction using "dbcc opentran" but there are no open transaction. Not sure why it still always shows "log_backup" pending. Anything suspicious you see here?
6) Are you looking at RAW waits or doing a differential waits analysis?
I am using Paul's query (which i mentioned in original post) and capturing the waits every 5 min. And on every occurrence the top wait is "REDO_THREAD_PENDING_WORk".
7) What does the (rather simple and limited) AlwaysOn monitor GUI tell you? I would go to the source and start using the associated DMVs to see if they show anything.
I checked the Alwayson monitor and dmvs (dm_hadr_database_replica_states and availability_replicas)to check status and it shows healthy and "log_send_queue_size" as well "redo_queue_size" is always 0.
8) Have you tested the network between the servers for issues? Perhaps someone changed a route table or some such and now you are sending IP packets out to Pluto and back to get to the secondary server (yes, I have seen that more than a few times at clients)!!
I did basic test by running some set of queries (select,update) from db server(local) and from app/web server but didn't saw any difference in completion. Both side the query completion is <1 second.
A) If you removed the database from AG and saw no improvement, then this database obviously cannot be part of the problem.
B) You need to do DIFFERENTIAL waits, and/or clear them using DBCC SQLPERF(...) command. Note if you clear them you will reset everything back to zero, which may not be desirable. You can use track_waitstats_2005 (and it's reporting component) to do this. See the SQL Server 2005 Waits and Queues white paper. It is quite possible that you HAD something in the past (possibly LONG ago) that caused the REDO_THREAD_PENDING_WORK wait to grow to a massive size and it is still to this day the largest value cumulatively speaking. That would mean it is NOT a current problem and you are simply chasing a non-issue.
C) sp_reset_connection is normal and should be filtered out of pretty much every Profiler run you ever do.
D) If you have done/are doing regular full and tlog backups as part of a log shipping setup then reuse of the tlog would not be blocked by tlog backups not being done UNLESS someone had disabled CHECKPOINT. If that is the case you likely have worse problems. :w00t:
With all tlog used percentage being single-digit I am apt to go with there is no problem here. You are just seeing the remnants of a severe problem in the past.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
July 7, 2014 at 10:53 am
Thanks again Kevin. Yes, I figured out sp_reset_Connection is normal in trace and shouldn't be concern here. I was looking every difference with double glass :w00t::w00t:
Also checkpoint is enabled. I am sure if this would be disabled, I shouldn't be DBA :-P:-P:-P
So the latest update for you, I somehow managed to do a quick vip refresh, which cleared this wait stats.
Now the "REDO_THREAD_PENDING_WORK" is no more in top list. The current top waits are as below (just keeping you updated)
WaitType Wait_S Resource_SSignal_SWaitCount PercentageAvgWait_S AvgRes_SAvgSig_S
=======================================================================================================
PREEMPTIVE_9171363.96 9171363.96 0.00 1 49.72 9171363.95809171363.95800.0000
SP_SERVER_
DIAGNOSTICS
-------------------------------------------------------------------------------------------------------------------------------------
PREEMPTIVE_9171344.61 9171344.61 0.00 1 49.72 9171344.60909171344.60900.0000
HADR_LEASE_
MECHANISM
Hope the situation improve. 🙂
But I am still curious to find out why the "log_backup" still showing in "log_reuse_wait_desc" while there is log backup every 15 min and checkpoint is enabled. If you can shed any light on this so that I can investigate this as well.
Thanks again for your time..
----------
Ashish
July 7, 2014 at 1:07 pm
No problem is a good problem to have!! 😀
My guess is the log reuse is simply because you have log shipping enabled and are getting a fair amount of activity between 15 minute intervals. This could well be normal and not a thing to worry about (that is my strong suspicion). Watch dbcc sqlperf(logspace) over time and see how full the tlogs get.
BTW, I hope you have tlogs sized appropriately using Kimberly Tripps guidance? Autogrowths should be an exception, not the basis for controlling file sizes (this goes for data too!!). And growth factors should be explicitly set as well. Data growth of 1MB is DEVASTATINGLY bad, and 10% on tlog files is only HORRIBLY bad. :hehe:
Go find other stuff to fix once the above are dealt with.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply