September 25, 2018 at 7:18 am
Hi All
I have a 4 node AlwaysOn environment running the latest service pack but not on the latest cumulative update as yet. We've been running this environment a few years now. We are now having an issue when failing over to any of the secondary's were we see a performance dip (high CPU) which we can then only resolve by updating statistics on certain large and heavily used tables.
As mentioned this environment has been running for a few years and we've never had this issue before. The only noticeable change is the size of the data. Dbs that were initially around 400GB have more than doubled.
In-terms of statistics maintenance...we have daily and weekly jobs for them (also we only see this performance issue after a failover)
If I understand MS documentation correctly,this should not be the case because statistics update transactions on the primary should be replicated to the secondary's. After a failover I can see a difference between the tables statistics last updated dates.
Just curious if there is anyone else experiencing this.
September 25, 2018 at 3:36 pm
Denesh Naidoo - Tuesday, September 25, 2018 7:18 AMHi AllI have a 4 node AlwaysOn environment running the latest service pack but not on the latest cumulative update as yet. We've been running this environment a few years now. We are now having an issue when failing over to any of the secondary's were we see a performance dip (high CPU) which we can then only resolve by updating statistics on certain large and heavily used tables.
As mentioned this environment has been running for a few years and we've never had this issue before. The only noticeable change is the size of the data. Dbs that were initially around 400GB have more than doubled.
In-terms of statistics maintenance...we have daily and weekly jobs for them (also we only see this performance issue after a failover)
If I understand MS documentation correctly,this should not be the case because statistics update transactions on the primary should be replicated to the secondary's. After a failover I can see a difference between the tables statistics last updated dates.
Just curious if there is anyone else experiencing this.
Sounds like you could be running into this issue:
FIX: Bad query plan created on secondary replicas after statistics updated via FULLSCAN option on primary replica in SQL Server 2012 or 2014
Sue
September 25, 2018 at 11:01 pm
Thanks Sue_H
I have come across that link...but I understood it a different way...issues related to queries that are run on secondary server (for eg: if you have redirected your reporting queries off the primary onto the secondary). It doesn't mention anything related to when the secondary becomes the new primary.
However I see the fix they talk about is in CU5 and I only have CU4 so it's worth a try.
Will check this out and get back.
Thanks again
September 27, 2018 at 6:57 am
I can't explain the differences in the statistics (and I've reached out to some people to see if someone can). However, I can explain performance hits on failover. You're going to have to compile every single query after failover because they won't be in memory, explaining a very high CPU depending on your system because compiling plans is expensive. You're also going to have to warm up the cache because none of the data that's been living there in the primary will be there in the secondary. You may also occasionally see performance differences because the stats in the secondary that are used to compile a plan, while being the same stats as in the primary, are newer than the stats used to compile the plan in the primary (stats changed in the primary but the threshold for a recompile event was not reached). Any or all of these things will explain lots of performance problems on a failover.
Again though, a literal difference in the statistics, I'm at a loss. We'll see what the people I reached out to say.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
September 27, 2018 at 9:00 am
As much as anything this can be a (albeit odd) parameter sniffing issue. Your compiled plan on the old primary is likely happy and full of good things for your workload, even as things evolve. When you failover the stats are out of date (but have never hit the limit to cause them to be rebuilt), so your query comes along, it goes to compile a new plan, the stats don't support it well, and so it goes off the rails in an unexpected way. This is particularly applicable when you are using dates or incremental keys in the predicates.
To evaluate if this is the case I would grab the plans off of each server for the procs that are going sideways and compare them. See where things are different. It's likely that an alternate index is being used on the new replica prior to your stats rebuild.
October 4, 2018 at 1:12 am
Grant Fritchey - Thursday, September 27, 2018 6:57 AMI can't explain the differences in the statistics (and I've reached out to some people to see if someone can). However, I can explain performance hits on failover. You're going to have to compile every single query after failover because they won't be in memory, explaining a very high CPU depending on your system because compiling plans is expensive. You're also going to have to warm up the cache because none of the data that's been living there in the primary will be there in the secondary. You may also occasionally see performance differences because the stats in the secondary that are used to compile a plan, while being the same stats as in the primary, are newer than the stats used to compile the plan in the primary (stats changed in the primary but the threshold for a recompile event was not reached). Any or all of these things will explain lots of performance problems on a failover.Again though, a literal difference in the statistics, I'm at a loss. We'll see what the people I reached out to say.
Thank you Grant for your response.
If I understand you correctly, after a failover we should see high CPU as all queries start to recompile...but this should start getting better.
My other concern is why are we only seeing this behavior now. As mentioned we've been running this configuration since 2015 and never had performance issues on a failover.
Just to add...we are testing the latest CU in QA. Once approved we will install on production and check again
October 4, 2018 at 1:13 am
Nicholas Cain - Thursday, September 27, 2018 9:00 AMAs much as anything this can be a (albeit odd) parameter sniffing issue. Your compiled plan on the old primary is likely happy and full of good things for your workload, even as things evolve. When you failover the stats are out of date (but have never hit the limit to cause them to be rebuilt), so your query comes along, it goes to compile a new plan, the stats don't support it well, and so it goes off the rails in an unexpected way. This is particularly applicable when you are using dates or incremental keys in the predicates.To evaluate if this is the case I would grab the plans off of each server for the procs that are going sideways and compare them. See where things are different. It's likely that an alternate index is being used on the new replica prior to your stats rebuild.
Thanks Nicholas...I will compare the plans
October 4, 2018 at 11:53 pm
Denesh Naidoo - Thursday, October 4, 2018 1:12 AMThank you Grant for your response.
If I understand you correctly, after a failover we should see high CPU as all queries start to recompile...but this should start getting better.
My other concern is why are we only seeing this behavior now. As mentioned we've been running this configuration since 2015 and never had performance issues on a failover.Just to add...we are testing the latest CU in QA. Once approved we will install on production and check again
There could be a number of factors on this. It's hard to know for certain. 3 years worth of growth and changes to your data and the usage patterns of the app could explain it. It might be due to SP/CU changes. Do you have performance tracking from 3 years ago that shows it was better or just more complaints now than before? If you have the data, drill down into the details, wait statistics especially, to see where the differences come from.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply