November 10, 2014 at 10:31 am
I'm having some issues with an active/active SQL 2008R2 cluster that I'd like some thoughts on.
The system is as follows:
OS: Windows Server 2008 R2(x64)
SQL: 2008 R2 Enterprise SP2, (x64)
64 Procs, 768 GB memory per node.
Min memory is set to 256 GB, Max Memory set to 350GB.
Issue:
After a fail-over (for patching or other maintenance), withing 48 hours one of the instances will go unavailable until such time that it grabs memory up to its max setting (350GB per instance). The system will grab 150-250 GB of memory over a ~30 minute period until it reaches the max memory setting. During this time-frame the server is up, however severe blocking causes it to be very sluggish and unavailable for most application traffic. The connections timeout before the results are returned. A simple sp_who2 may take upwards of a minute to return results.
Once it gets to its max memory setting it returns to normal, the db engine resolves the blocking and everything is ok until the next maintenance window.
My theory is that a poorly performing query causes the system to grab resources until it hits its max. During this time-frame a DBA typically intercedes and kills the offending spid(s), however it never completes the rollback until the max memory setting is reached.
Also, we end up with Latchwait timeouts of type 2 and 4 for class: 'ACCESS_METHODS_DATASET_PARENT'
and Buffer Latch timeouts of type 2 for tempdb and type 4 for MSDB.
From what I've read those basically indicate a resource issue of some kind. CPU utilization ramps up to 60%, but i/o is fairly low, never really going above 1500 iops. This system pushes over 50K iops at times.
With the exception of lowering the max memory setting does anyone have any thoughts on possible causes and/or ways to remediate this issue?
Thanks in advance,
-Luke.
November 11, 2014 at 4:09 am
what trace flags do you have set on the service startup?
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
November 11, 2014 at 12:04 pm
Those latches are to do with parallel table scans. Two steps I would suggest
- Increase cost threshold for parallelism (especially if it's at the default of 5) and consider changing maxdop to be 1, 1.5 or 2x the number of physical cores per NUMA node
- Tune your workload to reduce the number of unnecessary, large table scans.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 11, 2014 at 12:21 pm
Thanks for the replies... we don't have any non default trace flags turned on to my knowledge.
We are looking at turning on 1118 and or 834 as well as going from 8 to 12 temp dB files...
Unfortunately I can't really tune the workload we believe to be the issue because it's a vendor application. We opened a ticket with them but dont expect much of a response. There are about 120 databases on the two instances(active active) all competing for resources.
Since it only occurs once after a fail over until the next fail over I'm curious why all of a sudden things are fine once it hits it's max memory. And more importantly why the problematic spid won't roll back... they are typically just a select.
It'd be nice to have a way to stop the issue rather than waiting for it to hit max mem.
Since its a parralelism wait is there any way you know of to force the issue to make it repeatable. .. as yet we have been unable to replicate the issue in test(it's a much smaller system). I'll see what I can do about getting the max dop lowered... or turning on resource governor for at least that application.
Thanks again.
-Luke.
November 11, 2014 at 12:29 pm
Luke L (11/11/2014)
Since its a parralelism wait is there any way you know of to force the issue to make it repeatable. ..
Not offhand, and gut feel is that you won't be able to repo at smaller scale easily. These kind of problems are a function of hardware size and load. Enough cores that things parallel wildly, enough large tables to make the table scans parallel massively and take time and enough load to cause the contention.
Traceflag 1118 is not a bad idea, but I don't expect it'll help you here. It helps with allocation contention in TempDB. Won't hurt, nor will the multiple tempDB files. Though if you have 8 files, don't increase more unless you're seeing PageLatch_UP waits (not pageIOLatch) on the allocation pages in TempDB (pages 1, 2 and 3 in each file)
Do NOT turn traceflag 834 on! There is nothing in your post which suggests it's a good idea and there are massive side effects.
http://blogs.msdn.com/b/psssql/archive/2009/06/05/sql-server-and-large-pages-explained.aspx
Large page for the buffer pool is definitely not for everyone. You should only do this for a machine dedicated to SQL Server (and I mean dedicated) and only with careful consideration of settings like ‘‘max server memory’. Furthermore, you should test out the usage of this functionality to see if you get any measureable performance gains before using it in production.
SQL Server startup time can be significantly delayed when using trace flag 834.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 12, 2014 at 8:50 am
OK, thanks for the additional insights. I'll work on getting the cost for parallelism threshold and max dop changes through testing and our change control process. Unfortunately, we can really only test this during monthly scheduled maintenance periods.
I'll post back with any testing results and final resolution when we get one.
Thanks again,
-Luke.
November 12, 2014 at 11:30 am
Before you change maxdop, could you look through the startup messages in your error log and post any that reference NUMA?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 12, 2014 at 11:56 am
Gail, the only NUMA related messages are as follows:
Node configuration: node 0: CPU mask: 0x00000000ffff0000:0 Active CPU mask: 0x00000000ffff0000:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
Node configuration: node 1: CPU mask: 0x000000000000ffff:0 Active CPU mask: 0x000000000000ffff:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
Node configuration: node 2: CPU mask: 0x0000ffff00000000:0 Active CPU mask: 0x0000ffff00000000:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
Node configuration: node 3: CPU mask: 0xffff000000000000:0 Active CPU mask: 0xffff000000000000:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
There are 4 Physical CPUs with 8 cores each, Hyperthreaded to 64 total cores, for each machine.
To limit the scope of the change and testing which will be required, I was thinking of trying to apply the Max Dop suggestion to only those applications which we believe to be the problem using Resource governor. Do you see any issues with that approach?
Thanks,
-Luke.
November 12, 2014 at 12:12 pm
Beautiful. Ok, 4 NUMA nodes, each with 8 physical cores. Try maxdop at 12 to start, 8 and 16 also viable values.
Before you make assumptions about what apps are a problem, do some analysis. Trace, since this is SQL 2008. What you need to identify are wildly paralleling queries, queries running table scans.
To be honest, I'd set the above maxdop server-wide (and definitely set cost threshold server-wide) and add resource govenor for outliers that don't work well with those
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply