November 22, 2012 at 3:53 am
Hi,
We've recently upgraded to Windows 2008 R2, SQL Server 2008 R2 SP + CU4, multi-instance cluster. 32 cores (64 with h/t), 384GB per node. Maxdop 4 server wide, 320GB max server memory. It used to be Win2k3 R2, SQL Server 2005 SP3, 2 x single instance clusters - 8 cores, 96GB RAM, Maxdop 0 server wide.
We have 6 databases on this server, the main one being an Enterprise Data Management Platform that munges data from lots of different sources. It's a 3rd party product that has something like 18000 encrypted stored procedures, 12000 tables, 3000 views. Good, eh ;-). There are triggers on a lot of these tables that send insert/update/deletes to a local staging database, then on to a repository database across a linked server.
For the past few years we have been using a SQL monitoring product (name witheld to protect the innocent) against out production cluster (multi-instance) without any problems. However, on the day we upgraded, when we turned on a particular feature that monitors the types of queries being executed against the server, some data load went from completing in 15 mins, to taking 14 hrs to run. When we turned off this feature, it returned to normal. No other databases appear to be affected.
We contacted the vendor of the monitoring software and found out in essense that this feature runs a SQL trace capturing the following events: SQL:BatchCompleted, SQL:StmtCompleted, SP:Completed for events with a run duration of greater than 5000ms. Sure enough, just running the trace against the server whilst processing data had exactly the same effect as using the query monitoring feature of the software. So, I think we say that although the monitoring software causes the problem, it does so only by leveraging a feature native to SQL Server.
The EDM vendor was aware of the conflict between their product and the monitoring software but have been unable to explain the cause, or reproduce it in their environment. They've also tried against a version of the DB that doesn't have any encrypted procedures (I don't have that copy).
During the slow down, I query sys.dm_exec_requests and all of the sessions associated with the hung jobs have a RUNNABLE status with a 'last_wait_type' of SOS_SCHEDULER_YIELD. There's nothing in sys.dm_os_waiting_tasks. No evident blocking. CPU's are completely idle. I tried changing MAXDOP settings, affinity, hyperthreading, etc. as well (to make it more like the old system), but no effect. As soon as I stop the trace, everything resumes at normal speeds and runs through to completion.
These are the top 10 waitstats:
wait_typewait_time_mssignal_wait_time_msresource_wait_time_mspercent_total_waitspercent_total_signal_waitspercent_total_resource_waits
PREEMPTIVE_OS_WRITEFILEGATHER4729935085535490.3086820897807380.0000000000000000.308682089780738
BACKUPTHREAD4721076246686204460.3111852748087840.0011889896950610.311096281335619
ASYNC_IO_COMPLETION1543967532187974720.6783859142476040.0002565054816590.678366715331252
OLEDB10197830935093043.3745814015518420.0000000000000003.374581401551842
LCK_M_IX1921117074580010.0165539408307750.0003408822848370.016528426481413
MSQL_DQ167309020999020.0757816594836540.0000000000000000.075781659483654
LCK_M_IS941286913383340.0483005399289590.0000332685681090.048298049843943
LCK_M_U867083162701950.0097622519948940.0001523603988800.009750848127286
LCK_M_S7722316992452450.0089117616534150.0008191782205630.008850447820931
LCK_M_X6158212072416010.0087625009052360.0005819588653440.008718942461557
Any ideas what else I might check? Like I said, this used to work fine under SQL 2005/Win2k3, with the same version of the monitoring software.
Thanks,
Andy
November 23, 2012 at 7:56 am
looks specifically to be sp:Completed that's causing the problems. Any ideas?
November 23, 2012 at 8:02 am
adb2303 (11/22/2012)
Windows 2008 R2, SQL Server 2008 R2 SP + CU4, multi-instance cluster.
:w00t:
adb2303 (11/22/2012)
32 cores (64 with h/t),
:w00t:
adb2303 (11/22/2012)
384GB per node.
:w00t:
adb2303 (11/22/2012)
Maxdop 4 server wide
??
Why dont you just pull all the CPUs back out except for the one socket in that case.
Why have you installed 32 cores and then put a choker around SQL Servers neck?
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
November 23, 2012 at 8:04 am
weeks of testing demonstrated that MAXDOP 4 was most performant for our workload
November 23, 2012 at 8:09 am
you never considered raising the Cost threshold for parallelism instead?
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
November 23, 2012 at 8:14 am
yes. We still achieved more consistent results with setting maxdop.
November 27, 2012 at 3:09 am
anyone seen this before? It looks specifically to be the SP:Completed event that causes the delays. I suspect it's to do with the encrypted procedures, but 2005 worked okay.
November 27, 2012 at 5:13 am
if i were you Firtly i make maxdop 1 and i set automaticly set processor affinity
and automaticly set i/o affinity
November 27, 2012 at 6:35 am
Hi,
I've already tried a variety of maxdop and affinitiy options, to no avail. I can reproduce this problem in our test environments, which only have 2 vCPUs...
November 27, 2012 at 6:37 am
I think it is related your encreptyted procedures
November 27, 2012 at 6:41 am
i'm leaning towards that, too. But, what's different in 2008 R2? It worked running under 2005
November 27, 2012 at 6:44 am
Have you tried using Extended Events instead of an old-style trace? Does that do the same thing?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
November 27, 2012 at 6:49 am
I haven't, but only because I'm trying to mimic the behaviour of this 3rd party monitoring software. It uses ye olde style trace under the hood, and I'm not going to get that functionality changed.
I'll look into it though
November 27, 2012 at 6:53 am
adb2303 (11/27/2012)
I haven't, but only because I'm trying to mimic the behaviour of this 3rd party monitoring software. It uses ye olde style trace under the hood, and I'm not going to get that functionality changed.I'll look into it though
Yeah, you may not be able to use Extended Events in your production environment if a trace is what your monitoring software uses. But it'll still help eliminate a variable in the whole thing.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply