July 1, 2011 at 2:01 pm
Hello everyone,
I have an issue that I have been pounding my head against a wall for days on.
We have a large client DB(400 GB) that we recently moved from a good box with a dual quad (8 total CPUs) Xeon 3.00GHz box to another box with Dual 12 core (24 total CPUs) Opteron 2.1GHz procs. The performance has decreased by close to 50%. Average request time (Web application) has doubled.
The RAM, and SQL install are identical (32GB of RAM per box, allocated from 8GB to 20GB, changed during troubleshooting). SQL 2008 10.0.4000 web edition both. The only differences that I can see is a new version of windows (2003 to 2008 R2) and the CPU change.
We have been trying everything we can think of to bring performance back in line to no avail. We have tried rebuilding all existing indexes, adding suggested ones, updating stats, changing the parallelism settings and nothing has seemed to work. (Max Parrallelism currently set to 0, we have tried 1,4,8,10, and 20)
I am hoping that someone may have stumbled upon some quirk with either this version of Windows and/or increasing CPUs.
Any thoughts, I am at a loss and looking for things to try.
Thanks for any help.
July 1, 2011 at 2:23 pm
You mention many things such as CPU, memory etc, but what disk configuration did you have before, and what do you have now?
Mike
July 1, 2011 at 3:01 pm
I am interested to know the disk subsystem differences too.
Also, with MAXDOP now set to 0 what do your wait stats look like? And what was MAXDOP on the old server?
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
July 1, 2011 at 3:41 pm
The disk subsystem is presented storage from a Dell EQ array. Both, the previous and current setups were on this, and seek times are under 10ms, so we have more or less ruled out Disk as root cause.
For MAXDOP, all our queries use default and system also does too.
WAIT stats do not look terribly bad, but I don't have a good way of seeing them over time currently? can someone suggest one?
We see sporatic times where CXPACKET times show up, but usually only for a second or two. Other times we see random PAGEIOLATCH_SH waits as well, but not enough to really see a root cause. The vast majority of queries don't show wait times and are runnable or running.
Thanks again for any pointers.
July 2, 2011 at 5:58 am
Also wanted to add that we are seeing the CPUs (average of all 24) spike to around 50 - 70% sporatically throughout the work day. The spikes generally last 5 - 15 seconds.
July 2, 2011 at 6:57 am
Since no other suggestions are comming through and you seem to have covered all the basics, I'd go through this.
When filtering the results go by CPU and you should find the top cpu hogs in the system. Hopefully that'll give you an idea of where to start digging your way out of this one.
July 2, 2011 at 8:13 am
as2higpark (7/2/2011)
Also wanted to add that we are seeing the CPUs (average of all 24) spike to around 50 - 70% sporatically throughout the work day. The spikes generally last 5 - 15 seconds.
What is your max memory setting in the new instance? I have seen people forget to set that on a new instance (including me) and have CPU problems when SQL starts fighting with Windows over memory during peak usage times.
Ignoring the new hardware scenario some random thoughts:
> the articles Ninja's posted are a good place to look
> it could also be a few bad plans might in play, or maybe some bad queries in general (scans or implicit data type conversions)
> Parallelism could be set too high...although you said there are not a lot of CXPACKET waits...you have to capture those wait stats periodically and do a diff for them to be meaningful. You can clear them after you capture them too[/url]. They also get reset when the instance restarts.
> Has any new code gone in since you've been on the new hardware?
> Have you done any stats or index fragmentation maintenance since you got onto the new hardware?
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
July 2, 2011 at 10:49 pm
as2higpark (7/2/2011)
Also wanted to add that we are seeing the CPUs (average of all 24) spike to around 50 - 70% sporatically throughout the work day. The spikes generally last 5 - 15 seconds.
Do you have a virus scanner on the system?
--Jeff Moden
Change is inevitable... Change for the better is not.
July 3, 2011 at 6:45 am
Thanks Guys,
I will try to setup the some traces on Tuesday, during business hours (only time we really see bad perf).
As far as the memory settings, we were initially running at 8GB allocated, but we have since bumped up to 20GB for that instance. Virus scanning is a no on this machine.
As far as capturing wait states periodically, I use a "What's running now" type of query to see a snapshot of current usage, but does anyone have a good one to use to capture wait states over time?
July 3, 2011 at 4:31 pm
check the wait type of the running statements when things are spiking in CPU... if you seed a lot of CMEMETHREAD waits then you could be hitting a bug in SQL2008.... there's an issue when you get past 16 CPUs with the way plan get inserted into the plan cache. It's been fixed by a recent hot fix as of March.
If you're seeing plenty of CMEMTHREAD waits (you can see them in Activity Monitor or sys.dm_exec_requests), then you'll need to get yourself to the latest CU to resolve the issue, or drop the server to 16 CPUs or less.
Eddie Wuerch
MCM: SQL
July 3, 2011 at 4:52 pm
Why don't you use sysmd.os_wait_stats to look at the stats since the instance was restarted. Maybe grab the results from running it in the morning perhaps before usage picks up on the instance, those results won't mean a lot at first. Then run this:
DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR)
Then, re-run the query at the end of the same day and look at your stats. Look at the top stats in particular.
I've used this query from Glenn Berry's diagnostic sql script I've used a bunch in the past:
http://sqlserverperformance.wordpress.com/2011/06/30/july-2011-version-of-sql-server-2008-diagnostic-queries/[/url]
WITH Waits AS
(SELECT wait_type, wait_time_ms / 1000. AS wait_time_s,
100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct,
ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK',
'SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR', 'LOGMGR_QUEUE','CHECKPOINT_QUEUE',
'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','BROKER_TO_FLUSH','BROKER_TASK_STOP','CLR_MANUAL_EVENT',
'CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE', 'FT_IFTS_SCHEDULER_IDLE_WAIT',
'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
'ONDEMAND_TASK_QUEUE', 'BROKER_EVENTHANDLER', 'SLEEP_BPOOL_FLUSH'))
SELECT W1.wait_type,
CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2
ON W2.rn <= W1.rn
GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct
HAVING SUM(W2.pct) - W1.pct < 99 OPTION (RECOMPILE);
July 3, 2011 at 5:06 pm
agreeing with Patrick in a big way (wait stats is the key to, um, well, everything)
If you see lots of CMEMTHREAD (not a normal wait) then it's off to patch land. Otherwise, please report back what you're seeing as far as wait stats, and we can take a look.
Eddie Wuerch
MCM: SQL
July 4, 2011 at 5:17 am
While it may be an overly simplistic view of the situation, consider this: Each thread, in isolation, can run on a single core, which is now clocked at 2.1 GHz. On the previous machine it was clocked at 3.0 GHz. So, if a particular amount of CPU work is not taking advantage of having more cores, it's going to run slower. By the numbers alone, ignoring the differences in CPU architecture, you're looking at about a 33% longer execution time.
Now of course there are many other factors, but the slower cores might be at least part of the reason why you're seeing reduced performance.
July 4, 2011 at 6:40 am
Have you checked the power management settings, possibly even at the bios level. Whilst I've never experienced it myself, I've seen several articles about power management reducing the CPU speed during low load and not increasing the speed withought a sustained amount of high CPU usage. If the CPUs are already slower, running at half speed as well wouldn't be good.
July 4, 2011 at 6:42 am
Thanks everyone for the help on the holiday weekend.
I was able to run the query provided to see the cumulative WAIT stats since last restart. Which has been the whole time of seeing the issue, so this should be a good chunk of data. The first 5 rows are below, it looks like we might have a winner with CXPACKET, but this leads me back to parallelism, correct? I have tried almost any number of setting on parallelism, should I possibly go so far as to limit CPUs to the SQL instance?
wait_type wait_time_spct running_pct
CXPACKET 3885697.2156.42 56.42
SOS_SCHEDULER_YIELD 1121922.5716.29 72.71
PAGEIOLATCH_SH 574150.648.34 81.05
FT_IFTSHC_MUTEX 239398.763.48 84.53
LATCH_EX 224723.243.26 87.79
Viewing 15 posts - 1 through 15 (of 94 total)
You must be logged in to reply to this topic. Login to reply