Performance Degradation

Question

Performance Degradation

BrerSQL

SSCommitted

Points: 1672
More actions
October 11, 2010 at 9:31 am

#241515

I do not know if this is a SQL Server issue but I am hoping that maybe someone has experienced something similar and might be able to give me some advice on how diagnose this.
For starters I have a clustered SQL Server environment running on Windows Enterprise Server 2008 r2 (64bit). We recently rebuilt the server to 2008 r2 (JULY)
8 Quad Core 2.4Ghz processors
256 GB RAM
Version SQL Server 2005 SP3CU7 (4273)... SQL Server has 100 Gig RAM dedicated to it
A couple of weeks ago we had some reports of sluggishness on our application.
We identified the dedicated database server as the source of the sluggishness but we were unable to identify a smoking gun. We realized that the problem had to be the SQL Server machine because our remote desktop connections were taking very long to connect. We could not identify any resources that were starved on the machine.
All my DBA monitors were showing a little high in values but nothing that I can say that would create the kind of degradation that we were experiencing.
CPU utilization across the board was running at 30% to 35%, maybe 5% higher than normal.
Memory looked good but I can not say that I did a good job of analysis here...I need some better monitors and I am researching that now (Any advice for researching memory issues would be appreciated)
CPU Queue Waits looked fine
Normal Blocking ocurrances
SQL Traces showed nothing completely out of the ordinary as far as response times
Overall nothing that I could see inside SQL Server that I could say was causing the issue.
We ended up failing over SQL Server to the B side of the cluster and the problem resolved itself immediately for both the appliciations accessing SQL Server and the Remote Desktop Connection accessing the primary server. We re-booted the A side of the cluster just to be safe and then failed SQL Server back to its primary location and all was fine.
Now, two weeks later we are starting to see slight signs of the same problem.
I have opened a case with Microsoft But I have not had any satisfactory results yet.
My first Question...What resources beyond the obvious should I be monitoring and how?
Second Part...Part of a performance analysis that I did last year identified one stored procedure that was using high CPU cycles and being called quite frequently. I traced the problem to a FOR XML RAW statement. One of the applications that has been seeing the highest amount of performance degradation calls this procedure often. I have called this out to the developers in hopes that we can redesign this so that we do not use the FOR XML RAW. Note...peak usage for our sytem starts around September and I am starting to see a pattern ever two weeks.
My Second question....Can frequent use of FOR XML RAW cause a SYSTEM wide issue? Maybe memory related?
I know this is a complicated issue.
Any advice would be greatly appreciated.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 1

Eric, I believe you're on the right track regarding memory.

FOR XML by itself is not going to the problem, any more then any heavily used, large result set query would be. A quick question.

What's the SQL Server's memory settings?

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

BrerSQL SSCommitted Points: 1672 More actions · Answer 2

Min server Memory (MB) - 30720

Max Server Memory (MB) - 102400

Perfmon shows SQL Server:memory Manager:Total Server Memory (KB) at 104,857,600 ( or 102400 MB)

michal_marek Old Hand Points: 336 More actions · Answer 3

First question:

Are You set "lock pages in memory" in gpedit.msc for account sqlservices?

Twice:

what are you set RAID for Database, TempDb, Log Files... ? It is for separates LUN and spindles?

Best regards,

Michal Marek

Best regards,
Michał Marek

BrerSQL SSCommitted Points: 1672 More actions · Answer 4

michal_marek (10/12/2010)
First question:
Are You set "lock pages in memory" in gpedit.msc for account sqlservices?
Twice:
what are you set RAID for Database, TempDb, Log Files... ? It is for separates LUN and spindles?

Yes, we are set for "lock pages in memory".

One question...we have this set via the Domain Admins group which our SQL Service account is part of (not recommended I know). Would this be a problem? Do I need to set the policy on the domain account directly?

There is a lot of conflicting documentation surrounding this policy on 64bit machines. I have read that it is no longer needed on 64bit machines? We have it set to be safe.

We are 0+1 across the board accept for the c: which is 1.

Tempdb has its own LUN with its own set of spindles as does log, data and system DB's

thanks for the reply, let me know if there is anything else you can think of.

I am working with Microsft to caputre statistics durring the next degradation occurance.

Please keep the questions and comments flowing, any help is appreciated.

I will post any solutions that I come across.

michal_marek Old Hand Points: 336 More actions · Answer 5

Yes, we are set for "lock pages in memory".
One question...we have this set via the Domain Admins group which our SQL Service account is part of (not recommended I know). Would this be a problem? Do I need to set the policy on the domain account directly?

No, i think so 🙂

There is a lot of conflicting documentation surrounding this policy on 64bit machines. I have read that it is no longer needed on 64bit machines? We have it set to be safe.

This set is for "lock" pages in memory and to avoid Windows and other application "memory pressure",

We are 0+1 across the board accept for the c: which is 1.
Tempdb has its own LUN with its own set of spindles as does log, data and system DB's
thanks for the reply, let me know if there is anything else you can think of.
I am working with Microsoft to caputre statistics durring the next degradation occurance.
Please keep the questions and comments flowing, any help is appreciated.
I will post any solutions that I come across.

...

My suggestion:

MS DTC, Check in PerfMon Counters for it, statistic, Operations IOs ...

Other Check:

Average disk queue lenght (recommendation is <2 ms per physical disk in LUN/RAID for Database, for Log and TempDb...

Please, read this paper: http://www.servicesorientedstorage.com/assets/pdf/tuning-microsoft-sql-server-2005-performance-wp.pdf

Use PAL tool to chceck bottlenack disk subsystem http://pal.codeplex.com/

Best regards,
Michał Marek

michal_marek Old Hand Points: 336 More actions · Answer 6

Whot is set in "sp_configure" ?

sp_configure 'show advanced options', 1;

GO

RECONFIGURE WITH OVERRIDE;

GO

sp_configure

GO

Best regards,
Michał Marek

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 7

1) disk queue length is a useless measure of IO performance on modern hardware. avg disk sec/read and /write are best.

2) do a fileIO stall analysis and wait stat analysis during the event (and before to get a baseline). andy kelly has a presentation/scripts to help with this, although I use track_waitstats_2005 for the waits. There is a whitepaper from microsoft on this too. search for sql server 2005 waits and queues

3) do NOT use lock pages in memory without reading this first: http://support.microsoft.com/kb/918483

4) have you analyzed network utilization?

5) sound like you are doing a lot of remote desktop sessions. why?

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

BrerSQL SSCommitted Points: 1672 More actions · Answer 8

NameMinMaxConfigValRunVal

Ad Hoc Distributed Queries0100

affinity I/O mask-2147483648214748364700

affinity mask-2147483648214748364700

affinity64 I/O mask-2147483648214748364700

affinity64 mask-2147483648214748364700

Agent XPs0111

allow updates0100

awe enabled0100

blocked process threshold08640055

c2 audit mode0100

clr enabled0100

common criteria compliance enabled0100

cost threshold for parallelism03276755

cross db ownership chaining0100

cursor threshold-12147483647-1-1

Database Mail XPs0111

default full-text language0214748364710331033

default language0999900

default trace enabled0111

disallow results from triggers0100

fill factor (%)010000

ft crawl bandwidth (max)032767100100

ft crawl bandwidth (min)03276700

ft notify bandwidth (max)032767100100

ft notify bandwidth (min)03276700

index create memory (KB)704214748364700

in-doubt xact resolution0200

lightweight pooling0100

locks5000214748364700

max degree of parallelism06444

max full-text crawl range025644

max server memory (MB)162147483647102400102400

max text repl size (B)021474836476553665536

max worker threads1283276700

media retention036500

min memory per query (KB)512214748364720482048

min server memory (MB)021474836473072030720

nested triggers0111

network packet size (B)5123276740964096

Ole Automation Procedures0111

open objects0214748364700

PH timeout (s)136006060

precompute rank0100

priority boost0100

query governor cost limit0214748364700

query wait (s)-12147483647-1-1

recovery interval (min)03276700

remote access0111

remote admin connections0100

remote login timeout (s)021474836472020

remote proc trans0100

remote query timeout (s)02147483647600600

Replication XPs0100

scan for startup procs0100

server trigger recursion0111

set working set size0100

show advanced options0111

SMO and DMO XPs0111

SQL Mail XPs0111

transform noise words0100

two digit year cutoff1753999920492049

user connections03276700

user options03276700

Web Assistant Procedures0111

xp_cmdshell0100

BrerSQL SSCommitted Points: 1672 More actions · Answer 9

1) disk queue length is a useless measure of IO performance on modern hardware. avg disk sec/read and /write are best.

Yes, am am now capturing and trending these numbers daily.

2) do a fileIO stall analysis and wait stat analysis during the event (and before to get a baseline). andy kelly has a presentation/scripts to help with this, although I use track_waitstats_2005 for the waits. There is a whitepaper from microsoft on this too. search for sql server 2005 waits and queues

Yes, I have set up perfmons to capure fileIO stats, wait stats etc. If this does re-occur, I am ready to compare healthy to degraded

3) do NOT use lock pages in memory without reading this first: http://support.microsoft.com/kb/918483%5B/quote%5D

I've looked into this as well as consulted MS on this, we have it set as a recomendation from MS.

4) have you analyzed network utilization?

That is monitored daily by NW engineer, but I have not monitored from the servers perspective. I have included Network stats in the perfmon traces.

5) sound like you are doing a lot of remote desktop sessions. why?

I wouldn't say we do a lot. This is a remote server and RDC is the primary means for logging into the server for debugging and administrator tasks. When we were debugging this issue it was obvious that the RDC was not responding as expected.

I appreciate the input and questions.

It sounds like I am on the right track for diagnosis. I found the MS White papper on tracking wait stats and I plan on giving that an in depth look.

Tomorrow is the day where we might expect a re-occurance. I will post my findings if it does occur.

BrerSQL SSCommitted Points: 1672 More actions · Answer 10

My Second question....Can frequent use of FOR XML RAW cause a SYSTEM wide issue? Maybe memory related?
I know this is a complicated issue.
Any advice would be greatly appreciated.

I did a no-no and piggy backed two questions in the same topic...I am going to post this question in a more suitable forum.

notShai SSC Enthusiast Points: 196 More actions · Answer 11

hi, can you please leave a link to the new thread ?

i am following this 🙂

BrerSQL SSCommitted Points: 1672 More actions · Answer 12

Final followup to this thread in case anyone was interested....

We ended up having the problem re-occur and neither Microsoft nor I could determine the source of the issue even with all of the monitors I had going.

We ended up failing SQL Server to the secondary side of the cluster and we have not seen the problem since.

We are also seeing some very strange results in the form of extremely High Disk Response times coming from the simple Task Manager Resource Monitor. Microsoft has indicated that the monitor results are incorrect as they differ from the perf mon results. The second side of the cluster is built with the exact same configuration and we do not see these anomolies.

I agree with Microsoft that the numbers in the Resource Monitor are probably off, but I believe that this could be an indication of something else going on either with a corrupted OS install/patch or hardware issue.

We plan on rebuilding this machine at a later time to see if the problems go away.

Nadrek SSC-Insane Points: 20039 More actions · Answer 13

Eric1/2aB (12/20/2010)
Final followup to this thread in case anyone was interested....
We ended up having the problem re-occur and neither Microsoft nor I could determine the source of the issue even with all of the monitors I had going.
We ended up failing SQL Server to the secondary side of the cluster and we have not seen the problem since.
We are also seeing some very strange results in the form of extremely High Disk Response times coming from the simple Task Manager Resource Monitor. Microsoft has indicated that the monitor results are incorrect as they differ from the perf mon results. The second side of the cluster is built with the exact same configuration and we do not see these anomolies.
I agree with Microsoft that the numbers in the Resource Monitor are probably off, but I believe that this could be an indication of something else going on either with a corrupted OS install/patch or hardware issue.
We plan on rebuilding this machine at a later time to see if the problems go away.

My advice for the "corrupted OS install/patch": If you're really concerned, do a binary (or at least a hash) comparison of all OS files between, in this case, your primary and secondary sides of the cluster. A corrupt file is by definition different than the original. Jacksum[/url] is a Java-based tool to do a variety of hashes, and there are many others (md5sum is an old, out of date Unix favorite).

I hear a lot of "X must be corrupt" tossed around on issues that nobody's figured out; many times an actual investigation of X shows this to translate directly to "I have no idea", as X is identical on a working and a nonworking system (or a before/after snapshot, whichever).