October 11, 2010 at 9:31 am
I do not know if this is a SQL Server issue but I am hoping that maybe someone has experienced something similar and might be able to give me some advice on how diagnose this.
For starters I have a clustered SQL Server environment running on Windows Enterprise Server 2008 r2 (64bit). We recently rebuilt the server to 2008 r2 (JULY)
8 Quad Core 2.4Ghz processors
256 GB RAM
Version SQL Server 2005 SP3CU7 (4273)... SQL Server has 100 Gig RAM dedicated to it
A couple of weeks ago we had some reports of sluggishness on our application.
We identified the dedicated database server as the source of the sluggishness but we were unable to identify a smoking gun. We realized that the problem had to be the SQL Server machine because our remote desktop connections were taking very long to connect. We could not identify any resources that were starved on the machine.
All my DBA monitors were showing a little high in values but nothing that I can say that would create the kind of degradation that we were experiencing.
CPU utilization across the board was running at 30% to 35%, maybe 5% higher than normal.
Memory looked good but I can not say that I did a good job of analysis here...I need some better monitors and I am researching that now (Any advice for researching memory issues would be appreciated)
CPU Queue Waits looked fine
Normal Blocking ocurrances
SQL Traces showed nothing completely out of the ordinary as far as response times
Overall nothing that I could see inside SQL Server that I could say was causing the issue.
We ended up failing over SQL Server to the B side of the cluster and the problem resolved itself immediately for both the appliciations accessing SQL Server and the Remote Desktop Connection accessing the primary server. We re-booted the A side of the cluster just to be safe and then failed SQL Server back to its primary location and all was fine.
Now, two weeks later we are starting to see slight signs of the same problem.
I have opened a case with Microsoft But I have not had any satisfactory results yet.
My first Question...What resources beyond the obvious should I be monitoring and how?
Second Part...Part of a performance analysis that I did last year identified one stored procedure that was using high CPU cycles and being called quite frequently. I traced the problem to a FOR XML RAW statement. One of the applications that has been seeing the highest amount of performance degradation calls this procedure often. I have called this out to the developers in hopes that we can redesign this so that we do not use the FOR XML RAW. Note...peak usage for our sytem starts around September and I am starting to see a pattern ever two weeks.
My Second question....Can frequent use of FOR XML RAW cause a SYSTEM wide issue? Maybe memory related?
I know this is a complicated issue.
Any advice would be greatly appreciated.
October 11, 2010 at 11:34 am
Eric, I believe you're on the right track regarding memory.
FOR XML by itself is not going to the problem, any more then any heavily used, large result set query would be. A quick question.
What's the SQL Server's memory settings?
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 11, 2010 at 2:09 pm
Min server Memory (MB) - 30720
Max Server Memory (MB) - 102400
Perfmon shows SQL Server:memory Manager:Total Server Memory (KB) at 104,857,600 ( or 102400 MB)
October 12, 2010 at 4:36 am
First question:
Are You set "lock pages in memory" in gpedit.msc for account sqlservices?
Twice:
what are you set RAID for Database, TempDb, Log Files... ? It is for separates LUN and spindles?
Best regards,
Michal Marek
Best regards,
Michał Marek
October 12, 2010 at 6:47 am
michal_marek (10/12/2010)
First question:Are You set "lock pages in memory" in gpedit.msc for account sqlservices?
Twice:
what are you set RAID for Database, TempDb, Log Files... ? It is for separates LUN and spindles?
Yes, we are set for "lock pages in memory".
One question...we have this set via the Domain Admins group which our SQL Service account is part of (not recommended I know). Would this be a problem? Do I need to set the policy on the domain account directly?
There is a lot of conflicting documentation surrounding this policy on 64bit machines. I have read that it is no longer needed on 64bit machines? We have it set to be safe.
We are 0+1 across the board accept for the c: which is 1.
Tempdb has its own LUN with its own set of spindles as does log, data and system DB's
thanks for the reply, let me know if there is anything else you can think of.
I am working with Microsft to caputre statistics durring the next degradation occurance.
Please keep the questions and comments flowing, any help is appreciated.
I will post any solutions that I come across.
October 12, 2010 at 7:40 am
Yes, we are set for "lock pages in memory".
One question...we have this set via the Domain Admins group which our SQL Service account is part of (not recommended I know). Would this be a problem? Do I need to set the policy on the domain account directly?
No, i think so 🙂
There is a lot of conflicting documentation surrounding this policy on 64bit machines. I have read that it is no longer needed on 64bit machines? We have it set to be safe.
This set is for "lock" pages in memory and to avoid Windows and other application "memory pressure",
We are 0+1 across the board accept for the c: which is 1.
Tempdb has its own LUN with its own set of spindles as does log, data and system DB's
thanks for the reply, let me know if there is anything else you can think of.
I am working with Microsoft to caputre statistics durring the next degradation occurance.
Please keep the questions and comments flowing, any help is appreciated.
I will post any solutions that I come across.
...
My suggestion:
MS DTC, Check in PerfMon Counters for it, statistic, Operations IOs ...
Other Check:
Average disk queue lenght (recommendation is <2 ms per physical disk in LUN/RAID for Database, for Log and TempDb...
Please, read this paper: http://www.servicesorientedstorage.com/assets/pdf/tuning-microsoft-sql-server-2005-performance-wp.pdf
Use PAL tool to chceck bottlenack disk subsystem http://pal.codeplex.com/
Best regards,
Michał Marek
October 12, 2010 at 8:02 am
Whot is set in "sp_configure" ?
sp_configure 'show advanced options', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO
sp_configure
GO
Best regards,
Michał Marek
October 13, 2010 at 8:11 am
1) disk queue length is a useless measure of IO performance on modern hardware. avg disk sec/read and /write are best.
2) do a fileIO stall analysis and wait stat analysis during the event (and before to get a baseline). andy kelly has a presentation/scripts to help with this, although I use track_waitstats_2005 for the waits. There is a whitepaper from microsoft on this too. search for sql server 2005 waits and queues
3) do NOT use lock pages in memory without reading this first: http://support.microsoft.com/kb/918483
4) have you analyzed network utilization?
5) sound like you are doing a lot of remote desktop sessions. why?
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
October 14, 2010 at 8:15 am
NameMinMaxConfigValRunVal
Ad Hoc Distributed Queries0100
affinity I/O mask-2147483648214748364700
affinity mask-2147483648214748364700
affinity64 I/O mask-2147483648214748364700
affinity64 mask-2147483648214748364700
Agent XPs0111
allow updates0100
awe enabled0100
blocked process threshold08640055
c2 audit mode0100
clr enabled0100
common criteria compliance enabled0100
cost threshold for parallelism03276755
cross db ownership chaining0100
cursor threshold-12147483647-1-1
Database Mail XPs0111
default full-text language0214748364710331033
default language0999900
default trace enabled0111
disallow results from triggers0100
fill factor (%)010000
ft crawl bandwidth (max)032767100100
ft crawl bandwidth (min)03276700
ft notify bandwidth (max)032767100100
ft notify bandwidth (min)03276700
index create memory (KB)704214748364700
in-doubt xact resolution0200
lightweight pooling0100
locks5000214748364700
max degree of parallelism06444
max full-text crawl range025644
max server memory (MB)162147483647102400102400
max text repl size (B)021474836476553665536
max worker threads1283276700
media retention036500
min memory per query (KB)512214748364720482048
min server memory (MB)021474836473072030720
nested triggers0111
network packet size (B)5123276740964096
Ole Automation Procedures0111
open objects0214748364700
PH timeout (s)136006060
precompute rank0100
priority boost0100
query governor cost limit0214748364700
query wait (s)-12147483647-1-1
recovery interval (min)03276700
remote access0111
remote admin connections0100
remote login timeout (s)021474836472020
remote proc trans0100
remote query timeout (s)02147483647600600
Replication XPs0100
scan for startup procs0100
server trigger recursion0111
set working set size0100
show advanced options0111
SMO and DMO XPs0111
SQL Mail XPs0111
transform noise words0100
two digit year cutoff1753999920492049
user connections03276700
user options03276700
Web Assistant Procedures0111
xp_cmdshell0100
October 14, 2010 at 8:35 am
1) disk queue length is a useless measure of IO performance on modern hardware. avg disk sec/read and /write are best.
Yes, am am now capturing and trending these numbers daily.
2) do a fileIO stall analysis and wait stat analysis during the event (and before to get a baseline). andy kelly has a presentation/scripts to help with this, although I use track_waitstats_2005 for the waits. There is a whitepaper from microsoft on this too. search for sql server 2005 waits and queues
Yes, I have set up perfmons to capure fileIO stats, wait stats etc. If this does re-occur, I am ready to compare healthy to degraded
3) do NOT use lock pages in memory without reading this first: http://support.microsoft.com/kb/918483%5B/quote%5D
I've looked into this as well as consulted MS on this, we have it set as a recomendation from MS.
4) have you analyzed network utilization?
That is monitored daily by NW engineer, but I have not monitored from the servers perspective. I have included Network stats in the perfmon traces.
5) sound like you are doing a lot of remote desktop sessions. why?
I wouldn't say we do a lot. This is a remote server and RDC is the primary means for logging into the server for debugging and administrator tasks. When we were debugging this issue it was obvious that the RDC was not responding as expected.
I appreciate the input and questions.
It sounds like I am on the right track for diagnosis. I found the MS White papper on tracking wait stats and I plan on giving that an in depth look.
Tomorrow is the day where we might expect a re-occurance. I will post my findings if it does occur.
October 14, 2010 at 8:38 am
My Second question....Can frequent use of FOR XML RAW cause a SYSTEM wide issue? Maybe memory related?
I know this is a complicated issue.
Any advice would be greatly appreciated.
I did a no-no and piggy backed two questions in the same topic...I am going to post this question in a more suitable forum.
October 21, 2010 at 8:41 am
hi, can you please leave a link to the new thread ?
i am following this 🙂
December 20, 2010 at 9:14 am
Final followup to this thread in case anyone was interested....
We ended up having the problem re-occur and neither Microsoft nor I could determine the source of the issue even with all of the monitors I had going.
We ended up failing SQL Server to the secondary side of the cluster and we have not seen the problem since.
We are also seeing some very strange results in the form of extremely High Disk Response times coming from the simple Task Manager Resource Monitor. Microsoft has indicated that the monitor results are incorrect as they differ from the perf mon results. The second side of the cluster is built with the exact same configuration and we do not see these anomolies.
I agree with Microsoft that the numbers in the Resource Monitor are probably off, but I believe that this could be an indication of something else going on either with a corrupted OS install/patch or hardware issue.
We plan on rebuilding this machine at a later time to see if the problems go away.
December 22, 2010 at 10:53 am
Eric1/2aB (12/20/2010)
Final followup to this thread in case anyone was interested....We ended up having the problem re-occur and neither Microsoft nor I could determine the source of the issue even with all of the monitors I had going.
We ended up failing SQL Server to the secondary side of the cluster and we have not seen the problem since.
We are also seeing some very strange results in the form of extremely High Disk Response times coming from the simple Task Manager Resource Monitor. Microsoft has indicated that the monitor results are incorrect as they differ from the perf mon results. The second side of the cluster is built with the exact same configuration and we do not see these anomolies.
I agree with Microsoft that the numbers in the Resource Monitor are probably off, but I believe that this could be an indication of something else going on either with a corrupted OS install/patch or hardware issue.
We plan on rebuilding this machine at a later time to see if the problems go away.
My advice for the "corrupted OS install/patch": If you're really concerned, do a binary (or at least a hash) comparison of all OS files between, in this case, your primary and secondary sides of the cluster. A corrupt file is by definition different than the original. Jacksum[/url] is a Java-based tool to do a variety of hashes, and there are many others (md5sum is an old, out of date Unix favorite).
I hear a lot of "X must be corrupt" tossed around on issues that nobody's figured out; many times an actual investigation of X shows this to translate directly to "I have no idea", as X is identical on a working and a nonworking system (or a before/after snapshot, whichever).
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply