January 2, 2008 at 6:15 am
Hi All,
I have noticed over the last couple of weeks that one of our most intensive databases starts to produce stack dumps. Below is the summary of the actual configuration
SQL 2005 SP2
Windows 2003 Server
32GB of memory, AWE configured to use 24GB
The problem is that I am not very sure if the AWE is actually helping us or is the one causing problems, the issue appears to be memory problems, but I’m not sure of how to diagnose if we have a memory problem.
What normally happens is that we have a procedure which runs from time to time, blocks other processes on the server, and when we eventually decide to kill it, rollback takes forever, then we end up rebooting the server, after rebooting, there are loads of stack dump errors in the error log.
Anyone has any idea on this problem ?
January 2, 2008 at 8:47 am
If you get stack dumps on a regular base you should contact Microsoft PSS and open a support incident. The dump might show (part of) the last query which caused the dump, but even then it can be a combination of factors.
The problem could be memory related but it could also be something totally different.
[font="Verdana"]Markus Bohse[/font]
January 2, 2008 at 8:54 am
Hi
Thanks for the post, you are right. sometimes one can see the query within the stack dump, one theory could be that the code is broken, but this has been running for years without any problems and hasn't changed over the years..
Also, the Microwst Pss you talked about, does one have to pay for this service ?
Cheers
January 2, 2008 at 10:09 am
Unfortunately you don't get them free.
Depending on your support contract you may get a certain number of support calls per year before they become chargable.
Check this link out.
http://support.microsoft.com/default.aspx?scid=fh;EN-US;offerprophone&sd=tech#faq607
January 3, 2008 at 2:18 am
Hi Dean,
While your trying to get an ms contract you could have a look in the logs folder and see if you have any SQLDumpXXX.txt files and post a couple of the newest ones - if the server has sentitive data like credit cards in a db then have a read through first!!
Ed
January 3, 2008 at 4:20 am
Hi All,
Thanks for the replies.
Below is the detail of the dump log file you asked for
•2007-12-23 10:37:13.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\CompanyProfile_200712.ndf] in database [CompanyProfile] (12). The OS file handle is 0x000008B8. The offset of the latest long I/O is: 0x000025fe4e2000
•2007-12-23 10:37:13.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\FusionReference.ndf] in database [CompanyProfile] (12). The OS file handle is 0x00000894. The offset of the latest long I/O is: 0x00000040b1c000
•2007-12-23 10:37:13.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\Fusion.ndf] in database [CompanyProfile] (12). The OS file handle is 0x00000890. The offset of the latest long I/O is: 0x00000039eac000
•2007-12-23 10:37:13.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\FusionMain.mdf] in database [CompanyProfile] (12). The OS file handle is 0x00000864. The offset of the latest long I/O is: 0x00000006526000
•2007-12-23 10:37:23.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\MSSQL\MSDBLog.ldf] in database [msdb] (4). The OS file handle is 0x00000880. The offset of the latest long I/O is: 0x00000000979000
•2007-12-23 10:38:39.90 Server Using 'dbghelp.dll' version '4.0.5'
•2007-12-23 10:41:48.71 spid87 A time-out occurred while waiting for buffer latch -- type 2, bp 051A6AB8, page 10:19922697, stat 0x7c0040d, database id: 12, allocation unit Id: 484272212956179/202795456246057, task 0x00C143E8 : 7, waittime 300, flags 0x1a, owning task 0x00DECD48. Not continuing to wait.
•2007-12-23 10:42:40.46 spid71 A time-out occurred while waiting for buffer latch -- type 2, bp 0C0C9424, page 1:12949, stat 0x4c1010f, database id: 12, allocation unit Id: 327680/281474977038336, task 0x008DBA68 : 0, waittime 300, flags 0x1a, owning task 0x00C14208. Not continuing to wait.
•2007-12-23 10:46:24.32 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\MSSQL\mssqlsystemresource.mdf] in database [mssqlsystemresource] (32767). The OS file handle is 0x00000600. The offset of the latest long I/O is: 0x000000005e0000
•2007-12-23 10:46:24.32 spid2s SQL Server has encountered 37 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\CompanyProfile_200712.ndf] in database [CompanyProfile] (12). The OS file handle is 0x000008B8. The offset of the latest long I/O is: 0x0000252b198000
•2007-12-23 10:46:24.32 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\MSSQL\master.mdf] in database [master] (1). The OS file handle is 0x000005D4. The offset of the latest long I/O is: 0x00000000178000
•2007-12-23 10:46:24.34 Server ***Unable to get thread context - no pss
•2007-12-23 10:46:24.36 Server * *******************************************************************************
•2007-12-23 10:46:24.36 Server *
•2007-12-23 10:46:24.36 Server * BEGIN STACK DUMP:
•2007-12-23 10:46:24.36 Server * 12/12/07 10:46:24 spid 0
•2007-12-23 10:46:24.36 Server *
•2007-12-23 10:46:24.36 Server * Non-yielding Scheduler
•2007-12-23 10:46:24.36 Server *
•2007-12-23 10:46:24.36 Server * *******************************************************************************
•2007-12-23 10:46:24.39 Server Stack Signature for the dump is 0x00000261
January 3, 2008 at 4:28 am
do you have any SQLDmpr####.mdmp files in your sql logs directory? Can you attach one or pm it?
January 3, 2008 at 4:31 am
Have you monitored perfmon? To start with, I'd look at the Physical Disk IO counters.
Does this happen all of the time or do you get periods between errors where performance is fine? If so, thats a good baseline to see when the errors occur in perfmon.
Are these SAN attached disks?
January 3, 2008 at 4:57 am
Dean Jones (1/3/2008)
Hi All,Thanks for the replies.
Below is the detail of the dump log file you asked for
•2007-12-23 10:37:13.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\CompanyProfile_200712.ndf] in database [CompanyProfile] (12). The OS file handle is 0x000008B8. The offset of the latest long I/O is: 0x000025fe4e2000
•2007-12-23 10:37:13.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\FusionReference.ndf] in database [CompanyProfile] (12). The OS file handle is 0x00000894. The offset of the latest long I/O is: 0x00000040b1c000
•2007-12-23 10:37:13.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\Fusion.ndf] in database [CompanyProfile] (12). The OS file handle is 0x00000890. The offset of the latest long I/O is: 0x00000039eac000
•2007-12-23 10:37:13.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\FusionMain.mdf] in database [CompanyProfile] (12). The OS file handle is 0x00000864. The offset of the latest long I/O is: 0x00000006526000
•2007-12-23 10:37:23.02 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\MSSQL\MSDBLog.ldf] in database [msdb] (4). The OS file handle is 0x00000880. The offset of the latest long I/O is: 0x00000000979000
•2007-12-23 10:38:39.90 Server Using 'dbghelp.dll' version '4.0.5'
•2007-12-23 10:41:48.71 spid87 A time-out occurred while waiting for buffer latch -- type 2, bp 051A6AB8, page 10:19922697, stat 0x7c0040d, database id: 12, allocation unit Id: 484272212956179/202795456246057, task 0x00C143E8 : 7, waittime 300, flags 0x1a, owning task 0x00DECD48. Not continuing to wait.
•2007-12-23 10:42:40.46 spid71 A time-out occurred while waiting for buffer latch -- type 2, bp 0C0C9424, page 1:12949, stat 0x4c1010f, database id: 12, allocation unit Id: 327680/281474977038336, task 0x008DBA68 : 0, waittime 300, flags 0x1a, owning task 0x00C14208. Not continuing to wait.
•2007-12-23 10:46:24.32 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\MSSQL\mssqlsystemresource.mdf] in database [mssqlsystemresource] (32767). The OS file handle is 0x00000600. The offset of the latest long I/O is: 0x000000005e0000
•2007-12-23 10:46:24.32 spid2s SQL Server has encountered 37 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Production\CompanyProfile\CompanyProfile_200712.ndf] in database [CompanyProfile] (12). The OS file handle is 0x000008B8. The offset of the latest long I/O is: 0x0000252b198000
•2007-12-23 10:46:24.32 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\MSSQL\master.mdf] in database [master] (1). The OS file handle is 0x000005D4. The offset of the latest long I/O is: 0x00000000178000
•2007-12-23 10:46:24.34 Server ***Unable to get thread context - no pss
•2007-12-23 10:46:24.36 Server *
This is prolem with you hardware,
It's not a SQL Server problem. download and run the SQLIOSim tool (see http://blogs.msdn.com/sqlserverstorageengine/archive/2006/10/06/SQLIOSim-available-for-download.aspx) to stress the IO subsystem and see what it throws up.
This error was introduced after SP4 and is an FYI more than a sql error ..
Kevin Brennan M.Sc MCDBA, MCSE, MCP+I
CodeMinkey
January 3, 2008 at 5:05 am
Kevin is right about the IO messages which take too long, but in my experience these messages don't automatically lead to a stack dump. I remember one case like this, there the batteries on the IO controller went dead and because of some HP firmware setting, the controller didn't use it's write cache any longer. Maybe worth checking too.
[font="Verdana"]Markus Bohse[/font]
January 3, 2008 at 5:31 am
SP4? He's running SQL 2005.
January 3, 2008 at 5:43 am
MarkusB (1/3/2008)
Kevin is right about the IO messages which take too long, but in my experience these messages don't automatically lead to a stack dump.
They don't. The stack dump seems to be from a hung scheduler. Very odd on SQL 2005.
*******************************************************************************
• 2007-12-23 10:46:24.36 Server *
• 2007-12-23 10:46:24.36 Server * BEGIN STACK DUMP:
• 2007-12-23 10:46:24.36 Server * 12/12/07 10:46:24 spid 0
• 2007-12-23 10:46:24.36 Server *
• 2007-12-23 10:46:24.36 Server * Non-yielding Scheduler
• 2007-12-23 10:46:24.36 Server *
• 2007-12-23 10:46:24.36 Server * *******************************************************************************
I will second the suggestion to call PSS. I doubt there's much we can do here other than speculate.
Also. have a read through the KB articles for all the post-SP2 hotfixes that are available. See if any match your symptoms.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 3, 2008 at 5:44 am
SRB (1/3/2008)
SP4? He's running SQL 2005.
Yes this feature was introduced with SQL 2000 Sp4 refer to
http://support.microsoft.com/kb/897284
I expect that they have not disabled this feature on 2005...:D
Kevin Brennan M.Sc, MCDBA, MCSE, MCP+I
CodeMinkey
January 4, 2008 at 8:45 am
We've experienced the same problem in our production environment. In our case we run SQL Server 2005 SP2 64bit, so it has nothing to do with AWE. We've encountered two different types of non-yielding scheduler dumps. We've logged the case to Microsoft support and it is under investigation now. MS support engineers are confident that the problem relates to I/O subsystem (we use Microsoft iSCSI initiator and NetApp storage). The I/O warnings you receive in SQL Server refer to slow I/O completion and may be related either to malfunctioning hardware or (as in our case) iSCSI initiator sending LUN reset commands to the storage (you can monitor the storage for such events). It turns out that one of our dumps is addressed by Cumulative Update Package 3 for SQL Server 2005 SP2 (the latest Cumulative Package is 5 and it was released two weeks ago). The other dump is still under investigation. So you may experience a complex issue related to both SQL Server and I/O subsystem.
Geyzersky Dmitry,
Senior Software Architect
January 7, 2008 at 4:43 pm
Hi Guys
Thanks very much for your help, it looks pretty much like an i/o problem, but also during some of the stack dumps that we receive in cases where a long running stored procedure was running but later forcefully stopped and it is literally taking 10 hours to rollback that we have to restart the service, we do receive stack dumps which contains SQL of the entire procedure that was killed, does anyone understand why this happens ?
Viewing 15 posts - 1 through 15 (of 20 total)
You must be logged in to reply to this topic. Login to reply