ACCESS_VIOLATION restarts SQLServer Service

  • We got this error at 10:13AM this morning

    SqlDumpExceptionHandler: Process 2352 generated fatal exception c0000005 EXCEPTION_ACCESS_VIOLATION. SQL Server is terminating this process.

    about 10 seconds later we got this error:

    SQL Server is aborting. Fatal exception c0000005 caught.

    And the SQL Server Service stopped and restarted. The next log entries start at 10:15AM and show SQL Server doing a normal start up after the service was stopped.

    I have been searching all day for answers to why this happened and how to fix the problem. We are running SQL Server 2000 SP3 clustered active/active. We reboot our servers once a month. We have RAID 10 for our hard drive set up.

    As far as we can tell no failover took place. Does anyone know why? From the logs it looks like SQL Server was down for two minutes (and I know the databases were not available for those two minutes). Is SQL Server completely down until it starts logging in the SQL Server log or is it technically up sooner than that (such that the cluster manager can detect it and decide not to fail over)?

    Has anyone seen these errors before? Any suggestions on what I need to do to ensure it doesn't happen again?

    Also when I run DBCC CHECKDB and DBCC CHECKTABLE I get these errors:

    Server: Msg 8951, Level 16, State 1, Line 1

    Table error: Table 'tblEmailRecipients' (ID 1212776461). Missing or invalid key in index 'tblEmailRecipients10' (ID 8) for the row:

    Server: Msg 8955, Level 16, State 1, Line 1

    Data row (1:3455:50) identified by (RID = (1:3455:50) EmailRecipientID = 30648493) has index values (RegionID = 1 and CompanyID = 100401 and UserID = 1231544 and InstID = 22223 and StaffID = 1 and EmailID = 1 and EmailRecipientID = 30648493).

    DBCC results for 'tblEmailRecipients'.

    There are 5221202 rows in 104533 pages for object 'tblEmailRecipients'.

    CHECKTABLE found 0 allocation errors and 1 consistency errors in table 'tblEmailRecipients' (object ID 1212776461).

    repair_fast is the minimum repair level for the errors found by DBCC CHECKTABLE (dbEmail.dbo.tblEmailRecipients ).

    When I run a query to select the row and using the specified index (indid 8) and another query to select the row using another index and compare them they are identical.

    I have seen some posts that are similar to this (but not exactly the same) and they recommend restoring from a backup and then ensure there is no corruption there. Is this problem that severe that there is no other way to resolve the problem?

    I posted some other problems with corrupt indexes and tables last Wed (17 Dec 2003) to the general forum. I don't know if they are relevant or not to what is going on here except to say the problem in a database here is in the same database on the same server where those corrupt indexes were found.

    Any help will be greatly appreciated.

    Robert W. Marda

    SQL Programmer

    bigdough.com

    The world’s leading capital markets contact database and software platform.

    Robert W. Marda
    Billing and OSS Specialist - SQL Programmer
    MCL Systems

  • Honestly Robert, I'm not sure what happened, but I wouldn't mess with this. Call PSS.

    Steve Jones

    sjones@sqlservercentral.com

    http://www.sqlservercentral.com/columnists/sjones

    The Best of SQL Server Central.com 2002 - http://www.sqlservercentral.com/bestof/

    http://www.dkranch.net

  • Thats a fact call Microsoft. I would run the dbcc checkdb with the repair_fast that is a non distructive repair and should not cause data loss. Any time you have a access violation and a abort without any clue in the logs don't mess around you data is at risk. MS is going to want a sqldiag.exe dump and the event logs from the server event logs usually system and application logs log as much as you can from the time frame that the error took place at talk to people try to get as much information as you can. It will make the guys at PSS happier and easier to diagnose the problem.

    Wes

  • If your database was upgraded from 7.0 to 2000, read KB http://support.microsoft.com/default.aspx?scid=kb;en-us;298806&Product=sql

    I would review server application/system logs for any possible errors which could cause the service restart.

    In general, Access Violation in SQL Server indicates there are some bugs. For that, you should follow Steven suggestion, Call PSS.

    Can you post of "select @@version" here? Any particular hoxfix are you aware of which has been applied to your SQL Server?

    If the index is not clustered index, I would simple drop it and recreate it.

    Edited by - allen_cui on 12/19/2003 5:53:56 PM

  • Was an application dump file created?

    K. Brian Kelley, GSEC

    http://www.truthsolutions.com/

    Author: Start to Finish Guide to SQL Server Performance Monitoring

    http://www.netimpress.com/

    K. Brian Kelley
    @kbriankelley

  • What (or who) is PSS?

    I was planning to drop and recreate the index just now. I waited till now, because the table in question has over 5 million rows in it and the index showing corruption has 6 int columns in it. I ran DBCC CHECKTABLE AND DBCC CHECKDB and both came back clean. How can that fix itself? Our optimization and integrity checks that are part of our maintenance plans only run on Sunday and I am the only one that would do anything to fix this right now. Our DBA is in India on his honeymoon so I know he didn't do anything.

    Where do I find or create this sqldiag.exe dump? How do I find an application dump file if it was created?

    We upgraded from SQL Server 7.0 to SQL Server 2000 a little over a year ago. However, I believe this database was added after the upgrade.

    Won't Microsoft charge me for calling them?

    Is it ok to wait until Monday now?

    Please forgive all the beginner questions, I usually don't do this kind of DBA work. I am a SQL Programmer (according to my official title).

    Robert W. Marda

    SQL Programmer

    bigdough.com

    The world’s leading capital markets contact database and software platform.

    Robert W. Marda
    Billing and OSS Specialist - SQL Programmer
    MCL Systems

  • Microsoft PSS = Product Support Services. If you have a Premier agreement, you could start a ticket through the Premier Web Site but you should also contact your TAM. Otherwise:

    http://support.microsoft.com/default.aspx?scid=fh;en-us;Prodoffer41a&sd=GN

    If you're going to run SQLDiag.exe on a clustered instance, you're going to need to be aware of the fact the command shell has to run on the same physical server as where the instance is. If you have xp_cmdshell enabled, not really a big deal. More here:

    http://support.microsoft.com/default.aspx?scid=kb;en-us;233332

    K. Brian Kelley, GSEC

    http://www.truthsolutions.com/

    Author: Start to Finish Guide to SQL Server Performance Monitoring

    http://www.netimpress.com/

    K. Brian Kelley
    @kbriankelley

  • Can't explain the errors, but you would not expect it to fail over in this scenario. If you go into cluster manager and look at the properties for the SQL Server service, it's probably set to retry 3 times (that's the default). So if SQL Server stops, Cluster Server will attempt to restart on the same machine. If it fails three times within a certain time limit or it is unable to start up on the same machine, it will fail over.

  • I told my manager about the recommendation many of you gave to contact Microsoft and he said to wait for our DBA to return.

    The problem happened again today. I found a file called SQLDump0015.txt on this server. Does anyone know how to understand any of it.

    Robert W. Marda

    SQL Programmer

    bigdough.com

    The world’s leading capital markets contact database and software platform.

    Robert W. Marda
    Billing and OSS Specialist - SQL Programmer
    MCL Systems

  • We were getting the exact same access violation error/service restart several times over the last few weeks.

    We were able to trace it to some large one-shot ad-hoc OPENQUERYs against a remote Oracle database. A small snippet from one of the SQLDumpxxx.txt files:

    * Short Stack Dump

    * 06588187 Module(ORA73+00038187) (vsnupr+00021F7C)

    * 0657FD29 Module(ORA73+0002FD29) (vsnupr+00019B1E)

    * 06594499 Module(ORA73+00044499) (vsnupr+0002E28E)

    * 0657FB95 Module(ORA73+0002FB95) (vsnupr+0001998A)

    ...showed that the ODBC connection was using the Oracle 7.3 provider instead of the Net8 provider.

    You might want to check if your server or SQLAgent is connecting to an external source that might not be as stable as necessary.

    -Eddie

    eddie@mirai-tech.com

    Edited by - juntasql on 12/29/2003 11:30:20 AM

    Eddie Wuerch
    MCM: SQL

  • We only use SQL Server 2000 SP3.  We had different problems with OPENQUERY a year or more ago and coded them all out.

    Our DBA returned on 29 Dec 2003.  He has been trying various things with our network engineer to isolate this problem.  Even though I have mentioned it could be useful to contact Microsoft they have not done so yet.  So I am left to investigate the problem as best I can and am posting some additional questions and info here.

    The same clustered server has rebooted siting the same Access violation error on 9 Jan 2004 and 21 Jan 2004.  Each time SQL Server logs entries to a file called exception.log and creates one or more files with a names like SQLDump0055.txt and SQLDump0056.txt.  There are also coresponding files that end with .mdmp such as SQLDump0056.mdmp.  Will the .mdmp file have more or additional info that I won't find in the .txt files?  How do I use the .mdmp files?

    In an attempt to determine if the problem is hardware related our DBA and Network Engineer have had two of our clusters running in failed over mode.  Since we got the problem again today I guess that should eliminate all hardware from being suspect except the hard drives, RAID configuration and anything else related to the hard drives that doesn't get swapped when you fail over a clustered server.

    The problem seems to revolve around a couple tables in one database.  All stored procedure calls found in the dump files use one of two tables (if not both) from that database in question.  One thing we are doing now is adding the query hint OPTION (MAXDOP 1) to all queries found in the SP calls in these dump files if the query uses one of the two suspect tables.  This is because of the following knowledge base article: 

    http://support.microsoft.com/default.aspx?scid=kb;en-us;311104&Product=sql2k

    Each time we get new dump files after a reboot the stored procedures are ones where we have not added the query hint OPTION (MAXDOP 1).

    If anyone has any ideas, I'm anxious to hear them.  If not then this is just for your info and may help someone in the future, once we get the problem resolved.

    Robert W. Marda
    Billing and OSS Specialist - SQL Programmer
    MCL Systems

  • Our DBA just opened a ticket with Microsoft.  Hopefully, we'll get some useful info from them.

    Robert W. Marda
    Billing and OSS Specialist - SQL Programmer
    MCL Systems

  • Robert,

    So what was the final resolution for the problem?

     

  • Forgive me for not posting the solution.  I intended to, but must have forgotten.  I just confirmed with our DBA that what I remembered as the solution was correct.

    We had hard drive disk caching on and apperently that can cause problems.  Our Senior Systems Engineer shut it off and we have not had this problem since.

    Robert W. Marda
    Billing and OSS Specialist - SQL Programmer
    MCL Systems

  • Thank you, Robert!

    The strange thing here is that your problem started to reveal itself from a certain point. I guess it is related to increased usage of your database(s?) when you began to have more users, more parallel queries at the same time, and as a result, heavier CPU and storage subsystem usage...

    We've got the problem (single instance) with exactly the same behavior last week (Thursday, September 16th) and cannot find the root of the problem. We have dedicated SQL 2000 Enterprise/Win2K server (all the latest service packs installed) running on a dual 2.8 Xeon CPU with Hyper-Threading (OS see them as 4 CPUs), 6GB RAM, RAID (RAID 1 for OS and SQL binaries; RAID 10 for SQL data) machine. We have this setup in place for about 1.5 years and never had problems before and after (so far). I checked the RAID configuration after I read your post, Robert, and YES RAID is configured for both Write and Read caching.

    Robert, could you get a little more details from your Systems Engineer? Did he disable just Write or both Read and Write (we still want to get a maximum from our hardware, otherwise what is the point to buy costly boxes and then disable their advanced features)? Did he do anything else to achieve stability and eliminate the problem?

    Victor Andreev

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply