July 30, 2006 at 12:05 pm
Just posting this in the forum in case someone else gets these nasty errors.
I had a really nasty SQL server problem this week. I want to share the problem and the associated fix in case anyone else runs across it.
We have been having SQL performance problems off and on since I started my new position (April 2006). We have a pretty strong machine running SQL Server (HP ProLiant ML370). I had set up performance counters on the machine, but I didn't have time to analyze them consistently, and we have had so many problems with the application running against the DB that we tended to just blame the app.
Earlier this week we started getting the following errors:
Error: 3624, Severity: 20, State: 1.
SQL Server Assertion: File: <recbase.cpp>, line=1378 Failed Assertion = 'm_offBeginVar < m_SizeRec'.
2006-07-27 20:00:29.95 spid1 SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [D:\Microsoft SQL Server\MSSQL\Data\PRODE4SEKSL_Data.mdf] in database [PRODE4SEKSL] (5). The OS file handle is 0x00000484. The offset of the latest long IO is: 0x000000f6a12000
2006-07-27 20:27:07.14 spid55 Error: 3624, Severity: 20, State: 1.
2006-07-28 00:00:04.45 spid76 Sort read failure (bad page ID). pageid = (0x1:0x774dd), dbid = 2, file = G:\tempdb.mdf. Retrying.
These aren't the kind of errors you want to see. We were immediately concerned about data corruption. However, we ran DBCC CHECKDB on all of our databases and no corruption was reported.
Over the next few days the system was very erratic. This is a production accounting system for an IT services company, so our consultants could not consistently enter their T&E and our accounting department could not invoice consistently. SQL Server would come up for a couple of hours and then inexplicably crash. We could not correlate the crashes with any specific, repeatable action. We were approaching month end, the problem was getting more serious, and we didn't have a resolution.
The next step we took was to open a ticket with Microsoft. They were very helpful -- well worth the 245 dollars for the call. They ran diagnostics for several hours and came to the conclusion that the problem was with the disk subsystem. I had checked the disks and disk controllers several times using the Windows O/S tools. My background: I'm a SQL DBA with some programming and consulting experience (6 years in IT). I'm not a hardware guy (though after this fiasco I know a lot more than I did before).
Here are the details of our HP ProLiant setup:
C Drive (local Disk): Windows 2003 OS, SQL 2000 SP 4
D Drive (Mounted to NAS): Data Files
E, F Drives (local): Transaction Logs
G (local): TempDB
H (Mounted to NAS): Backups -- both full and trans log.
All of the logical drives are sitting on a RAID array on the ProLiant ML370.
Not the optimal setup, but pretty good. We aren't running a very large system (the DBs are 20 GB total).
Anyway, I finally called HP (we have the highest level of support) and they had me run their diagnostic tool. We got an error on POST (1798) which indicated that the accelerator card on the RAID controller was failing. We had HP out to replace the components and all of our problems magically disappeared.
What did I learn here? First, using the Windows O/S to troubleshoot hardware is a fool's errand. Windows is designed to run on top of a myriad of different kinds of hardware, so it can't possibly know what is going on in any depth. I also learned that you need enterprise-level hardware monitoring systems -- HP has a product, "Systems Insight Manager", which would have helped us find this problem before it got so serious.
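For anyone who wants an early warning from the SQL Server side as well, here is a minimal sketch (not what we actually ran) of a Python script that counts the three warning signs quoted above in error log lines. The function name `scan_errorlog` and the pattern names are made up for illustration, and the patterns are copied only from the messages in this post, so other versions may word them differently.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this post (assumption:
# other SQL Server versions may phrase these messages differently).
PATTERNS = {
    "assertion": re.compile(r"Error: 3624, Severity: 20"),
    "slow_io": re.compile(r"IO requests taking longer than 15 seconds"),
    "sort_read_failure": re.compile(r"Sort read failure \(bad page ID\)"),
}
# The slow IO message names the affected file in square brackets.
FILE_RE = re.compile(r"on file \[([^\]]+)\]")


def scan_errorlog(lines):
    """Count each warning sign and tally slow IO warnings per data file."""
    counts = Counter()
    slow_files = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
        if PATTERNS["slow_io"].search(line):
            match = FILE_RE.search(line)
            if match:
                slow_files[match.group(1)] += 1
    return counts, slow_files


# Three of the lines from this post, used as sample input.
SAMPLE = [
    r"2006-07-27 20:00:29.95 spid1 SQL Server has encountered 1 occurrence(s) "
    r"of IO requests taking longer than 15 seconds to complete on file "
    r"[D:\Microsoft SQL Server\MSSQL\Data\PRODE4SEKSL_Data.mdf] in database "
    r"[PRODE4SEKSL] (5).",
    "2006-07-27 20:27:07.14 spid55 Error: 3624, Severity: 20, State: 1.",
    "2006-07-28 00:00:04.45 spid76 Sort read failure (bad page ID). "
    "pageid = (0x1:0x774dd), dbid = 2, file = G:\\tempdb.mdf. Retrying.",
]

if __name__ == "__main__":
    counts, slow_files = scan_errorlog(SAMPLE)
    print(dict(counts))
    print(dict(slow_files))
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the file path and its text encoding depend on your installation. This only catches the problem once SQL Server starts complaining, so it complements the vendor's hardware monitoring rather than replacing it.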
July 31, 2006 at 12:02 am
I also faced the same problem, but on Windows Server 2003. I don't have SQL Server on it; I use it as a domain server.
It also reports the same problem, that the data is corrupt. I have experience with hardware, and I have replaced the hard disk. I think the problem is with Windows Server 2003: it is eating the hard disk sectors.