July 20, 2004 at 2:57 am
Hello
Yesterday I got a bit of a fright when I received the following Alert:
DATE/TIME: 19/07/2004 14:07:27
DESCRIPTION: Error: 823, Severity: 24, State: 2
I/O error 1450(error not found) detected during read at offset 0x000001d3b42000 in file 'R:\Data\dbName.mdf'.
COMMENT: (None)
JOB RUN: (None)
Looking this up on Google revealed a solution that involved restoring from the last good backup
We run our Live system on a 2 node Cluster, with the disk sub-system being a IBM Fast-T SAN, with a 2 node SVC Cluster. We checked the Fast-T SAN, which reported no Errors. We had a bit of fun getting into the SVC Console, but eventually we found nothing wrong there either. I received the Alert only twice, 1 minute apart. Now accessing the table appears fine. I ran a suite of DBCC checks 4 times, specifically:
--low impact check, looks for Torn Pages
DBCC CHECKDB('DBName') WITH PHYSICAL_ONLY
--Checks the allocation and structural integrity of all the objects in the specified database.
DBCC CHECKDB('DBName') WITH NO_INFOMSGS
--Checks for consistency in and between system tables in the specified database.
DBCC CHECKCATALOG('DBName') WITH NO_INFOMSGS
--Checks the consistency of disk space allocation structures for a specified database.
DBCC CHECKALLOC('DBName') WITH NO_INFOMSGS
All appears well, with No errors reported. Has anyone seen this before? Should I be worried about my SAN, or is it a one-off caused by something other than disk corruption (Network?) that I can stop fretting about?
Dave J
July 20, 2004 at 6:06 am
I don't know whether this is good advice or not, but I know one DB administrator who found that (on a fairly regular basis) he'd come in in the morning to find a whole bunch of databases marked as "suspect" and therefore unusable - torn pages being the problem. Of course, his email box was full of complaints from customers who couldn't use their web applications because the DB's were unavailable.
A lengthly problem to correct, but he never found a real error with the databases, so using Enterprise Monitor, he removed the option to check for torn pages.
He's been running the system in that state for over 2 years now with no problem.
Whether this is deemed a risky practice I really don't know.
July 20, 2004 at 6:09 am
I seem to remember that Steve or Andy had a problem where accessing the first 9000 rows of a table was OK, but anything after that produced errors.
The solution was to drop and recreate the clustered index which rebuilt the relationship between the pages.
I don't know if that helps.
July 20, 2004 at 7:01 am
Thanks for the feedback. I have done some further research & come up with:
From this article, http://support.microsoft.com/default.aspx?scid=kb;en-us;828339
Note You may or may not see errors from the DBCC CHECKDB statement on the database that is associated with the file in the error message. You can run the DBCC CHECKDB statement when you see an 823 error. If the DBCC CHECKDB statement does not report any errors, you probably have an intermittent system problem or a disk problem.
There is more info at http://support.microsoft.com/default.aspx?kbid=826433
This turns on an Extended Trace Flag to trap Unreported I/O Errors:
Trace flag 818 enables an in-memory ring buffer that is used for tracking the last 2,048 successful write operations that are performed by the computer running SQL Server, not including sort and workfile I/Os. When errors such as Error 605, 823, or 3448 occur, the incoming buffer's log sequence number (LSN) value is compared to the recent write list. If the LSN that is retrieved during the read operation is older than the one specified during the write operation, a new error message is logged in the SQL Server error log. Most SQL Server write operations occur as checkpoints or as lazy writes. A lazy write is a background task that uses asynchronous I/O. The implementation of the ring buffer is lightweight, thereby making the performance affect on the system negligible.
The following message indicates that SQL Server did not receive an error from the WriteFile API call or the ReadFile API call. However, when the LSN was reviewed, the value was not correct:
SQL Server has detected an unreported OS/hardware level read or write problem on Page (1:75007) of database 12
LSN returned (63361:16876:181), LSN expected (63361:16876:500)
Contact the hardware vendor and consider disabling caching mechanisms to correct the problem
At this point, either the read cache contains an older version of the page, or the data was not correctly written to the physical disk. In either case (a Lost Write or a Stale Read), SQL Server reports an external problem with the operating system, the driver, or the hardware layers.
Which is really scary. I'm awaiting authorisation to turn the Trace Flag on...
Dave
July 20, 2004 at 11:32 pm
Hi Dave,
You do have an interesting issue. I would certainly be concerned over the integrity of your data. The first thing is to preserve as much of your data as possible, I would recommend BCPing all your data out to text files and store these in a safe place.
Next you should have a look at the Page which contains the data where the write failed. Below you can use the following commands to display the page information. The offset you see is the number bytes from the start of the file, if you divide this by 8192 you will get the page number 957857.
dbcc traceon(3604) --instructs DBCC commands to display to the screen
dbcc page (dbname, 1,957857,3)
This should show the contents of the page, check to see if these are correct. Try the previous page and the next page, if these are correct, then you are in luck, your data may be intact.
You need to consult with your SAN vendor to see if they can disable the cache for your server and start investigating to see if there are any disk corruptions within the SAN on your spindles.
We do have a product that will read your backup file using the page information, so if you continue to get issues, we can assist you there, and this will certainly report back if their are any corruptions.
Regards
Douglas Chrystall
Imceda Software, Inc
July 21, 2004 at 1:05 am
DOuglas
thanks for your response, I have just ran the DBCC commands you sugggest. I must be honest & say I haven't seen that one before. Is it undocumented?
The output is I'm afraid not very helpful to my eyes. It outputs a header (below) and then what appear to be records from one of my tables, which at first glance look OK. What would I look for to indicate an Error? Is it m_tornBits? is 21 good or bad?
PAGE: (1:957857)
----------------
BUFFER:
-------
BUF @0x01E0BA40
---------------
bpage = 0x60EE2000 bhash = 0x00000000 bpageno = (1:957857)
bdbid = 7 breferences = 0 bstat = 0x9
bspin = 0 bnext = 0x00000000
PAGE HEADER:
------------
Page @0x60EE2000
----------------
m_pageId = (1:957857) m_headerVersion = 1 m_type = 1
m_typeFlagBits = 0x0 m_level = 0 m_flagBits = 0x4
m_objId = 345820344 m_indexId = 0 m_prevPage = (1:957856)
m_nextPage = (1:957858) pminlen = 88 m_slotCnt = 86
m_freeCnt = 12 m_freeData = 8008 m_reservedCnt = 0
m_lsn = (279975:3787:6) m_xactReserved = 0 m_xdesId = (0:0)
m_ghostRecCnt = 0 m_tornBits = 21
Allocation Status
-----------------
GAM (1:511232) = NOT ALLOCATED
SGAM (1:511233) = NOT ALLOCATED
PFS (1:954384) = 0x0 0_PCT_FULL
DIFF (1:511238) = NOT CHANGED
ML (1:511239) = NOT MIN_LOGGED
Slot 0 Offset 0x60
------------------
Record Type = PRIMARY_RECORD
Record Attributes = NULL_BITMAP
60EE2060: 00580010 012db245 00093b2b 000005ea ..X.E.-.+;......
60EE2070: 00000000 01232340 00009467 0000003b ....@##.g...;...
60EE2080: 0000002d 00000000 00000000 00005ad9 -............Z..
60EE2090: 000f42e2 00000000 01713532 0000946f .B......25q.o...
60EE20A0: 00000000 00000000 00000000 00000000 ................
60EE20B0: 00000000 00000000 e6880010 ............
ProvisionHistoryID = 19771973
ProvisionID = 604971
ReturnStatus = 1514
ReturnCode = [NULL]
ResponseDate = Jan 7 2004 5:40PM
RequestTypeID = 59
CallTypeID = 45
TargetDate = [NULL]
FileID = 23257
Reference = [NULL]
OrderNo = [NULL]
CreatedBy = 1000162
DateCreated = Jan 15 2004 10:24PM
UpdatedBy = [NULL]
DateUpdated = [NULL]
DateArchived = [NULL]
[snip]
Dave J
July 21, 2004 at 1:19 am
Hi Dave,
There can be many torn bits in a database, this does not indicate a problem. You need to look at the data and see if any of the values are incorrect. It could be a case of no corruption but your data was just not written. I know thats 86 rows, but it is going to be the only way to tell. Try looking at the previous page and the next page, see if they display correctly. The tornbit flag is just used with an XOR to determine where the stolen bits have gone, so the page can be reconstructed.
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply