Data Consistency Error

  • We use dbspi along with hp openview(OVSD/OVSC) and we do get timely notifications of any issues(which is good enough), however for this one, we didnot get any ticket as the system event log didnt mention anything about sql server. Ideally SAN/WITNEL teams should have recieved an alert for this. Since the customer we're supporting have several hundreds of servers, i dont think it's feasible to enter all servers and all databases and check for them.

    Can you suggest something for such scenario (some scripts/not tools). Do you suggest we run dbcc checkdb every week so that such issues are recorded in the error log? We'll need to look for feasibility of this on each database basis as some of them are 24X7 and performance degradation because of running dbcc checkdb may not be acceptable in some cases.



    Pradeep Singh

  • ps (7/1/2009)


    Do you suggest we run dbcc checkdb every week so that such issues are recorded in the error log?

    Yes, absolutely, and someone needs to review the results after each run. If you don't you're highly unlikely to be able to catch corruption early enough to do anything useful about it.

    See the blog post I referenced.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks Gail for providing deep insight into the issue.

    I did go through the article you mentioned and another one

    http://www.sqlskills.com/BLOGS/PAUL/post/CHECKDB-From-Every-Angle-Consistency-Checking-Options-for-a-VLDB.aspx%5B/url%5D

    which did provide very useful information on how to reduce checkdb resource consumption.



    Pradeep Singh

  • I was just going through some articles and came to know that i could have recovered only the affected pages if i had a clean backup instead of restoring the entire backup. The only condition i can think of is the affected pages must not have been changed after the full backup from which i'm trying to restore. Apart from this is there anything else i need to think of while doing page level restore? If it happened due to IO failure, should i assume that SQL Server was trying to write some data to these pages(In which case i would not be able to restore updated data?) Can this happen while reading too?

    Also, how can i find out if all those pages belonged to the same extent or different extents?

    Ref(Page restoration): http://www.mssqltips.com/tip.asp?tip=1645



    Pradeep Singh

  • Ahh, I forgot that the corrupted database was in 2000:hehe: and page level restores is possible only in 2005 and later versions? My questions in the previous posts still remain if it were a 2005 database.



    Pradeep Singh

  • ps (7/5/2009)


    The only condition i can think of is the affected pages must not have been changed after the full backup from which i'm trying to restore.

    That's not a requirement for page-level restores. The requirement is that the DB is in full recovery and log backups exist from the full backup used up to the most recent point. Books Online should have a section on page-level restores.

    If it happened due to IO failure, should i assume that SQL Server was trying to write some data to these pages

    No. All the errors you have in that CheckDB output are logical errors, so the page can be read from disk but the structure of the page is not correct.

    Also, how can i find out if all those pages belonged to the same extent or different extents?

    If the page numbers are contiguous and the lowest pageID is divisible by 8, then the lowest and subsequent 7 pages are part of an extent.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks Gail for the info. I'm sorry but how IO can crash a page is still not very much clear. Few examples or something you've experienced in past or links would be just fine.



    Pradeep Singh

  • Sorry, not sure what you're asking.

    If it's the cause of these, problems with the IO subsystem. I cannot be more specific as any form of problem with any portion of the IO subsystem can cause this.

    In your case it looks (from the errors) as though an entire 64kb section of the disk has been overwritten with 00000

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • IO does not crash a page but an incomplete IO does .

    Actually , an OS page is divided into 512 byte sectors .

    Every sector has to be written completely .If due to some or the other OS /Hardware issue (as pointed by Gail),

    The sectors are not written completely or the information on it does not have any meaning (like you find 0000 <-- called as zeroing)then the page is a torn page (logically inconsistent).

    This is what Gail is trying to say i supose .

    For extents and pages part : the number of every extent and its first page is same .

    So if someone says that your extent # 787979 is corrupt then your first page of that extent is 787979 and then you can just add +1 till the next 7 pages .

    Regards

    Abhay Chaudhary
    Sr.DBA (MCITP/MCTS :SQL Server 2005/2008 ,OCP 9i)

  • Thanks Gail and Abhay. It did clarify some of the doubts. 🙂



    Pradeep Singh

  • hi_abhay78 (7/6/2009)


    The sectors are not written completely or the information on it does not have any meaning (like you find 0000 <-- called as zeroing)then the page is a torn page (logically inconsistent).

    This is what Gail is trying to say i supose .

    It's not what I'm saying

    A torn page is one form of logical IO error. That happens when SQL issues a write of a page and the page, for whatever reason, is only partially written. So part of the page on disk is the new one and part is the old one. That's by no means the only form of logical IO error though.

    What it looks like here is that an entire extent was zeroed out. That cannot be done by a partial write. Something in the IO subsystem somehow wrote zeros over 64kb of the disk.

    Buggy driver, misbehaving antivirus, faulty SAN/RAID controller can all cause logical corruption.

    So if someone says that your extent # 787979 is corrupt then your first page of that extent is 787979 and then you can just add +1 till the next 7 pages .

    Extents always start with a page number that's divisible by 8, because the first extent in the file runs from page 0 to page 7. The second extent runs from page 8 to 15, etc. So the extent that contains page number 787979 starts at page 787976 (because that is divisible by 8) and ends with page number 787983

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 11 posts - 16 through 25 (of 25 total)

You must be logged in to reply to this topic. Login to reply