Data corruption aftermath -- future prevention

  • dan-572483 (4/15/2016)


    Excuse me? Are you saying that if a database is mirrored and disk-level corruption occurs, SQL Server will just pull the data from the mirror?

    If it can, yes. Automatic page repair was one reason I decided to go with SQL Server 2008 at a previous employer. We had several reasons for choosing database mirroring for fault tolerance in the server room, and that was one of them.

  • Is that for routine OLTP operations, or just during DBCC CHECKDB or repairs?

  • dan-572483 (4/15/2016)


    Excuse me? Are you saying that if a database is mirrored and disk-level corruption occurs, SQL Server will just pull the data from the mirror?

    And I believe it can even use the uncorrupted mirror data to fix the bad data. RAID 1/10 drives can do that as well.


  • Have you identified the corruption? Has the server team looked at the drives or SAN?

    Both our internal SAN team and the vendor have looked at it and assured me that nothing went wrong with the storage subsystem. It is just a wee bit suspicious that it happened within a couple of weeks of moving it to new storage, but regardless of that, I cannot systematically prove that it was an IO subsystem failure.

    I have never heard of a single real-life case where a SAN team, a vendor, or a maintenance contractor found an issue with the IO subsystem.

    That is what I would have thought also, but... just a week before, we had a dramatic IO freeze occur (massive latency spikes for nearly half an hour). We went through the routine, filed a complaint, and, somewhat to my amazement, the SAN vendor found a problem, recommended a solution, and the next day the SAN guys were laying out a plan to implement it.

    So, now you have an anecdote for at least one time in history that a SAN vendor didn't say the sysadmins were just imagining things.

  • jeff.mason (4/15/16)

    https://msdn.microsoft.com/en-us/library/bb677167%28v=sql.120%29.aspx

    I was starting to get excited when I read this, because it almost exactly describes the problem I was facing...

    Until I got to the part that says

    Automatic page repair cannot repair the following control page types:

    * File header page (page ID 0).

    Which is what happened to us.

    Still, useful information, and thanks.
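For anyone trying to work out whether a damaged page is one that automatic page repair will skip, here is a rough Python sketch. The thread only mentions the file header page (page ID 0); the other rules below are assumptions based on the usual SQL Server data file layout: page 9 of file 1 is the database boot page, PFS pages sit at page 1 and every 8,088 pages after that, and GAM/SGAM pages sit at pages 2 and 3 and repeat every 511,232 pages. The function name `control_page_type` is made up for illustration; treat this as a quick sanity check, not as a statement of what the engine does internally.

```python
from typing import Optional

# Assumed layout constants (not stated in the thread).
PFS_INTERVAL = 8088      # pages between PFS pages
GAM_INTERVAL = 511232    # pages between GAM (and SGAM) pages


def control_page_type(file_id: int, page_id: int) -> Optional[str]:
    """Return the control page type that automatic page repair cannot fix,
    or None if the page is not one of those types under the assumed layout."""
    if page_id == 0:
        return "file header"
    if file_id == 1 and page_id == 9:
        return "boot page"
    if page_id == 1 or page_id % PFS_INTERVAL == 0:
        return "PFS"
    if page_id == 2 or page_id % GAM_INTERVAL == 0:
        return "GAM"
    if page_id == 3 or page_id % GAM_INTERVAL == 1:
        return "SGAM"
    return None


if __name__ == "__main__":
    # (file_id, page_id) pairs, e.g. taken from a corruption error message.
    for file_id, page_id in [(1, 0), (1, 9), (1, 4242)]:
        kind = control_page_type(file_id, page_id)
        verdict = f"not repairable ({kind})" if kind else "candidate for automatic repair"
        print(f"page ({file_id}:{page_id}): {verdict}")
```

A page that comes back as `None` is only a candidate: the mirroring partner still has to be reachable and hold a good copy of that page.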

  • Sergiy (4/14/2016)


    nhansen pcc (4/11/2016)


    Have you identified the corruption? Has the server team looked at the drives or SAN?

    Both our internal SAN team and the vendor have looked at it and assured me that nothing went wrong with the storage subsystem. It is just a wee bit suspicious that it happened within a couple of weeks of moving it to new storage, but regardless of that, I cannot systematically prove that it was an IO subsystem failure.

    I have never heard of a single real-life case where a SAN team, a vendor, or a maintenance contractor found an issue with the IO subsystem.

    Even when an enclosure has been overheating for a week, or there has been a yellow light on a panel for a while, or system throughput has dropped by, say, a factor of 5, it's still perfectly fine, according to them.

    So, do not trust their words too much.

    Must live in Utopia, congratulations. The rest of us live in the real world where even SANs fail and the admins have to resolve problems.

  • nhansen pcc (4/15/2016)


    jeff.mason (4/15/16)

    https://msdn.microsoft.com/en-us/library/bb677167%28v=sql.120%29.aspx

    I was starting to get excited when I read this, because it almost exactly describes the problem I was facing...

    Until I got to the part that says

    Automatic page repair cannot repair the following control page types:

    * File header page (page ID 0).

    Which is what happened to us.

    Still, useful information, and thanks.

    Which is why I said, "If it can."

  • Lynn Pettis (4/15/2016)


    Sergiy (4/14/2016)


    nhansen pcc (4/11/2016)


    Have you identified the corruption? Has the server team looked at the drives or SAN?

    Both our internal SAN team and the vendor have looked at it and assured me that nothing went wrong with the storage subsystem. It is just a wee bit suspicious that it happened within a couple of weeks of moving it to new storage, but regardless of that, I cannot systematically prove that it was an IO subsystem failure.

    I have never heard of a single real-life case where a SAN team, a vendor, or a maintenance contractor found an issue with the IO subsystem.

    Even when an enclosure has been overheating for a week, or there has been a yellow light on a panel for a while, or system throughput has dropped by, say, a factor of 5, it's still perfectly fine, according to them.

    So, do not trust their words too much.

    Must live in Utopia, congratulations. The rest of us live in the real world where even SANs fail and the admins have to resolve problems.

    Lynn, that was sarcasm. Note the last line, "do not trust their words". He's saying that the SAN has problems regularly.
