Removing a corrupted filegroup

  • Hi!

    Is there a way to remove a corrupted filegroup?

    I've tried dropping all the tables and their rows (which I believe was a bad idea, but don't worry, it's not production) and then removing the filegroup, but it throws an I/O error. After dropping all the tables there was nothing left to run REPAIR_ALLOW_DATA_LOSS against, since that works at the table level and not at the filegroup level. In another scenario I might try truncating all the data and running REPAIR_ALLOW_DATA_LOSS, to see whether it just deallocates everything and lets me remove the filegroup. The removal itself was attempted roughly as sketched below.
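
    A minimal sketch of that removal attempt (the database, file, and filegroup names here are placeholders). A file can only be removed once it is empty, and the filegroup can only be removed once it has no files left:

    -- Sketch only; [DB_Clone], [FG_DataFile1] and [FGName] are placeholder names
    ALTER DATABASE [DB_Clone] REMOVE FILE [FG_DataFile1];
    ALTER DATABASE [DB_Clone] REMOVE FILEGROUP [FGName];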

    I've tried running DBCC CHECKTABLE with REPAIR_ALLOW_DATA_LOSS. Most of the tables were fixed, but one of them is taking hours; oddly, the largest one finished faster, in about an hour, while a much smaller one has been running for 3+ hours. On the largest one I dropped all the indexes, even the clustered index, and it went fine.
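
    The per-table repairs went roughly like this (a sketch; the table name is a placeholder, and repair options require the database to be in single-user mode):

    -- Sketch only; dbo.BigTable and [DB_Clone] are placeholder names
    ALTER DATABASE [DB_Clone] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DBCC CHECKTABLE ('dbo.BigTable', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS, NO_INFOMSGS;
    ALTER DATABASE [DB_Clone] SET MULTI_USER;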

    I tried dropping all the indexes on that one too, but I couldn't drop the clustered index because the corruption wouldn't let me.

    We've recovered most of the data, if not all of it, but we want to get rid of this corrupted filegroup.

    I'll keep trying to find a way; whatever I try, I'll post it here.

    best regards!

  • The test we ran today was the following, since the REPAIR_ALLOW_DATA_LOSS on one of the tables was taking far too long.

    We dropped all the indexes (clustered and nonclustered), keys, and constraints on all the tables.

    We truncated the tables and ran DBCC CHECKTABLE on all of them: no errors, all of them passed.
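
    One way to script that check across every user table (a generic sketch, not our exact script):

    -- Sketch: run a plain DBCC CHECKTABLE against every user table in the current database
    DECLARE @tbl nvarchar(520), @cmd nvarchar(700);
    DECLARE tbl_cur CURSOR FOR
        SELECT QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
        FROM sys.tables AS t
        JOIN sys.schemas AS s ON s.schema_id = t.schema_id;
    OPEN tbl_cur;
    FETCH NEXT FROM tbl_cur INTO @tbl;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @cmd = N'DBCC CHECKTABLE (''' + @tbl + N''') WITH ALL_ERRORMSGS, NO_INFOMSGS;';
        EXEC (@cmd);
        FETCH NEXT FROM tbl_cur INTO @tbl;
    END;
    CLOSE tbl_cur;
    DEALLOCATE tbl_cur;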

    When we checked the filegroup with DBCC CHECKFILEGROUP, we found tons of inconsistency errors. (Why isn't REPAIR_ALLOW_DATA_LOSS for DBCC CHECKFILEGROUP in BOL? We tried it, but it was taking too long.) We also tried shrinking the file to a small size, to see whether CHECKFILEGROUP with REPAIR_ALLOW_DATA_LOSS would then run faster, but the shrink hung; we discarded that approach since the shrink would take forever and most likely wouldn't work anyway.
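
    For reference, a plain check of the filegroup goes roughly like this (sketch; 'FGName' is a placeholder for the corrupted filegroup's name, and CHECKFILEGROUP runs against the current database):

    -- Sketch only; database and filegroup names are placeholders
    USE [DB_Clone];
    DBCC CHECKFILEGROUP (N'FGName') WITH ALL_ERRORMSGS, NO_INFOMSGS;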

    So yeah, I think the issue isn't only at the table level but also at the filegroup level.

    These are some of the errors that DBCC CHECKFILEGROUP threw.

    THE BEGINNING>

    Msg 2576, Level 16, State 1, Line 138
    The Index Allocation Map (IAM) page (34:10368772) is pointed to by the previous pointer of IAM page (34:5527946) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188164096 (type Unknown), but it was not detected in the scan.
    Msg 8905, Level 16, State 1, Line 138
    Extent (34:10224808) in database ID 12 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
    Msg 8905, Level 16, State 1, Line 138
    Extent (34:10224824) in database ID 12 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
    Msg 8905, Level 16, State 1, Line 138
    Extent (34:10224840) in database ID 12 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
    Msg 8905, Level 16, State 1, Line 138
    Extent (34:10225144) in database ID 12 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
    ...and it keeps going; the full list is WAY longer than this.

    THE END>
    CHECKFILEGROUP found 7570 allocation errors and 0 consistency errors not associated with any single object.
    Msg 8909, Level 16, State 1, Line 138
    Table error: Object ID 99, index ID 0, partition ID 0, alloc unit ID 6488064 (type System allocation data), page ID (34:10368768) contains an incorrect page ID in its page header. The PageId in the page header = (34:6890976).
    Msg 8906, Level 16, State 1, Line 138
    Page (34:10368768) in database ID 12 is allocated in the SGAM (34:10224641) and PFS (34:10360728), but was not allocated in any IAM. PFS flags 'IAM_PG MIXED_EXT ALLOCATED 0_PCT_FULL'.
    CHECKFILEGROUP found 2 allocation errors and 0 consistency errors in table '(Object ID 99)' (object ID 99).
    CHECKFILEGROUP found 7572 allocation errors and 0 consistency errors in database 'DB_CLONE'.
    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKFILEGROUP (FG).

    The next scenario would be truncating everything and running DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS.
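
    That run would be along these lines (a sketch matching the command shown in the log output below; REPAIR_ALLOW_DATA_LOSS can discard whatever it cannot repair):

    -- Sketch only; repairs require the database in SINGLE_USER mode
    ALTER DATABASE [DB_Clone] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DBCC CHECKDB (N'DB_Clone', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS, NO_INFOMSGS;
    ALTER DATABASE [DB_Clone] SET MULTI_USER;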

    If there's anything new, I'll let you know.

    Best regards!

  • Running DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS took 3-4 hours on a VLDB. Not even mad, that's amazing.

    This was the output.

    SQL Server log:

    Message

    DBCC CHECKDB (DB, repair_allow_data_loss) WITH all_errormsgs, no_infomsgs executed by domain\user found 8364 errors and repaired 7572 errors. Elapsed time: 3 hours 35 minutes 40 seconds.

  • So were you able to remove the filegroup after that?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff, thanks for replying!

    There were still errors, and the new CHECKDB WITH REPAIR_ALLOW_DATA_LOSS again fixed a lot of them. I'll add the output (attached) and the SQL message.

    After it completed we ran DBCC CHECKFILEGROUP to see whether the filegroup was still corrupt, and it was. We believe the CHECKDB that is currently running might fix those errors; once it finishes, I'll re-run DBCC CHECKFILEGROUP to see whether it still finds any corruption.

    SQL Message:

    Date 9/18/2019 11:40:48 AM

    Log SQL Server (Current - 9/18/2019 12:01:00 PM)

    Source spid67

    Message

    DBCC CHECKDB (DB_Clone, repair_allow_data_loss) WITH all_errormsgs, no_infomsgs executed by domain\user found 19 errors and repaired 16 errors. Elapsed time: 3 hours 27 minutes 31 seconds.

    With this third DBCC CHECKDB we'll be at almost 14 hours of downtime.

  • The last one completed and this was the SQL Server message:

    DBCC CHECKDB WITH REPAIR_ALLOW_DATA_LOSS

    Date 9/18/2019 4:02:02 PM

    Log SQL Server (Current - 9/18/2019 4:44:00 PM)

    Source spid67

    Message

    DBCC CHECKDB (DB_CLONE, repair_allow_data_loss) WITH all_errormsgs, no_infomsgs executed by domain\user found 5 errors and repaired 5 errors. Elapsed time: 3 hours 59 minutes 5 seconds.

    I think it's supposed to fix the errors it found inside the database, but when I ran DBCC CHECKFILEGROUP this was the output:

    The good thing about running DBCC CHECKDB WITH REPAIR_ALLOW_DATA_LOSS was that, thanks to the fixes, we could shrink the data file down to about 1 GB, when it had been larger than 130 GB. We did this so that when we tried DBCC CHECKFILEGROUP WITH REPAIR_ALLOW_DATA_LOSS it would not take too long.
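
    The shrink itself is a one-liner (sketch; the logical file name is a placeholder and the target size is in MB):

    -- Sketch only; shrink the filegroup's data file down to roughly 1 GB
    DBCC SHRINKFILE (N'FG_DataFile1', 1024);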

    DBCC CHECKFILEGROUP

    Date 9/18/2019 4:43:56 PM

    Log SQL Server (Current - 9/18/2019 4:44:00 PM)

    Source spid67

    Message

    DBCC CHECKFILEGROUP (FGName) WITH all_errormsgs, no_infomsgs executed by domain\user found 5 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 8 seconds.

    DBCC CHECKFILEGROUP WITH REPAIR_ALLOW_DATA_LOSS

    SQL Server log output:

    Date      9/19/2019 7:48:31 AM

    Log         SQL Server (Current - 9/19/2019 7:48:00 AM)

    Source  spid67

    Message

    DBCC CHECKFILEGROUP (FGName, repair_allow_data_loss) WITH all_errormsgs, no_infomsgs executed by domain\user found 10 errors and repaired 5 errors. Elapsed time: 0 hours 0 minutes 2 seconds.

    The weird thing about trying to fix the filegroup with DBCC CHECKFILEGROUP and REPAIR_ALLOW_DATA_LOSS is that it ALWAYS finds the SAME errors, on the same PAGES, in the same filegroup: it always finds 10 errors and repairs 5. Even when I re-run it, it finds the same 10 errors on the same pages and fixes 5 of them. It's like an infinite loop of the same errors and the same repairs.

    SSMS output:

    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368768) with latch type SH. Invalid page ID. failed.
    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368772) with latch type SH. Invalid page ID. failed.
    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368773) with latch type SH. Invalid page ID. failed.
    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368774) with latch type SH. Invalid page ID. failed.
    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368775) with latch type SH. Invalid page ID. failed.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368768) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596187836416 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368772) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188164096 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368773) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188229632 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368774) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188295168 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368775) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188360704 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    CHECKFILEGROUP found 5 allocation errors and 5 consistency errors not associated with any single object.
    CHECKFILEGROUP fixed 5 allocation errors and 0 consistency errors not associated with any single object.
    CHECKFILEGROUP found 5 allocation errors and 5 consistency errors in database 'DB_Clone'.
    CHECKFILEGROUP fixed 5 allocation errors and 0 consistency errors in database 'DB_Clone'.
    is the minimum repair level for the errors found by DBCC CHECKFILEGROUP (FGNAme, repair_allow_data_loss).

    I don't think there's a way to salvage the filegroup unless you are really, really good with internals and can modify data files through a hex editor, or write pages directly into the data files to tweak them (something I'm obviously not capable of doing, since it requires really good knowledge of the storage engine).

    Or maybe, if you have your backups, you could restore to the same point in time and swap the corrupted file for a non-corrupted one (I have not tested this, but it might work); a rough sketch of the idea follows.
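
    Untested sketch of that swap via a file-level restore (the database name, logical file name, and paths are all placeholders): restore just the damaged file from the last clean backup, then roll it forward with the log backups:

    -- Untested sketch; [DB_Clone], FG_DataFile1, and the backup paths are placeholders
    RESTORE DATABASE [DB_Clone]
        FILE = N'FG_DataFile1'
        FROM DISK = N'X:\Backups\DB_Clone_full.bak'
        WITH NORECOVERY;
    RESTORE LOG [DB_Clone]
        FROM DISK = N'X:\Backups\DB_Clone_log_01.trn'
        WITH NORECOVERY;
    -- ...apply the remaining log backups in sequence...
    RESTORE DATABASE [DB_Clone] WITH RECOVERY;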

    This amounts to 17+ hours of downtime.

  • Better get a thorough review of hardware and start ensuring backups are up to date. This is almost always hardware somehow causing a corrupt write in the filesystem.

    Double-check all firmware versions. I'd also open a case with MS and make sure you discover the root cause, or you may be in for the same thing again.

  • Thanks for your replies, guys.

    We could not remove the corrupted filegroup.

    Mr. Steve, we found the issue and fixed it ASAP. We have open cases with our different partners and are working out how to tackle it. We are restoring now, thanks 😀

    I sent an email to Mr. Paul Randal; he checked the post and replied saying that it cannot be fixed, and that the best thing to do is restore or move as much data as possible to a new database. (Thank you so much for replying to the email, Mr. Randal.)

    We dropped the tests and restored.

    Best regards,

  • Good luck and hope you don't lose much data.

  • Thanks!

    We were able to restore, and even though the database was corrupted, we used the standard restore procedure.

    Full > differential > logs

    At the time we hit the corruption, our full and differential backups weren't corrupted, and our transaction log backups kept coming out clean despite the database being corrupted.

    We could not take new full/differential backups since they would carry the corruption with them, so we didn't make any; we just kept taking log backups, since in this case the corruption did not get into the log. We have restored a couple of times in another environment and executed DBCC CHECKDB, and it completed successfully without corruption. The restore chain looked roughly like the sketch below.
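
    In script form, that full > differential > log chain is roughly (database name and paths are placeholders):

    -- Sketch of the restore chain; names and paths are placeholders
    RESTORE DATABASE [DB] FROM DISK = N'X:\Backups\DB_full.bak' WITH NORECOVERY, REPLACE;
    RESTORE DATABASE [DB] FROM DISK = N'X:\Backups\DB_diff.bak' WITH NORECOVERY;
    RESTORE LOG [DB] FROM DISK = N'X:\Backups\DB_log_01.trn' WITH NORECOVERY;
    -- ...apply the remaining log backups in order...
    RESTORE DATABASE [DB] WITH RECOVERY;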

    We did lose the data on the corrupted pages, but I think our team has been working on rebuilding the lost data.

    Thanks again! 🙂

  • What you need to do now is figure out why the data became corrupted. Do you have a disk or controller going bad? Was there a sudden power outage that took the box down in a sudden and unceremonious fashion?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I read each and every line of this conversation. Thanks, Alejandro, for posting this issue so clearly and keeping it updated. I agree with @jeff and @Steve. I found this doc helpful: https://docs.microsoft.com/en-us/sql/relational-databases/backup-restore/remove-defunct-filegroups-sql-server?redirectedfrom=MSDN&view=sql-server-2017

    SQL Database Recovery Expert 🙂

  • Yes, Jeff, I think the SAN guys found some issues and have been working with the provider. There were no outages; thank god it was only one filegroup.

    prettsons, it looks like defunct filegroups are part of a piecemeal restore. Since we have not taken backups of the corruption and have been working from log backups, we have not tried that option, but I think it would make a good experiment. Even after these tests, though, I think a piecemeal restore would run into issues; I believe the filegroup has to be in a consistent state to be released. A rough sketch of that route is below.
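
    For what it's worth, an untested sketch of the piecemeal route from the linked doc (all names and paths are placeholders): restore only the primary and healthy filegroups, never restore the damaged one so it becomes defunct, and then remove it:

    -- Untested sketch; [DB], FG_Good, FG_DataFile1, FGName, and the paths are placeholders
    RESTORE DATABASE [DB]
        FILEGROUP = 'PRIMARY', FILEGROUP = 'FG_Good'
        FROM DISK = N'X:\Backups\DB_full.bak'
        WITH PARTIAL, NORECOVERY;
    RESTORE LOG [DB] FROM DISK = N'X:\Backups\DB_log_01.trn' WITH NORECOVERY;
    RESTORE DATABASE [DB] WITH RECOVERY;
    -- Once the skipped filegroup is defunct, its files and then the filegroup can be removed
    ALTER DATABASE [DB] REMOVE FILE [FG_DataFile1];
    ALTER DATABASE [DB] REMOVE FILEGROUP [FGName];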

    If it comes up we could try it, but the message from management has been not to take backups of the corruption.

    best regards,
