Removing a corrupted filegroup

  • Hi!

    Is there a way to remove a corrupted filegroup?

    I've tried dropping all the tables and their rows (which I believe was a bad idea, but don't worry, it's not production) and then removing the filegroup, but it throws an I/O error. After dropping all the tables there was nothing left to run REPAIR_ALLOW_DATA_LOSS against, since that works at the table level and not at the filegroup level. In another scenario I might try truncating all the data and running REPAIR_ALLOW_DATA_LOSS, to see whether it just deallocates everything and lets me remove the filegroup. The removal itself was attempted roughly as sketched below.
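
    A minimal sketch of that removal attempt (the database, file, and filegroup names here are placeholders). A file can only be removed once it is empty, and the filegroup can only be removed once it has no files left:

    -- Sketch only; [DB_Clone], [FG_DataFile1] and [FGName] are placeholder names
    ALTER DATABASE [DB_Clone] REMOVE FILE [FG_DataFile1];
    ALTER DATABASE [DB_Clone] REMOVE FILEGROUP [FGName];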

    I've tried running DBCC CHECKTABLE with REPAIR_ALLOW_DATA_LOSS. Most of the tables were fixed, but one of them is taking hours; oddly, the largest one finished faster, in about an hour, while a much smaller one has been running for 3+ hours. On the largest one I dropped all the indexes, even the clustered index, and it went fine.
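
    The per-table repairs went roughly like this (a sketch; the table name is a placeholder, and repair options require the database to be in single-user mode):

    -- Sketch only; dbo.BigTable and [DB_Clone] are placeholder names
    ALTER DATABASE [DB_Clone] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DBCC CHECKTABLE ('dbo.BigTable', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS, NO_INFOMSGS;
    ALTER DATABASE [DB_Clone] SET MULTI_USER;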

    I tried dropping all the indexes on that one too, but I couldn't drop the clustered index because the corruption wouldn't let me.

    We've recovered most of the data, if not all of it, but we want to get rid of this corrupted filegroup.

    I'll keep trying to find a way; whatever I try, I'll post it here.

    best regards!

  • The test we ran today was the following, since the REPAIR_ALLOW_DATA_LOSS on one of the tables was taking far too long.

    We dropped all the indexes (clustered and nonclustered), keys, and constraints on all the tables.

    We truncated the tables and ran DBCC CHECKTABLE on all of them: no errors, all of them passed.
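
    One way to script that check across every user table (a generic sketch, not our exact script):

    -- Sketch: run a plain DBCC CHECKTABLE against every user table in the current database
    DECLARE @tbl nvarchar(520), @cmd nvarchar(700);
    DECLARE tbl_cur CURSOR FOR
        SELECT QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
        FROM sys.tables AS t
        JOIN sys.schemas AS s ON s.schema_id = t.schema_id;
    OPEN tbl_cur;
    FETCH NEXT FROM tbl_cur INTO @tbl;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @cmd = N'DBCC CHECKTABLE (''' + @tbl + N''') WITH ALL_ERRORMSGS, NO_INFOMSGS;';
        EXEC (@cmd);
        FETCH NEXT FROM tbl_cur INTO @tbl;
    END;
    CLOSE tbl_cur;
    DEALLOCATE tbl_cur;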

    When we checked the filegroup with DBCC CHECKFILEGROUP, we found tons of inconsistency errors. (Why isn't REPAIR_ALLOW_DATA_LOSS for DBCC CHECKFILEGROUP in BOL? We tried it, but it was taking too long.) We also tried shrinking the file to a small size, to see whether CHECKFILEGROUP with REPAIR_ALLOW_DATA_LOSS would then run faster, but the shrink hung; we discarded that approach since the shrink would take forever and most likely wouldn't work anyway.
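
    For reference, a plain check of the filegroup goes roughly like this (sketch; 'FGName' is a placeholder for the corrupted filegroup's name, and CHECKFILEGROUP runs against the current database):

    -- Sketch only; database and filegroup names are placeholders
    USE [DB_Clone];
    DBCC CHECKFILEGROUP (N'FGName') WITH ALL_ERRORMSGS, NO_INFOMSGS;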

    So yeah, I think the issue isn't only at the table level but also at the filegroup level.

    These are some of the errors that DBCC CHECKFILEGROUP threw.

    THE BEGINNING>

    Msg 2576, Level 16, State 1, Line 138
    The Index Allocation Map (IAM) page (34:10368772) is pointed to by the previous pointer of IAM page (34:5527946) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188164096 (type Unknown), but it was not detected in the scan.
    Msg 8905, Level 16, State 1, Line 138
    Extent (34:10224808) in database ID 12 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
    Msg 8905, Level 16, State 1, Line 138
    Extent (34:10224824) in database ID 12 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
    Msg 8905, Level 16, State 1, Line 138
    Extent (34:10224840) in database ID 12 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
    Msg 8905, Level 16, State 1, Line 138
    Extent (34:10225144) in database ID 12 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
    ...and it keeps going; the full list is WAY longer than this.

    THE END>
    CHECKFILEGROUP found 7570 allocation errors and 0 consistency errors not associated with any single object.
    Msg 8909, Level 16, State 1, Line 138
    Table error: Object ID 99, index ID 0, partition ID 0, alloc unit ID 6488064 (type System allocation data), page ID (34:10368768) contains an incorrect page ID in its page header. The PageId in the page header = (34:6890976).
    Msg 8906, Level 16, State 1, Line 138
    Page (34:10368768) in database ID 12 is allocated in the SGAM (34:10224641) and PFS (34:10360728), but was not allocated in any IAM. PFS flags 'IAM_PG MIXED_EXT ALLOCATED 0_PCT_FULL'.
    CHECKFILEGROUP found 2 allocation errors and 0 consistency errors in table '(Object ID 99)' (object ID 99).
    CHECKFILEGROUP found 7572 allocation errors and 0 consistency errors in database 'DB_CLONE'.
    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKFILEGROUP (FG).

    The next scenario would be truncating everything and running DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS.
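
    That run would be along these lines (a sketch matching the command shown in the log output below; REPAIR_ALLOW_DATA_LOSS can discard whatever it cannot repair):

    -- Sketch only; repairs require the database in SINGLE_USER mode
    ALTER DATABASE [DB_Clone] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DBCC CHECKDB (N'DB_Clone', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS, NO_INFOMSGS;
    ALTER DATABASE [DB_Clone] SET MULTI_USER;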

    If there's anything new, I'll let you know.

    Best regards!

  • Running DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS took 3-4 hours on a VLDB. Not even mad, that's amazing.

    This was the output.

    SQL Server log:

    Message

    DBCC CHECKDB (DB, repair_allow_data_loss) WITH all_errormsgs, no_infomsgs executed by domain\user found 8364 errors and repaired 7572 errors. Elapsed time: 3 hours 35 minutes 40 seconds.

  • So were you able to remove the filegroup after that?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff, thanks for replying!

    There were still errors, and the new CHECKDB WITH REPAIR_ALLOW_DATA_LOSS again fixed a lot of them. I'll add the output (attached) and the SQL message.

    After it completed we ran DBCC CHECKFILEGROUP to see whether the filegroup was still corrupt, and it was. We believe the CHECKDB that is currently running might fix those errors; once it finishes, I'll re-run DBCC CHECKFILEGROUP to see whether it still finds any corruption.

    SQL Message:

    Date 9/18/2019 11:40:48 AM

    Log SQL Server (Current - 9/18/2019 12:01:00 PM)

    Source spid67

    Message

    DBCC CHECKDB (DB_Clone, repair_allow_data_loss) WITH all_errormsgs, no_infomsgs executed by domain\user found 19 errors and repaired 16 errors. Elapsed time: 3 hours 27 minutes 31 seconds.

    With this third DBCC CHECKDB we'll be at almost 14 hours of downtime.

  • The last one completed and this was the SQL Server message:

    DBCC CHECKDB WITH REPAIR_ALLOW_DATA_LOSS

    Date 9/18/2019 4:02:02 PM

    Log SQL Server (Current - 9/18/2019 4:44:00 PM)

    Source spid67

    Message

    DBCC CHECKDB (DB_CLONE, repair_allow_data_loss) WITH all_errormsgs, no_infomsgs executed by domain\user found 5 errors and repaired 5 errors. Elapsed time: 3 hours 59 minutes 5 seconds.

    I think it's supposed to fix the errors it found inside the database, but when I ran DBCC CHECKFILEGROUP this was the output:

    The good thing about running DBCC CHECKDB WITH REPAIR_ALLOW_DATA_LOSS was that, thanks to the fixes, we could shrink the data file down to about 1 GB, when it had been larger than 130 GB. We did this so that when we tried DBCC CHECKFILEGROUP WITH REPAIR_ALLOW_DATA_LOSS it would not take too long.
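
    The shrink itself is a one-liner (sketch; the logical file name is a placeholder and the target size is in MB):

    -- Sketch only; shrink the filegroup's data file down to roughly 1 GB
    DBCC SHRINKFILE (N'FG_DataFile1', 1024);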

    DBCC CHECKFILEGROUP

    Date 9/18/2019 4:43:56 PM

    Log SQL Server (Current - 9/18/2019 4:44:00 PM)

    Source spid67

    Message

    DBCC CHECKFILEGROUP (FGName) WITH all_errormsgs, no_infomsgs executed by domain\user found 5 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 8 seconds.

    DBCC CHECKFILEGROUP WITH REPAIR_ALLOW_DATA_LOSS

    SQL Server log output:

    Date      9/19/2019 7:48:31 AM

    Log         SQL Server (Current - 9/19/2019 7:48:00 AM)

    Source  spid67

    Message

    DBCC CHECKFILEGROUP (FGName, repair_allow_data_loss) WITH all_errormsgs, no_infomsgs executed by domain\user found 10 errors and repaired 5 errors. Elapsed time: 0 hours 0 minutes 2 seconds.

    The weird thing about trying to fix the filegroup with DBCC CHECKFILEGROUP and REPAIR_ALLOW_DATA_LOSS is that it ALWAYS finds the SAME errors, on the same PAGES, in the same filegroup: it always finds 10 errors and repairs 5. Even when I re-run it, it finds the same 10 errors on the same pages and fixes 5 of them. It's like an infinite loop of the same errors and the same repairs.

    SSMS output:

    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368768) with latch type SH. Invalid page ID. failed.
    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368772) with latch type SH. Invalid page ID. failed.
    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368773) with latch type SH. Invalid page ID. failed.
    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368774) with latch type SH. Invalid page ID. failed.
    Msg 8966, Level 16, State 2, Line 8
    Unable to read and latch page (34:10368775) with latch type SH. Invalid page ID. failed.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368768) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596187836416 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368772) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188164096 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368773) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188229632 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368774) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188295168 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    Msg 2575, Level 16, State 1, Line 8
    The Index Allocation Map (IAM) page (34:10368775) is pointed to by the next pointer of IAM page (0:0) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057596188360704 (type Unknown), but it was not detected in the scan.
    The error has been repaired.
    CHECKFILEGROUP found 5 allocation errors and 5 consistency errors not associated with any single object.
    CHECKFILEGROUP fixed 5 allocation errors and 0 consistency errors not associated with any single object.
    CHECKFILEGROUP found 5 allocation errors and 5 consistency errors in database 'DB_Clone'.
    CHECKFILEGROUP fixed 5 allocation errors and 0 consistency errors in database 'DB_Clone'.
    is the minimum repair level for the errors found by DBCC CHECKFILEGROUP (FGNAme, repair_allow_data_loss).

    I don't think there's a way to salvage the filegroup unless you are really, really good with internals and can modify data files through a hex editor, or write pages directly into the data files to tweak them (something I'm obviously not capable of doing, since it requires really good knowledge of the storage engine).

    Or maybe, if you have your backups, you could restore to the same point in time and swap the corrupted file for a non-corrupted one (I have not tested this, but it might work); a rough sketch of the idea follows.
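
    Untested sketch of that swap via a file-level restore (the database name, logical file name, and paths are all placeholders): restore just the damaged file from the last clean backup, then roll it forward with the log backups:

    -- Untested sketch; [DB_Clone], FG_DataFile1, and the backup paths are placeholders
    RESTORE DATABASE [DB_Clone]
        FILE = N'FG_DataFile1'
        FROM DISK = N'X:\Backups\DB_Clone_full.bak'
        WITH NORECOVERY;
    RESTORE LOG [DB_Clone]
        FROM DISK = N'X:\Backups\DB_Clone_log_01.trn'
        WITH NORECOVERY;
    -- ...apply the remaining log backups in sequence...
    RESTORE DATABASE [DB_Clone] WITH RECOVERY;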

    This amounts to 17+ hours of downtime.

  • Better get a thorough review of hardware and start ensuring backups are up to date. This is almost always hardware somehow causing a corrupt write in the filesystem.

    Double-check all firmware versions. I'd also open a case with MS and make sure you discover the root cause, or you may be in for the same thing again.

  • Thanks for your replies, guys.

    We could not remove the corrupted filegroup.

    Mr. Steve, we found the issue and fixed it ASAP. We have open cases with our different partners and are working out how to tackle it. We are restoring now, thanks 😀

    I sent an email to Mr. Paul Randal; he checked the post and replied saying that it cannot be fixed, and that the best thing to do is restore or move as much data as possible to a new database. (Thank you so much for replying to the email, Mr. Randal.)

    We dropped the tests and restored.

    Best regards,

  • Good luck and hope you don't lose much data.

  • Thanks!

    We were able to restore, and even though the database was corrupted, we used the standard restore procedure.

    Full > differential > logs

    At the time we hit the corruption, our full and differential backups weren't corrupted, and our transaction log backups kept coming out clean despite the database being corrupted.

    We could not take new full/differential backups since they would carry the corruption with them, so we didn't make any; we just kept taking log backups, since in this case the corruption did not get into the log. We have restored a couple of times in another environment and executed DBCC CHECKDB, and it completed successfully without corruption. The restore chain looked roughly like the sketch below.
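
    In script form, that full > differential > log chain is roughly (database name and paths are placeholders):

    -- Sketch of the restore chain; names and paths are placeholders
    RESTORE DATABASE [DB] FROM DISK = N'X:\Backups\DB_full.bak' WITH NORECOVERY, REPLACE;
    RESTORE DATABASE [DB] FROM DISK = N'X:\Backups\DB_diff.bak' WITH NORECOVERY;
    RESTORE LOG [DB] FROM DISK = N'X:\Backups\DB_log_01.trn' WITH NORECOVERY;
    -- ...apply the remaining log backups in order...
    RESTORE DATABASE [DB] WITH RECOVERY;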

    We did lose the data on the corrupted pages, but I think our team has been working on rebuilding the lost data.

    Thanks again! 🙂

  • What you need to do now is figure out why the data became corrupted. Do you have a disk or controller going bad? Was there a sudden power outage that took the box down in a sudden and unceremonious fashion?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I read each and every line of this conversation. Thanks, Alejandro, for posting this issue so clearly and keeping it updated. I agree with @jeff and @Steve. I found this doc helpful: https://docs.microsoft.com/en-us/sql/relational-databases/backup-restore/remove-defunct-filegroups-sql-server?redirectedfrom=MSDN&view=sql-server-2017

    SQL Database Recovery Expert 🙂

  • Yes, Jeff, I think the SAN guys found some issues and have been working with the provider. There were no outages; thank god it was only one filegroup.

    prettsons, it looks like defunct filegroups are part of a piecemeal restore. Since we have not taken backups of the corruption and have been working from log backups, we have not tried that option, but I think it would make a good experiment. Even after these tests, though, I think a piecemeal restore would run into issues; I believe the filegroup has to be in a consistent state to be released. A rough sketch of that route is below.
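
    For what it's worth, an untested sketch of the piecemeal route from the linked doc (all names and paths are placeholders): restore only the primary and healthy filegroups, never restore the damaged one so it becomes defunct, and then remove it:

    -- Untested sketch; [DB], FG_Good, FG_DataFile1, FGName, and the paths are placeholders
    RESTORE DATABASE [DB]
        FILEGROUP = 'PRIMARY', FILEGROUP = 'FG_Good'
        FROM DISK = N'X:\Backups\DB_full.bak'
        WITH PARTIAL, NORECOVERY;
    RESTORE LOG [DB] FROM DISK = N'X:\Backups\DB_log_01.trn' WITH NORECOVERY;
    RESTORE DATABASE [DB] WITH RECOVERY;
    -- Once the skipped filegroup is defunct, its files and then the filegroup can be removed
    ALTER DATABASE [DB] REMOVE FILE [FG_DataFile1];
    ALTER DATABASE [DB] REMOVE FILEGROUP [FGName];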

    If it comes up we could try it, but the message from management has been not to take backups of the corruption.

    best regards,
