Yeah! my first real corrupt DB

  • Paul Randal (8/10/2011)


    http://www.sqlskills.com/T_MCMVideos.asp

    Ok, but where's the advanced stuff? :hehe:

    Ok so take 3 months off, checked!

    Brain surgury, checked!

  • Hey guys, I'm starting to think I've hit a bug in MS.

    We have not completed the whole machine check up yet but we have not found any issues so far. I went ahead and restored another backup from our latest PROD db on the same server. I ran the same exact query exactly twice and I got the same error message.

    I'm currently replicating the test on 2 more servers (perfect vm copies, different san, different physical host). I'd like to know what / how I need to go about this in case I need to post a bug on connect. What do I need to log, save etc.

  • In that case you should call Product Support to help you out.

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • Ok, I'll call PSS if we don't find anything wrong with the vm.

    Any way I can test the assumption that my prod db logs are not corrupt to start with? CheckDB returns nothing but then again it was the case on the other db.

    I'm thinking fill the logs to 100% and roll back the transaction. Is that a valid test?

  • 99% sure it's a bug in MS now.

    I tried a lot of combos, new vms, new hardware, shrunk the log file, added a new log file, created a new blank db and swapped the log file.

    All test failed exactly at the same time, the same way.

    Next week I'll try to repro the bug in a new, smaller db with sharable data so you guys can play with it.

    Have a great week-end.

  • @@version?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Microsoft SQL Server 2005 - 9.00.4035.00 (Intel X86) Nov 24 2008 13:01:59 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

    AND

    Microsoft SQL Server 2005 - 9.00.3042.00 (Intel X86) Feb 9 2007 22:47:07 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

    I wanted to try on 2008 & R2 too but I don't have dev or trial installed here.

  • Patch to SP4 first and confirm that the behaviour's unchanged (CSS will almost certainly want you to do that)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (8/11/2011)


    Patch to SP4 first and confirm that the behaviour's unchanged (CSS will almost certainly want you to do that)

    Tx, I had forgotten about that one (1 call ever, 6 years ago 😛 .... been lucky on that end).

    Am I better off going to the latest CU as well just to cover all bases, don't want to go through 2 approvals for the upgrade?

    You guys are interested in playing with a test repro DB? Just let me know and I'll upload either here or on my ftp.

  • GilaMonster (8/11/2011)


    Patch to SP4 first and confirm that the behaviour's unchanged (CSS will almost certainly want you to do that)

    Am I better off proving this is a bug in 2008 & R2 before calling this in? IIRC 2k5 is out of support unless you have extended support.

    Do we have to pay for such a call (would have to approve the budget first)? Or do we need to pay upfront and then refund if this is really a bug and hence a free call?

  • Hey guys, finally got around to building a demo db that still fails the exact same way my "prod" db was failing.

    This data is 100% obfuscated and only a 3 MB download.

    Here are my findings. I've tested both on 2005 SP3 and 2008 R2 Express RTM.

    On 2K5 it fails with this error with the DB sent into suspect mode :

    Msg 3316, Level 21, State 2, Line 135

    During undo of a logged operation in database 'Remi', an error occurred at log record ID (101944:25015:51). The row was not found. Restore the database from a full backup, or repair the database.

    Msg 3314, Level 21, State 4, Line 135

    During undoing of a logged operation in database 'Remi', an error occurred at log record ID (101944:25015:51). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

    Msg 9001, Level 21, State 1, Line 135

    The log for database 'Remi' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

    Msg 3314, Level 21, State 5, Line 135

    During undoing of a logged operation in database 'Remi', an error occurred at log record ID (101944:22501:1). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

    Msg 0, Level 20, State 0, Line 0

    A severe error occurred on the current command. The results, if any, should be discarded.

    On the prod version I had the same error, only 45 events instead of 2. I've pruned a lot of data and all the useless columns which is only normal that you get less linkage errors.

    When running checkdb repair_allow_data_loss I get this :

    Repair: The Clustered index successfully rebuilt for the object "dbo.RPT_Historique_Couts" in database "Remi".

    Repair: The Nonclustered index successfully rebuilt for the object "dbo.RPT_Historique_Couts, PK_RPT_Historique_Couts_1" in database "Remi".

    Msg 8945, Level 16, State 1, Line 1

    Table error: Object ID 1935462815, index ID 1 will be rebuilt.

    The error has been repaired.

    Msg 2511, Level 16, State 1, Line 1

    Table error: Object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type In-row data). Keys out of order on page (3:528), slots 16 and 17.

    The error has been repaired.

    Msg 2511, Level 16, State 1, Line 1

    Table error: Object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type In-row data). Keys out of order on page (3:528), slots 35 and 36.

    The error has been repaired.

    Msg 8945, Level 16, State 1, Line 1

    Table error: Object ID 1935462815, index ID 2 will be rebuilt.

    The error has been repaired.

    CHECKDB found 0 allocation errors and 2 consistency errors in table 'RPT_Historique_Couts' (object ID 1935462815).

    CHECKDB fixed 0 allocation errors and 2 consistency errors in table 'RPT_Historique_Couts' (object ID 1935462815).

    CHECKDB found 0 allocation errors and 2 consistency errors in database 'Remi'.

    CHECKDB fixed 0 allocation errors and 2 consistency errors in database 'Remi'.

    The good news is that I don't have any dataloss (except all the time offline and previous PIT restore)

    On 2K8 R2 I get a completely different error (actually much less useful). But the db still goes into suspect mode

    Msg 208, Level 16, State 1, Line 3

    Invalid object name 'dbo.RPT_Historique_Couts'.

    Run this to see if it brings it back online :

    ALTER DATABASE Remi SET EMERGENCY

    ALTER DATABASE Remi SET SINGLE_USER

    DBCC CHECKDB('Remi', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS

    ALTER DATABASE Remi SET MULTI_USER

    This time CHECKDB doesn't recover (says it doesn't). It looks like the code failed in a different way on 2K8 R2 :

    Repair: The Clustered index successfully rebuilt for the object "dbo.RPT_Historique_Couts" in database "Remi".

    Repair: The Nonclustered index successfully rebuilt for the object "dbo.RPT_Historique_Couts, PK_RPT_Historique_Couts_1" in database "Remi".

    Msg 8945, Level 16, State 1, Line 3

    Table error: Object ID 1935462815, index ID 1 will be rebuilt.

    The error has been repaired.

    Msg 2570, Level 16, State 3, Line 3

    Page (3:528), slot 16 in object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type "In-row data"). Column "DateHistorique_fin" value is out of range for data type "datetime". Update column to a legal value.

    The system cannot self repair this error.

    Msg 2511, Level 16, State 1, Line 3

    Table error: Object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type In-row data). Keys out of order on page (3:528), slots 35 and 36.

    The error has been repaired.

    Msg 2570, Level 16, State 3, Line 3

    Page (3:528), slot 35 in object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type "In-row data"). Column "DateHistorique_fin" value is out of range for data type "datetime". Update column to a legal value.

    The system cannot self repair this error.

    Msg 2511, Level 16, State 1, Line 3

    Table error: Object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type In-row data). Keys out of order on page (3:528), slots 56 and 57.

    The error has been repaired.

    Msg 2570, Level 16, State 3, Line 3

    Page (3:528), slot 56 in object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type "In-row data"). Column "DateHistorique_fin" value is out of range for data type "datetime". Update column to a legal value.

    The system cannot self repair this error.

    Msg 2511, Level 16, State 1, Line 3

    Table error: Object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type In-row data). Keys out of order on page (3:528), slots 73 and 74.

    The error has been repaired.

    Msg 2570, Level 16, State 3, Line 3

    Page (3:528), slot 73 in object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type "In-row data"). Column "DateHistorique_fin" value is out of range for data type "datetime". Update column to a legal value.

    The system cannot self repair this error.

    Msg 2511, Level 16, State 1, Line 3

    Table error: Object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type In-row data). Keys out of order on page (3:528), slots 86 and 87.

    The error has been repaired.

    Msg 2570, Level 16, State 3, Line 3

    Page (3:528), slot 86 in object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type "In-row data"). Column "DateHistorique_fin" value is out of range for data type "datetime". Update column to a legal value.

    The system cannot self repair this error.

    Msg 2511, Level 16, State 1, Line 3

    Table error: Object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type In-row data). Keys out of order on page (3:528), slots 108 and 109.

    The error has been repaired.

    Msg 2570, Level 16, State 3, Line 3

    Page (3:528), slot 108 in object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type "In-row data"). Column "DateHistorique_fin" value is out of range for data type "datetime". Update column to a legal value.

    The system cannot self repair this error.

    Msg 2511, Level 16, State 1, Line 3

    Table error: Object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type In-row data). Keys out of order on page (3:528), slots 114 and 115.

    The error has been repaired.

    Msg 2570, Level 16, State 3, Line 3

    Page (3:528), slot 114 in object ID 1935462815, index ID 1, partition ID 72057599794216960, alloc unit ID 72057600371261440 (type "In-row data"). Column "DateHistorique_fin" value is out of range for data type "datetime". Update column to a legal value.

    The system cannot self repair this error.

    Msg 8945, Level 16, State 1, Line 3

    Table error: Object ID 1935462815, index ID 2 will be rebuilt.

    The error has been repaired.

    CHECKDB found 0 allocation errors and 13 consistency errors in table 'RPT_Historique_Couts' (object ID 1935462815).

    CHECKDB fixed 0 allocation errors and 6 consistency errors in table 'RPT_Historique_Couts' (object ID 1935462815).

    CHECKDB found 0 allocation errors and 13 consistency errors in database 'Remi'.

    CHECKDB fixed 0 allocation errors and 6 consistency errors in database 'Remi'.

    However if you check the rowcount, it's actually accurate.

    Now the REAL fun part is that if you do COMMIT instead of rollback at the end of the script, the operation succeeds and no corruption is found! That in both versions.

    Since I'm able to finish my work here my company has decided against opening a case with PSS. But I've been authorized to report this on connect.

    I'd love to get as much feedback as possible for the versions I've not tested. I'm most interested in the latest SPs for 2K5+ all the way to Denali.

    Please post your results along with the version(s) you tested on.

    TIA.

    Once I have it I'll open the connect and refference this thread.

  • Cool - if you can email me the setup script I'll play with it this weekend! paul @ sqlskills.com

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • When snapshot isolation is on, SQL Server makes a mess of the logging when changing the NULLability of the leading key column of a compound clustered index, causing error 3316 in XdesRMReadWrite::RollbackToLsn.

    The following script includes notes and reproduces the issue without restoring a database.

    USE master

    GO

    CREATE DATABASE Bang

    GO

    -- Has to be ON

    ALTER DATABASE Bang SET ALLOW_SNAPSHOT_ISOLATION ON

    GO

    USE Bang

    GO

    CREATE TABLE dbo.Test

    (

    col1INTEGER NOT NULL,

    col2INTEGER NOT NULL,

    col3INTEGER NOT NULL

    )

    -- col2 is leading key of multi-column clustered index

    -- no problem if index is not clustered

    -- uniqueness not important

    CREATE CLUSTERED INDEX cx ON dbo.Test (col2, col3, col1)

    -- Must add a row

    INSERT dbo.Test (col1, col2, col3) VALUES (0, 0, 0)

    -- Txn required

    BEGIN TRANSACTION

    -- Add a column - default not important, can be NULL

    ALTER TABLE dbo.Test ADD col4 INTEGER NOT NULL DEFAULT 0

    GO

    -- Update the new column (can be NULL)

    UPDATE dbo.Test SET col4 = 0

    -- Redefine cx leading key to allow NULL (bug!)

    ALTER TABLE dbo.Test

    ALTER COLUMN col2 INTEGER NULL

    /*

    Side note: This statement would now fail (as it should):

    ALTER TABLE dbo.Test

    ALTER COLUMN col2 INTEGER NOT NULL

    Msg 5074, Level 16, State 1, Line 1

    The index 'cx' is dependent on column 'col2'.

    Msg 4922, Level 16, State 9, Line 1

    ALTER TABLE ALTER COLUMN col2 failed because one or more objects access this column.

    */

    GO

    -- Add a row with NULL in cx leading key column

    INSERT dbo.Test

    (col1, col2, col3)

    VALUES

    (0, NULL, 0)

    -- Bang!

    ROLLBACK

    GO

    -- Reconnect, then:

    -- Clean up

    USE master

    ALTER DATABASE Bang SET SINGLE_USER WITH ROLLBACK IMMEDIATE

    DROP DATABASE Bang

    Validated on SQL Server 2008 R2 build 10.50.2772 (latest available at time of writing). This issue is fixed in Denali CTP 3.

  • Paul Randal (8/18/2011)


    Cool - if you can email me the setup script I'll play with it this weekend! paul @ sqlskills.com

    It's already in the zip file.

  • Only 1 problem @SqlWiki, [DateHistorique_fin] is not the leading column, it's the 2nd of 3 columns being changed from NOT NULL to NULLable.

    ALTER TABLE [dbo].[RPT_Historique_Couts] ADD CONSTRAINT [IX_Historique_Couts_NoArticle] UNIQUE CLUSTERED

    (

    [DateHistorique] ASC,

    [DateHistorique_fin] ASC,

    [NoArticle] ASC

    )

    I've also tried your demo and while it errors out on the same error, the DB doesn't go into suspect mode. I give it to you, it's pretty darn close, but it's not the same issue (unless your demo just needs more data to have the same output, I have ±1K pages in my table that gets corrupted).

    Do you have a KB link that talks about this?

    Downloading denali atm to try my code on it. Maybe they fixed it there too!

    P.S. Where do you Work Paul, I see your name associated with SqlCat. With all the hidden knowledge you seem to have it wouldn't surprise me the least if that were true :-D.

Viewing 15 posts - 31 through 45 (of 78 total)

You must be logged in to reply to this topic. Login to reply