March 23, 2012 at 11:29 am
Today developers reported errors when running queries that had locking hints in them - for example:
SELECT TOP 500 * FROM [CORElibrary].[dbo].[Auth_Primary] AS a WITH (READUNCOMMITTED)
Msg 601, Level 12, State 3, Line 1
Could not continue scan with NOLOCK due to data movement.
I asked them to remove the hint and try again, and they got:
Msg 605, Level 21, State 3, Line 5
Attempt to fetch logical page (1:229734) in database 24 failed. It belongs to allocation unit 72057594251182080 not to 72057594275561472.
I got suspicious and checked into our replication setup. All looked good with the log reader agents and so forth, but when I checked "View Details", I noticed that since last night at 8:12pm the following error had happened (over and over again):
Attempt to fetch logical page (1:229734) in database 24 failed. It belongs to allocation unit 72057594251182080 not to 72057594275561472. (Source: MSSQLServer, Error number: 605)
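(For what it's worth, the two allocation unit IDs in a 605 message can be mapped back to their owning objects with something like this, run in the affected database:)

-- For in-row data, sys.allocation_units.container_id joins to sys.partitions.hobt_id.
SELECT au.allocation_unit_id,
       OBJECT_NAME(p.object_id) AS table_name,
       p.index_id
FROM sys.allocation_units AS au
JOIN sys.partitions AS p ON au.container_id = p.hobt_id
WHERE au.allocation_unit_id IN (72057594251182080, 72057594275561472);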
I immediately attempted DBCC CHECKDB('CORELIBRARY') WITH NO_INFOMSGS, ALL_ERRORMSGS and received the following:
Msg 8928, Level 16, State 1, Line 2
Object ID 1081106942, index ID 1, partition ID 72057594254786560, alloc unit ID 72057594275561472 (type In-row data): Page (1:221165) could not be processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 2
Table error: Object ID 1081106942, index ID 1, partition ID 72057594254786560, alloc unit ID 72057594275561472 (type In-row data), page (1:221165). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 63047689 and -4.
Msg 8976, Level 16, State 1, Line 2
Table error: Object ID 1081106942, index ID 1, partition ID 72057594254786560, alloc unit ID 72057594275561472 (type In-row data). Page (1:221165) was not seen in the scan although its parent (1:221323) and previous (1:221164) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 2
Table error: Object ID 1081106942, index ID 1, partition ID 72057594254786560, alloc unit ID 72057594275561472 (type In-row data). Page (1:221166) is missing a reference from previous page (1:221165). Possible chain linkage problem.
Msg 2533, Level 16, State 1, Line 2
Table error: page (1:229734) allocated to object ID 1081106942, index ID 1, partition ID 72057594254786560, alloc unit ID 72057594275561472 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8976, Level 16, State 1, Line 2
Table error: Object ID 1081106942, index ID 1, partition ID 72057594254786560, alloc unit ID 72057594275561472 (type In-row data). Page (1:229734) was not seen in the scan although its parent (1:229881) and previous (1:229733) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 2
Table error: Object ID 1081106942, index ID 1, partition ID 72057594254786560, alloc unit ID 72057594275561472 (type In-row data). Page (1:229735) is missing a reference from previous page (1:229734). Possible chain linkage problem.
CHECKDB found 0 allocation errors and 7 consistency errors in table 'Auth_Primary' (object ID 1081106942).
CHECKDB found 0 allocation errors and 7 consistency errors in database 'CORELIBRARY'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (CORELIBRARY).
Sweet. Bad index, I'll just drop and recreate it and hope for the best? Nope...
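For reference, the attempt was along these lines (a sketch - the constraint and key column names here are placeholders):

-- Drop and recreate the clustered primary key (hypothetical names).
ALTER TABLE dbo.Auth_Primary DROP CONSTRAINT PK_Auth_Primary;
ALTER TABLE dbo.Auth_Primary
    ADD CONSTRAINT PK_Auth_Primary PRIMARY KEY CLUSTERED (AuthID);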
Msg 1750, Level 16, State 0, Line 3
Could not create constraint. See previous errors.
The statement has been terminated.
Msg 605, Level 21, State 3, Line 3
Attempt to fetch logical page (1:229734) in database 24 failed. It belongs to allocation unit 72057594251182080 not to 72057594275561472.
Okay, so I tried to get to the bottom of the consistency errors: I put the DB in SINGLE_USER mode and ran DBCC CHECKDB('CORELIBRARY', REPAIR_ALLOW_DATA_LOSS) - results attached :: CHECKDB_REPAIR1.txt
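Spelled out, the repair run was roughly this (kick everyone out first; WITH ROLLBACK IMMEDIATE rolls back any open transactions):

-- Take exclusive access, repair, then reopen the database.
ALTER DATABASE CORELIBRARY SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB ('CORELIBRARY', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE CORELIBRARY SET MULTI_USER;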
Thought it was pretty strange that while I was in SINGLE_USER mode running the above SQL, my process was deadlocked on another process, but okay...
I ran CHECKDB again with error messages only and everything came back clean - awesome! Went to try and recreate the dropped clustered index above - no go:
Msg 1750, Level 16, State 0, Line 3
Could not create constraint. See previous errors.
The statement has been terminated.
Msg 605, Level 21, State 3, Line 3
Attempt to fetch logical page (1:229734) in database 24 failed. It belongs to allocation unit 72057594251182080 not to 72057594275561472.
Running another CHECKDB with no info msgs, all error msgs returns more bad news...
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:229734) allocated to object ID 1081106942, index ID 0, partition ID 72057594264158208, alloc unit ID 72057594275561472 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
CHECKDB found 0 allocation errors and 1 consistency errors in table 'Auth_Primary' (object ID 1081106942).
CHECKDB found 0 allocation errors and 1 consistency errors in database 'CORELIBRARY'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (CORELIBRARY).
What on earth can I try next other than restoring the DB from a prior backup? I know I could do that, but this is a database that is transactionally replicated to, and I honestly wouldn't know how to go about getting it back in sync (unless I completely redid the replication for that database)
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
March 23, 2012 at 11:38 am
This kind of error is precisely why NoLock is recommended AGAINST by serious database professionals. Devs love it because it makes them feel like they just discovered a magical "execute faster" button for their queries. But it causes this kind of error fairly commonly on busy systems.
In your dev/QA environment, try implementing Read Committed Snapshot Isolation and getting rid of all the NoLock/Read Uncommitted hints. See if that allows for the level of performance your systems need. If so, migrate that to production in your standard manner.
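Turning it on is a one-liner (it needs a moment of exclusive access, hence the ROLLBACK IMMEDIATE - test in dev first, obviously):

-- Switch the database to Read Committed Snapshot Isolation.
-- ROLLBACK IMMEDIATE rolls back other connections so the ALTER can complete.
ALTER DATABASE CORELIBRARY SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Note: explicit NOLOCK / READUNCOMMITTED hints still override RCSI,
-- so the hints do need to come out of the queries as well.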
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 23, 2012 at 11:40 am
As far as the corrupted database goes, I think restoring from backups is your best bet. Assuming you have regular log backups and can do point-in-time recovery, of course.
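Roughly this shape, if the backups are there (file names and the STOPAT time below are placeholders):

-- Full backup, then the log chain, stopping just before the corruption appeared.
RESTORE DATABASE CORELIBRARY FROM DISK = N'CORELIBRARY_full.bak' WITH NORECOVERY, REPLACE;
RESTORE LOG CORELIBRARY FROM DISK = N'CORELIBRARY_log_1.trn' WITH NORECOVERY;
RESTORE LOG CORELIBRARY FROM DISK = N'CORELIBRARY_log_2.trn'
    WITH STOPAT = '2012-03-22 20:00', RECOVERY;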
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 23, 2012 at 11:54 am
Thanks for the input! Regarding the restore... due to resource restrictions I have all these databases in SIMPLE recovery mode, as they are pretty much only used for reporting. In the event something like this happens, redoing the publication/subscription is generally quicker than restoring from the latest backups/log files.
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
March 23, 2012 at 11:59 am
I've got databases in the same situation. Whatever your rebuild/restore methodology is, it'll be better than "repair allow data loss".
And definitely look into the various snapshot isolation strategies available, so you can avoid this kind of thing in the future.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 23, 2012 at 12:09 pm
I've read that READ COMMITTED SNAPSHOT can put some serious bloat across tempdb, as it needs to put everything in there temporarily while a query "does what it needs to do" - not that I don't trust our developers (which I clearly don't), but they write queries that join 20+ tables, on views that have another 6-7 tables within them, with aggregates, ORDER BY, and SELECT DISTINCT in them (all nice juicy tempdb ingredients).
Any thoughts on that?
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
March 23, 2012 at 1:29 pm
Do you have a lot of large updates/deletes happening in the table on a routine basis?
What snapshot isolation (either flavor) does is take the rows that are being updated/deleted and put the "before version" of them in tempdb, with a row version number appended. Any concurrent selects then get the data from tempdb's version store instead of the main table until the update/delete is done, after which they read from the live table again. The rows stay in tempdb only as long as they are being updated, or for as long after that as an active select is holding onto them.
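A quick sketch of the behavior, assuming RCSI is already on and using hypothetical column names (AuthID/AuthName):

-- Session 1: modify a row inside an open transaction (locks held).
BEGIN TRANSACTION;
UPDATE dbo.Auth_Primary
SET AuthName = 'New value'
WHERE AuthID = 42;
-- ...transaction left open...

-- Session 2: under RCSI this SELECT does not block and needs no NOLOCK;
-- it reads the last committed version of the row from tempdb's version store.
SELECT AuthName
FROM dbo.Auth_Primary
WHERE AuthID = 42;   -- returns the pre-update value

-- Session 1:
COMMIT;   -- 'New value' becomes the committed version; the version-store row can be cleaned up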
So, if you have dozens of updates that each hit hundreds/thousands/etc. of rows at a time, then tempdb can blow up pretty heavily. Often still better than the alternatives (like a corrupted database), but it can be problematic.
On the other hand, if updates and deletes are usually small (touching a data page or less is best) and/or infrequent, then tempdb usually won't blow up much, if at all.
For example, I have a database with over 5 million name-and-address records in it - 18 Gig worth. I have it in Read Committed Snapshot Isolation, and it gets a lot (thousands per hour) of inserts and updates. There are another 34 databases on that server, also in RCSI mode. tempdb on that server is 197 Meg. Trivial, no-big-deal kind of thing.
That's because most of the data modifications are one row at a time, and the bigger mods (a whole day's data at once, potentially thousands of rows) are done once per day. So the row-version-store in tempdb isn't a significant bottleneck.
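If you want to watch it, the version store's footprint in tempdb shows up in the DMVs - a quick sketch:

-- Space reserved by the row version store in tempdb (pages are 8 KB).
SELECT SUM(version_store_reserved_page_count) * 8 / 1024.0 AS version_store_mb
FROM tempdb.sys.dm_db_file_space_usage;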
What does your data modification load look like? Not selects, just update/insert/delete. How big, how frequent, and how long-running?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 23, 2012 at 2:24 pm
So I was finally able to get a clean CHECKDB and have isolated the source of the problem: there are 3 replicated transactions/records that no longer exist at the subscriber... From BOL:
This error is raised in transactional replication if the Distribution Agent attempts to update a row at the Subscriber, but the row has been deleted or the primary key of the row has been changed. By default, Subscribers to transactional publications should be treated as read-only, because changes are not propagated back to the Publisher. For transactional replication, user changes should be made at the Subscriber only if updatable subscriptions or peer-to-peer replication is used. For information about these options, see Updatable Subscriptions for Transactional Replication and Peer-to-Peer Transactional Replication.
To temporarily get things back up and running so daily reports can continue, I've added the -SkipErrors 20598 parameter to the Distribution Agent so it can keep doing what it needs to do while I track down these three bad records...
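The parameter just gets appended to the Distribution Agent's job step command (server and database names below are placeholders), and anything skipped gets logged in the distribution database where it can be chased down:

-- Distribution Agent job step command with the skip appended:
-- -Publisher PUBSRV -PublisherDB CORELIBRARY -Subscriber SUBSRV -SubscriberDB CORELIBRARY
--     -Distributor DISTSRV -Continuous -SkipErrors 20598

-- Skipped commands/errors are recorded in the distribution database:
SELECT time, error_code, error_text, xact_seqno
FROM distribution.dbo.MSrepl_errors
ORDER BY time DESC;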
What I have difficulty understanding is "how" this issue could have occurred when everyone's access to this DB is READ ONLY. How could the record no longer exist/be changed at the subscriber???
Oh the joys of replication...
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
March 23, 2012 at 2:49 pm
MyDoggieJessie (3/23/2012)
What I have difficulty understanding is "how" this issue could have occurred when everyone's access to this DB is READ ONLY. How could the record no longer exist/be changed at the subscriber???
That's the DB that you ran CheckDB with Repair_Allow_Data_Loss on?
If so, the rows are missing because the repair deallocated a page of data.
Repair: The page (1:229734) has been deallocated from object ID 1081106942, index ID 0, partition ID 72057594264158208, alloc unit ID 72057594275561472 (type In-row data).
It's called Repair_allow_data_loss for a reason.
As for the additional errors, it can happen that one corruption error masks a second, so you repair one and a second appears.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
March 23, 2012 at 2:51 pm
MyDoggieJessie (3/23/2012)
Sweet. Bad index, I'll just drop and recreate it and hope for the best?
No, index 1 is the clustered index - that's data damaged in the base table. Repairing is going to discard data. Better to restore from backup or otherwise recreate the DB.
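To see for yourself what those IDs refer to, run something like this in the affected database:

-- index_id 0 = heap, 1 = clustered index, >1 = nonclustered.
SELECT OBJECT_NAME(i.object_id) AS table_name, i.name AS index_name, i.index_id, i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = 1081106942;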
Take a look at this article. http://www.sqlservercentral.com/articles/65804/
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
March 23, 2012 at 10:12 pm
Gail, thanks for the article, very informative and appreciated.
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience