Distribution clean up: distribution errors...

Question

Distribution clean up: distribution errors...

MyDoggieJessie

SSC-Forever

Points: 44515
More actions
November 8, 2011 at 10:00 pm

#246498

As of 7am this morning I've been seeing this error on our Transactional replication Distribution Cleanup Agent job:
"Could not continue scan with NOLOCK due to data movement"
I ran
DBCC CHECKDB ('Distribution') WITH NO_INFOMSGS, ALL_ERRORMSGS
And saw a lot of errors pertaining to the clustered index for MSrepl_commands. So my first attempts at fixing it all up where to drop the clustered index on the table, recreate it and then run the CHECKDB again. This worked. No more errors from the DBCC and no more pages are being sent to me.
Several hours go by and the job is running successfully until an hour ago. Same error being output to in the Agent log: "Could not continue scan with NOLOCK due to data movement"
I ran another CHECKDB (as shown above) and got even more errors than I did this morning:
Msg 8928, Level 16, State 1, Line 1
Object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data): Page (1:522132) could not be processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data), page (1:522132). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 113379337 and -4.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data). Page (1:522132) was not seen in the scan although its parent (1:522288) and previous (1:522151) refer to it. Check any previous errors.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:522135) allocated to object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data). Index node page (1:522288), slot 87 refers to child page (1:522135) and previous child (1:522132), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:522140) allocated to object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data). Page (1:522140) was not seen in the scan although its parent (1:522288) and previous (1:522161) refer to it. Check any previous errors.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:522141) allocated to object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data). Index node page (1:522288), slot 93 refers to child page (1:522141) and previous child (1:522140), but they were not encountered.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data). Page (1:522164) is missing a reference from previous page (1:522135). Possible chain linkage problem.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 197575742, index ID 1, partition ID 72057594051428352, alloc unit ID 72057594052739072 (type In-row data). Page (1:522168) is missing a reference from previous page (1:522141). Possible chain linkage problem.
CHECKDB found 0 allocation errors and 11 consistency errors in table 'MSrepl_commands' (object ID 197575742).
CHECKDB found 0 allocation errors and 11 consistency errors in database 'distribution'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (distribution).
And the job has been failing ever since (which I would expect at this point).
I guess I don't understand "why" this is happening so shortly after correcting the issue earlier today. Has anyone else come across this or something similar?
I did some checking online and found this article - http://social.msdn.microsoft.com/Forums/en-US/sqlreplication/thread/a3267f15-d613-407b-a5c5-16064e3d4d3d/ which pretty much tells me that the Distribution DB has gotten corrupted and I need to either restore from a recent backup, or redo the replication all over again (not something I really want to do (because it will involve an unschedule maintenance on one of our production servers), but will if there's no other way).
Has anyone comes across this before and if so, what steps did you take to resolve it which are different that what's described in the article above?
This is transactional replication, running on a nice 64-bit SQL 2008 EE server (lots of ram, 12 spindles, split across 2 arrays in RAID10)
______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply

MyDoggieJessie SSC-Forever Points: 44515 More actions · Answer 1

Well I wasn't able to figure out "why" this keeps happening but I was able to resolve it by manually running scome TSQL to clean up the problematic replication commands. For those who are interested...

Step 1 - In the GUI for the Replication Monitor, select the publication, then under the "All Subscriptions" tab, right-click on the agent and select "View Details". This should pull up the Subscriber details window.

Step 2 - Select the "Distributor To Subscriber" tab. Choose Synchronizations with errors from the View drop down list. In the 3 panes below you should see:

- Status/Start/End/Duration/Error Message

- 3 distinct actions from the sessions

- The error details for the selected session

In our situation the error is typically "The row was not found at the Subscriber when applying the replicated command". The magic key you'll need to help correct the issue(s) resides in the Error Details section - it's the Transaction sequence number i.e. 0x000018B900099481000400000000

Step 3 - Run this to verify the repl_command is still in the commands table:

SELECT * FROM MSrepl_commands

WHERE publisher_database_id = 3

AND xact_seqno = 0x000018B900099481000400000000

AND command_id = 1

Step 4 - Whack it!

DELETE

FROM MSrepl_commands

WHERE publisher_database_id = 3

AND xact_seqno = 0x000018B900099481000400000000

AND command_id = 1

Generally within 1-3 minutes the Distribution Agent will re-run and either it will spit out another error with a different Transaction Sequence Number, or you'll get lucky and get a nice green arrow and a Status od Completed. If you get another error just repeat the 2 steps above (SELECT/DELETE) and wait for the Distributor Agent to restart...

It took about 14 tries for me to get them all.

It's not a solution to the problem but it's a short-term fix until we get to a point where we can rebuild our transactional replication for this database.

______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience