January 28, 2014 at 9:21 am
Hi,
I have a Service Broker implementation for auditing web traffic. It deals with 20+ million messages a day without even quickening its pulse (I've ironed out issues over the years and it's now running smoothly and efficiently on a beefy new 2012 cluster).
Last night, just after 5pm, all my monitoring went mad - the queue was building up, server CPU had quadrupled, user connections were through the roof, and disk transfer rates and disk queue lengths were way up.
After much poking and prodding, I narrowed it down to transaction log activity - with no change to the rate of messages coming in, or their size, we had suddenly hit 10-15x the normal amount of transaction log activity. Processes were queued many deep in blocking chains, with the head of each chain stuck on the IO_COMPLETION, LOGBUFFER and WRITELOG wait types.
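For reference, spotting the blocking chains and wait types was along these lines (a minimal sketch, nothing specific to my setup):

-- Find requests that are blocked or stuck on log/IO waits
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,                      -- IO_COMPLETION, LOGBUFFER, WRITELOG etc.
    r.wait_time,
    r.command,
    DB_NAME(r.database_id) AS database_name
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
   OR r.wait_type IN (N'IO_COMPLETION', N'LOGBUFFER', N'WRITELOG')
ORDER BY r.wait_time DESC;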
I was up into the night changing backups around to prevent running out of disk space, while my transaction log backups every 15 minutes were now taking over half an hour each rather than their normal 10-15 seconds!
Fast forward to today: with the crisis averted (but all the odd behaviour still apparent), I had time to investigate properly.
What I found was, as far as I can ascertain*, a bunch of messages that came in around 5pm last night shuttling back and forth between the transmission queue and the target queue. This was the source of the massive transaction log activity, and normal messaging was proceeding, slowly, with whatever resources it could find. Ending the conversations for these rogue messages brought an end to the rampaging behaviour, and everything instantly settled down to my normal near-empty queue, low CPU usage, etc.
* There were a number of messages on the target queue that were queued around 5pm yesterday; the numbers were going up and down each time I checked, and the same was happening on the transmission queue. Put that together with the fact that the 'extra' activity was all centred around massive transaction logs, and the only conclusion I can come to is that these messages were bouncing back and forth between the two queues.
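In case it's useful, the checks (and the eventual cleanup) were roughly along these lines - queue names and the way the suspect conversation is picked are placeholders here, not my real objects:

-- What is sitting in the transmission queue, and why it has not moved on
SELECT conversation_handle, to_service_name, enqueue_time, transmission_status
FROM sys.transmission_queue
ORDER BY enqueue_time;

-- Rough per-conversation counts on the target queue (dbo.TargetQueue is a placeholder name)
SELECT conversation_handle, COUNT(*) AS message_count
FROM dbo.TargetQueue
GROUP BY conversation_handle
ORDER BY message_count DESC;

-- Ending one of the rogue conversations without waiting for the other side
DECLARE @rogue_handle uniqueidentifier;
SELECT TOP (1) @rogue_handle = conversation_handle
FROM sys.transmission_queue;           -- pick a suspect conversation (illustrative only)

IF @rogue_handle IS NOT NULL
    END CONVERSATION @rogue_handle WITH CLEANUP;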
My question, if you've bravely made it this far, is simply what on earth was going on?!?!?
Have you ever heard of this before?
What can cause this behaviour?
Could it be the result of malign activity from outside? We did have some spider activity around the same time that looks a bit like penetration testing - could they be related? I'm not at all convinced, as all the outside world can do is put things in the XML, and the queue doesn't care what's in it, just the nature of the queues, routes, message types, etc. Or is it something in my setup, or in SQL Server's handling of Service Broker queues?
Any insight would be very welcome, as I have no idea what mechanism could be responsible, and of course I'd like to prevent it happening again.
Thanks,
Dave.
January 28, 2014 at 9:47 am
Hmm,
I'm not an expert, but I was wondering if you have poison message handling enabled on your queues? My first thought was that you may have had something that was causing the transaction to rollback continuously.
January 29, 2014 at 4:27 am
My understanding (and if it's wrong, I'd be happy to be corrected):
The queue was not being poisoned - poison message handling disables the queue, and that did not happen.
Also, poisoning happens because of errors in the processing: the target queue would be poisoned if an unprocessable message were repeatedly received, causing the transaction to roll back. Five consecutive rollbacks and the queue is disabled.
I have error handling in place to extract any unprocessable messages and store them in a separate table, and further processing to prevent transaction-breaking errors that would prevent such storage - like type-conversion errors.
None of these 'shuttling' messages ended up in my failed message repository, there is no sign they were processed, I have no errors, and no disabled queues (or reports from my queue restarter job - sometimes the queue is disabled for other reasons, such as processing deadlocks, so I have a job that checks the state of the queue, restarts it if necessary, and emails me when it does).
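For what it's worth, the restarter job does little more than this (queue name and email address are placeholders, and it assumes Database Mail is configured):

IF EXISTS (SELECT 1
           FROM sys.service_queues
           WHERE name = N'TargetQueue'          -- placeholder queue name
             AND is_receive_enabled = 0)        -- the queue has been disabled
BEGIN
    ALTER QUEUE dbo.TargetQueue WITH STATUS = ON;

    EXEC msdb.dbo.sp_send_dbmail                -- notify that a restart was needed
        @recipients = N'dba@example.com',
        @subject    = N'TargetQueue was disabled and has been re-enabled';
END;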
The shuttling between the transmission and target queues happens earlier in the pipeline than the message processing, at which point the content of the XML is irrelevant as it is never decoded. Unless there's a way for failed processing to avoid my error handling (the RECEIVE is within the TRY...CATCH block that contains the error handling) and push a message back not onto the target queue, but onto the transmission queue? That sounds unlikely, but as usual, if you know of a mechanism or have seen one, please let me know.
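To give a rough idea of the shape of the processing (a heavily simplified sketch - the table, queue and column names are placeholders, not my real objects):

DECLARE @handle       uniqueidentifier,
        @message_type sysname,
        @body         varbinary(max);

BEGIN TRY
    BEGIN TRANSACTION;

    WAITFOR (
        RECEIVE TOP (1)
            @handle       = conversation_handle,
            @message_type = message_type_name,
            @body         = message_body
        FROM dbo.TargetQueue                     -- placeholder queue name
    ), TIMEOUT 5000;

    IF @handle IS NOT NULL
    BEGIN
        -- TRY_CONVERT avoids type-conversion errors breaking the transaction
        DECLARE @payload xml = TRY_CONVERT(xml, @body);

        IF @payload IS NULL
            INSERT INTO dbo.FailedMessages (conversation_handle, message_type, message_body)
            VALUES (@handle, @message_type, @body);
        -- ELSE: normal processing of @payload goes here
    END;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- If the transaction is still usable, keep the RECEIVE and park the message
    -- in the failed-message table; otherwise roll back and let it be retried.
    IF XACT_STATE() = 1
    BEGIN
        INSERT INTO dbo.FailedMessages (conversation_handle, message_type, message_body)
        VALUES (@handle, @message_type, @body);
        COMMIT TRANSACTION;
    END
    ELSE IF XACT_STATE() = -1
        ROLLBACK TRANSACTION;
END CATCH;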
January 29, 2014 at 9:39 am
January 31, 2014 at 2:53 am
Thanks Jim,
Anyone else have anything to suggest? There must be some Service Broker experts out there in this community?
March 18, 2014 at 7:10 am
I just noticed something in SP1 CU9 for SQL Server 2012 that may relate to what you encountered, but it is in the context of an AlwaysOn availability group: