Introduction
In part 1 of this series I described how to use the multi-threading capability of Service Broker to speed up execution of database-maintenance runs. This was done by spreading a series of independent tasks - statistics updates of individual tables - across multiple CPUs and running them in parallel. This approach works well, and I have found it useful in shrinking database-maintenance windows. There are potential pitfalls to be aware of, however. In this part I describe techniques for monitoring and troubleshooting Service-Broker execution and recovering from errors.
Much of what is described here is based on information I first came across in this link: How to troubleshoot Service Broker problems.
Monitoring
The following query returns information on all tasks currently activated by Service Broker:
SELECT at.spid , DB_NAME(at.database_id) AS [dbName] , at.queue_id , at.[procedure_name] , s.login_time , s.[status] FROM sys.dm_broker_activated_tasks at INNER JOIN sys.dm_exec_sessions s ON at.spid = s.session_id;
Query 1: Broker-activated tasks
In a scenario where the maximum number of threads (MaxQueueReaders) on the initiator and target queues has been set to 4, the output will be similar to this:
spid | dbName | queue_id | procedure_name | login_time | status |
77 | UTILITY | 818817979 | [dbo].[UpdStats_internalActivationTarget] | 2013-10-27 05:58:01.027 | sleeping |
14 | UTILITY | 818817979 | [dbo].[UpdStats_internalActivationTarget] | 2013-10-27 05:57:50.887 | sleeping |
40 | UTILITY | 818817979 | [dbo].[UpdStats_internalActivationTarget] | 2013-10-27 05:58:06.027 | sleeping |
48 | UTILITY | 818817979 | [dbo].[UpdStats_internalActivationTarget] | 2013-10-27 05:57:56.003 | sleeping |
Notice that status shows as "sleeping", even though these sessions are very much awake and active. At the moment I do not have a very good explanation for this, other than attributing it to the fact that these are background system processes running asynchronously.
Troubleshooting
Once we initiate the update-stats Service-Broker run ("EXEC [UTILITY].[dbo].[UpdStats_serviceBroker_beginDialog]"), there are several symptoms to alert us that something is wrong.
(1) After a few minutes these 3 queries return no results:
SELECT * FROM [UTILITY].[dbo].[UpdateStatsTables] WHERE processed = 1; SELECT * FROM [UTILITY].[dbo].[ServiceBroker_UpdStatsLog] WITH (NOLOCK); SELECT * FROM [UTILITY].[dbo].[ServiceBroker_EndConversation];
This means that no tables (even the smallest ones) have been processed, an unlikely scenario if the run was executing smoothly.
(2) Errors are recorded in the ServiceBroker_Errors table, errMsg column:
SELECT * FROM [UTILITY].[dbo].[ServiceBroker_Errors];
Example Outputs; errMsg column:
ErrorNumber - 916 : ErrorMessage - The server principal "utilDBOwner" is not able to access the database "SBdemo" under the current security context.
OR:
ErrorNumber - 9617 : ErrorMessage - The service queue "UpdStats_TargetQueue" is currently disabled.
(3) The query identifying the broker-activated tasks/sessions (Query 1 above) returns no results. If Service Broker is not running properly, we can follow these steps to fix it:
- Identify and fix the cause;
- Clean out all relevant conversations;
- Re-enable any queues that may have been disabled as a result of Poison Message Detection.
1. Identify and fix the cause
The first place to look at when troubleshooting Service-Broker issues is catalog view sys.transmission_queue. Every sent message sits here until the target sends back an acknowledgement that it received it. Normally, querying the sys.transmission_queue should return no data, even when the Service-Broker process is currently running. If data is returned, this may indicate a problem with the message transmission.
USE [UTILITY]; SELECT to_service_name , from_service_name , service_contract_name , enqueue_time , message_type_name ,transmission_status FROM sys.transmission_queue;
Query 2: Transmission Queue
If, for example, the database owner SID of the master database does not match that of the UTILITY database, we will get this error message under the transmission_status column:
An exception occurred while enqueueing a message in the target queue. Error: 33009, State: 2. The database owner SID recorded in the master database differs from the database owner SID recorded in database 'UTILITY'. You should correct this situation by resetting the owner of database 'UTILITY' using the ALTER AUTHORIZATION statement.
Or, if the UTILITY database is not broker-enabled:
The broker is disabled in the sender’s database.
Or, if the UTILITY database - the database on which Service Broker is running - is not set to TRUSTWORTHY ON:
One or more messages could not be delivered to the local service targeted by this dialog.
If we see the last message, in particular, we need to also check if one of the queues is disabled. Disabled queues are a common issue in Service Broker. I am showing how we deal with that shortly.
Here is a sample of rows from the query output of Query 2 (with the first message shown):
to_service_name | from_service_name | service_contract_name | enqueue_time | message_type_name | transmission_status |
//UTILITY/UpdStats/TargetService | //UTILITY/UpdStats/InitiatorService | //UTILITY/Contract | 2013-10-25 19:29:33.240 | //UTILITY/RequestMessage | An exception occurred while enqueueing a message in the target queue. Error: 33009... |
//UTILITY/UpdStats/TargetService | //UTILITY/UpdStats/InitiatorService | //UTILITY/Contract | 2013-10-25 19:29:33.237 | //UTILITY/RequestMessage | An exception occurred while enqueueing a message in the target queue. Error: 33009... |
//UTILITY/UpdStats/TargetService | //UTILITY/UpdStats/InitiatorService | //UTILITY/Contract | 2013-10-25 19:29:33.233 | //UTILITY/RequestMessage | An exception occurred while enqueueing a message in the target queue. Error: 33009... |
In this case the fix is to reset the database owner of the UTILITY database to "sa" and then follow the steps below to put Service Broker back in working condition.
(2) Clean out all conversations:
USE [UTILITY]; DECLARE @handle AS UNIQUEIDENTIFIER; DECLARE conv CURSOR FOR SELECT [conversation_handle] FROM sys.conversation_endpoints; OPEN conv; FETCH NEXT FROM conv INTO @handle; WHILE @@FETCH_STATUS = 0 BEGIN; END CONVERSATION @handle WITH CLEANUP; FETCH NEXT FROM conv INTO @handle; END; CLOSE conv; DEALLOCATE conv;
Query 3: Cleanup of Conversations
That will in effect empty out the sys.transmission_queue.
Checking now the UTILITY..ServiceBroker_Errors table, we may see the following error message that points us to the next and final step (errMsg column):
ErrorNumber - 9617 : ErrorMessage - The service queue "UpdStats_TargetQueue" is currently disabled.
(3) Identify and re-enable disabled queues:
When a transaction that contains a RECEIVE statement rolls back five times, Service Broker responds by disabling the queue that the transaction was to receive messages from. This capability is known as Poison Message Detection. A poison message is one that a service-broker application is unable to process and instead errors out.
In our setup the queue associated with the RECEIVE-statement transaction is the target queue, UpdStats_TargetQueue. Because of an earlier failure - for example, owner-SID mismatch between master and UTILITY databases - this transaction had to roll back multiple times (as many times as the number of records in the UTILITY..UpdateStatsTables table); once for every table in the target database, whose statistics we are updating. (In part 1 the example target database I used was SBDemo.) As a result, Service Broker disabled the target queue.
The following query returns the service queues, along with the enabled/disabled status for each:
USE [UTILITY]; SELECT name , is_activation_enabled , is_receive_enabled , is_enqueue_enabled FROM sys.service_queues WHERE name like 'UpdStats%';
Query 4: Service Queues and Enabled/Disabled Status
Output:
name | is_activation_enabled | is_receive_enabled | is_enqueue_enabled |
UpdStats_InitiatorQueue | 1 | 1 | 1 |
UpdStats_TargetQueue | 1 | 0 | 0 |
We see that the target queue is currently disabled. To enable it we run the following:
ALTER QUEUE [UpdStats_TargetQueue] WITH STATUS = ON;
Query 5: Enabling a Queue
At this point the Service-Broker setup is back in optimal health and should be ready to be used again as part of the UpdStats functionality.
To eliminate other errors, we simply need to repeat steps 1 to 3 until all errors are gone.
Aborting a Run
If, after having started the Service-Broker run, we change our mind and want to abort it, we can follow these steps:
(1) Clean out all conversations (Query 3 above).
(2) If step 1 is blocked by a session running a broker-activated task (Query 1), kill these sessions one by one, until step 1 is free to complete.
Summary
In this article I have described techniques for diagnosing and fixing issues that may come up when using Service Broker. The example I have used is that of executing Update-Statistics commands against the tables, one by one, of a target database. In a future article I hope to be able to demonstrate this methodology with other more disk-IO-bound tasks, such as backups and index rebuilds. As of now, I have found that such tasks do not benefit as much from the multi-threading capability of Service Broker. Newly emerging technologies, such as Fusion-IO, promise to enhance disk throughputs 10-fold relative to current values and could therefore be the ticket to exciting future applications of this methodology.