Checking for Stalled Agent Jobs

  • I am having a confusing situation here - I manage about 30 servers with about a google of jobs betwixt them. I have a few overlord jobs in place to check each of these servers for failed jobs on a daily basis. Works ok for the most part.

    However.

    I've come across a problem recently that I haven't seen since 6.5. (we're mostly a 7.0 shop with a few 2000 installs). We utilize sql mail heavily. When exchange has problems (i.e. dies and must be rebooted), it does something to some (not all) running sql mail sessions. The next time a sql mail tries to run, it gets 'lost'. Jobs don't fail. They just stay in an 'executing' (if the mail is in the bulk of the job) or 'performing completing actions' (if the mail is a notification at the end of a job) state. (they usually can't be cancelled, either, because the thread is lost.)

    So, my problem is this: HOW do I check for a server that has this problem? Besides looking through every job every day (and committing hari kari in about a week)... I have never heard of anything that checked for 'stalled' jobs - i.e. jobs that "should probably be done by now, but aren't"... any ideas?

    (and I should mention that I already stop/start mail daily to try to avoid the situation... sigh...)

  • I'm guessing you could use "sp_help_job" SP and look at the "current_execution_status" column to determine the current status of a job. Of course I have no idea what the status might be for 'performing completing actions'. 'Executing' is 1 and "Not Running" is 4.

    You might also be able to get the average run time for a job for the job history tables, and the determine if a job has run to long, and report.

    Sorry no real help here.

    Gregory Larsen, DBA

    If you looking for SQL Server Examples check out my website at http://www.geocities.com/sqlserverexamples

    Gregory A. Larsen, MVP

  • I had coded (initially in my nightly jobs check routine) to go against MSDB and check for status and average time a job usually takes, but found 2 problems:

    1) our servers are not all synched with respect to time, so I'd have to run on each server

    2) MSDB is not instantly updated with status - there is some sort of delay - so spurious errors, espeically with frequently run jobs - overload the reports

  • Im trying to think through the best way to catch this problem too. My current thought (which means no real code!) is to poll from a different server, check job state, compare run time with average run time from past 3 jobs runs, see if its way above/below the average. Something like that anyway.

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • Hi Betsy,

    Sync'ing time is pretty easy to accomplish whether you have NT 4.0 or Windows 2000 in your environment. Take a look at the net time command. Here are some basic resources (they are fresh for me because I've had to deal with this issue just today):

    How to Set Up And Synchronize with Domain Time Source Servers (NT)

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;131715

    NET TIME /Domain Will Not Sync Time with Domain Time Source Server

    http://support.microsoft.com/default.aspx?scid=kb;en-us;193825

    The Windows Time Service (Windows 2000)

    http://www.microsoft.com/windows2000/docs/wintimeserv.doc

    As far as determining whether a job is hanging, you could have each job write to a table when it starts (say Job Step 1). The last job step would be to either write to the same table saying it's completing or instead get the time on the initial start record, delete the record, calculate the difference, and store that statistic in another table. You'd also want to set each job step to go to a fail step where it does the clean-up but somehow marks the job as failed.

    If you query the status table, you should be able to tell what jobs are marked as currently running and be able to set up some thresholds where after a certain amount of time you flag the job.

    K. Brian Kelley

    http://www.truthsolutions.com/

    Author: Start to Finish Guide to SQL Server Performance Monitoring

    http://www.netimpress.com/shop/product.asp?ProductID=NI-SQL1

    K. Brian Kelley
    @kbriankelley

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply