February 12, 2003 at 12:34 pm
I am having a confusing situation here - I manage about 30 servers with about a google of jobs betwixt them. I have a few overlord jobs in place to check each of these servers for failed jobs on a daily basis. Works ok for the most part.
However.
I've come across a problem recently that I haven't seen since 6.5. (we're mostly a 7.0 shop with a few 2000 installs). We utilize sql mail heavily. When exchange has problems (i.e. dies and must be rebooted), it does something to some (not all) running sql mail sessions. The next time a sql mail tries to run, it gets 'lost'. Jobs don't fail. They just stay in an 'executing' (if the mail is in the bulk of the job) or 'performing completing actions' (if the mail is a notification at the end of a job) state. (they usually can't be cancelled, either, because the thread is lost.)
So, my problem is this: HOW do I check for a server that has this problem? Besides looking through every job every day (and committing hari kari in about a week)... I have never heard of anything that checked for 'stalled' jobs - i.e. jobs that "should probably be done by now, but aren't"... any ideas?
(and I should mention that I already stop/start mail daily to try to avoid the situation... sigh...)
February 12, 2003 at 1:06 pm
I'm guessing you could use "sp_help_job" SP and look at the "current_execution_status" column to determine the current status of a job. Of course I have no idea what the status might be for 'performing completing actions'. 'Executing' is 1 and "Not Running" is 4.
You might also be able to get the average run time for a job for the job history tables, and the determine if a job has run to long, and report.
Sorry no real help here.
Gregory Larsen, DBA
If you looking for SQL Server Examples check out my website at http://www.geocities.com/sqlserverexamples
Gregory A. Larsen, MVP
February 12, 2003 at 1:49 pm
I had coded (initially in my nightly jobs check routine) to go against MSDB and check for status and average time a job usually takes, but found 2 problems:
1) our servers are not all synched with respect to time, so I'd have to run on each server
2) MSDB is not instantly updated with status - there is some sort of delay - so spurious errors, espeically with frequently run jobs - overload the reports
February 12, 2003 at 6:48 pm
Im trying to think through the best way to catch this problem too. My current thought (which means no real code!) is to poll from a different server, check job state, compare run time with average run time from past 3 jobs runs, see if its way above/below the average. Something like that anyway.
Andy
February 12, 2003 at 9:03 pm
Hi Betsy,
Sync'ing time is pretty easy to accomplish whether you have NT 4.0 or Windows 2000 in your environment. Take a look at the net time command. Here are some basic resources (they are fresh for me because I've had to deal with this issue just today):
How to Set Up And Synchronize with Domain Time Source Servers (NT)
http://support.microsoft.com/default.aspx?scid=kb;EN-US;131715
NET TIME /Domain Will Not Sync Time with Domain Time Source Server
http://support.microsoft.com/default.aspx?scid=kb;en-us;193825
The Windows Time Service (Windows 2000)
http://www.microsoft.com/windows2000/docs/wintimeserv.doc
As far as determining whether a job is hanging, you could have each job write to a table when it starts (say Job Step 1). The last job step would be to either write to the same table saying it's completing or instead get the time on the initial start record, delete the record, calculate the difference, and store that statistic in another table. You'd also want to set each job step to go to a fail step where it does the clean-up but somehow marks the job as failed.
If you query the status table, you should be able to tell what jobs are marked as currently running and be able to set up some thresholds where after a certain amount of time you flag the job.
K. Brian Kelley
http://www.truthsolutions.com/
Author: Start to Finish Guide to SQL Server Performance Monitoring
http://www.netimpress.com/shop/product.asp?ProductID=NI-SQL1
K. Brian Kelley
@kbriankelley
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply