How necessary is that SQL Server Agent job that you have running on the server? I ask that question of my clients on a routine basis.
Sometimes, I will ask that question as a routine part of a health check for the server. Others, it might be due to a migration or upgrade that is required for the server. Generally, the answer to the question will be one of two things: a) “Yes it is necessary.” or b) “What is that job doing?”.
Believe it or not, both answers will typically spawn more questions. You see, I don’t usually ask that question unless the job is failing on a regular basis. You wouldn’t believe how many jobs exist out there that are scheduled and just fail every time as well.
When I encounter a situation like this, it means it is time to have a discussion. In order to have that discussion, one needs to have done a bit of homework in order to better understand the situation. For me, part of this homework involves running the numbers and figuring out the frequency of the job’s failure or success.
Data Gathering
For me, I like to understand how often a job has executed and what is the frequency of failure for that quantity of executions. If I see a job that has not succeeded successfully in 60 consecutive executions, it is probably a safe bet that the job is not needed. Why? Well, if nobody has noticed the job hasn’t been working for that long, the likelihood of the job providing any use to the business is slim to none. In this case, I would present a case to the business as to why it should be removed.
But, how do I get to that point? Well, you could go through the job history for each job one by one and run some manual analytics. Or, you could take advantage of a script. I prefer the script route because it is faster, more reliable and a lot less mundane.
WITH jobhist AS (SELECT jh.job_id , jh.run_date , jh.run_status , jh.step_id , ROW_NUMBER() OVER (PARTITION BY jh.job_id, jh.run_status ORDER BY jh.run_date DESC) AS rownum FROM dbo.sysjobhistory jh WHERE jh.step_id = 0) , agglastsixty AS (SELECT sjh.job_id , sjh.run_status , COUNT(*) AS RunCount FROMdbo.sysjobhistory sjh INNER JOIN jobhist jh ON jh.job_id = sjh.job_id AND jh.run_status = sjh.run_status AND jh.step_id = sjh.step_id WHERECONVERT(DATE, CONVERT(VARCHAR, sjh.run_date)) > DATEADD( d , -60 , CONVERT(DATE, CONVERT(VARCHAR, jh.run_date)) ) AND jh.rownum = 1 GROUP BYsjh.job_id , sjh.run_status) , aggtotal AS (SELECT sjh.job_id , sjh.run_status , COUNT(*) AS RunCount FROMdbo.sysjobhistory sjh INNER JOIN jobhist jh ON jh.job_id = sjh.job_id AND jh.run_status = sjh.run_status AND jh.step_id = sjh.step_id WHEREjh.rownum = 1 GROUP BYsjh.job_id , sjh.run_status) SELECT j.name AS JobName , sc.name AS CategoryName , sp.name AS OwnerName , j.owner_sid , j.date_created , j.enabled , CONVERT(DATE, CONVERT(VARCHAR, oa.run_date)) AS RunDate , CASE oa.run_status WHEN 0 THEN 'Failed' WHEN 1 THEN 'Succeeded' WHEN 2 THEN 'Retry' WHEN 3 THEN 'Canceled' WHEN 4 THEN 'In Progress' END AS run_status , als.RunCount AS Last60StatusCount , agt.RunCount AS TotalStatusCount FROMdbo.sysjobs j LEFT OUTER JOINjobhist oa ON oa.job_id = j.job_id LEFT OUTER JOIN agglastsixty als ON als.job_id = oa.job_id AND als.run_status = oa.run_status LEFT OUTER JOIN aggtotal agt ON agt.job_id = oa.job_id AND agt.run_status = oa.run_status INNER JOINsys.server_principals sp ON j.owner_sid = sp.sid INNER JOINdbo.syscategories sc ON j.category_id = sc.category_id WHEREoa.rownum = 1 ORDER BYRunDate DESC;
Running that script against my sandbox, I may see something such as the following.
Here you will note that the “wtf” job has two entries. One entry for “Succeeded” (in green) and one entry for “Failed” (in red). Each row receiving counts for number of executions.
This is the type of information I can use in a meeting to discuss with the business. This is no longer a discussion of opinion, but rather one that is now based on facts and data. It now becomes very easy to demonstrate to the business that a job has failed 60/60 times and nobody noticed it or cared enough about the consistent failures to do anything about it. Imagine if the failing job happens to be the database backups. I wonder what the action items for that job failure might include.
The Wrap
An important part of any DBAs job is to ensure database related jobs are running prim and proper. Sometimes that just doesn’t happen. When jobs are being overlooked, it is useful to be able to gather data related to consistency of job success or failure. This script will help you in your investigation efforts. In addition, I also recommend this article in your agent job audit efforts.
This has been another post in the back to basics series. Other topics in the series include (but are not limited to): Backups, backup history and user logins.