Kendra’s query was a good starting point, and I used most of it in the first CTE shown below. This query basically looks at msdb.dbo.sysjobhistory and msdb.dbo.sysjobactivity, joining them on the job_id, which is the PK. However, we are only looking at steps, which is anything with a step_id > 0. The step_id = 0 is for the overall job.Recently a customer was asking for a way to alert on job steps that failed, but the job succeeded. They really wanted a custom metric for SQL Monitor, which I submitted, but this post looks at the query in general, trying to determine if a job step failed.
Note: Let me start by noting that this is based on work by Kendra from her post, SQL Agent Jobs: Checking for failed steps at the end of a job.
Based on Kendra’s query, I looked through what is happening in msdb.dbo.sysjobhistory and msdb.dbo.sysjobactivity. In Kendra’s query, she is looking for a specific job, but I wanted all jobs. This lead me to build a CTE that queries for the data.
As you can see in the code below, I use most of Kendra’s query to join these two tables together. First, we look for steps, so step_id != 0, and we look for failures. A status of 0 is a failure here, where 1 is success.
WITH cteJobStep (Job_ID, Step_Name)
AS (SELECT jh.job_id,
jh.step_name AS JobStepName
FROM msdb.dbo.sysjobhistory jh
INNER JOIN msdb.dbo.sysjobactivity ja
ON jh.job_id = ja.job_id
WHERE jh.run_status = 0 --step failed
AND jh.step_id != 0
-- and jh.job_id = CONVERT(uniqueidentifier, '8C673935-F8C1-4E7D-94D3-1F3CAE50D7DC')
AND --this block ensures that we just pull information from the most recent job run
(
-- The start time of the step, converted to datetime
CONVERT(DATETIME, RTRIM(jh.run_date)) + (jh.run_time * 9 + jh.run_time % 10000 * 6 + jh.run_time % 100 * 10)
/ 216e4 >= ja.start_execution_date -- the time the job last started
))
SELECT COUNT(*) AS Failures
FROM cteJobStep c
Next, I kept most of Kendra’s query, but I commented out the line that limits this to a specific ID. I just want all step failures. I did keep the part that only checks the latest execution of the job.
The outer query just counts these up and returns a number. This lets me know how many job steps have failed during their latest execution.
This is a good first step, but this is something I could add in SQL Monitor as a custom metric or in any tool for alerting. When I have steps failing, I might want to know.