Again and again as a DBA I am trying to say that automation is the way for a DBA to make the job easier. In many cases I have seen that DBAs are using a manual labor to verify if scheduled jobs are running, backup files are created or deleted, and so on. But it is a time for the automation of any ideas that may popup in your head. Especially in the cases when DBA is maintaining multiple servers and many databases. Let’s see an automated approach for the next situation with scheduled job. The job that is running on a server has its duration. Usually, the duration for a job step in a server is a permanent value. It means that if today job step is running within 20 minutes you may expect that tomorrow it will have the same duration e.g. 20 plus/minus few minutes. There are some situations when a job step is succeeding (not failed) but was running an unusually long time. It may happen for many reasons – network issues, high CPU utilization, some excessive blocking situations etc. And as a DBA you probably would like to know when it is occurs. To get an idea of such suddenly unusual situations I did setup the automated process. There are few ways to organize the process. It depended from the scheduled jobs and job’s settings. Let’s see how we can use the SQL Server system tables to setup an automated control and when job step duration was unusually long email a warning.
Logical idea of the process is to analyze the duration for the job step and find the difference between average running time and the duration the job step had. To make a clear shut, we should exclude unusually long running instances of the job step from previous executions. In many cases it is more practical to analyze each step separately from the job itself. In addition, unusually long running behavior should be defined as well. Based on my practical experience, I do consider job duration time unusually long if it is about 20 - 30% over the average time. But if job step duration is short (I do identify short as less than 1 minute) then check threshold should be bumped to 100%.
For example, if job step runs 30 seconds then warning flag should be raised only if duration more than 60 seconds. In case of job step duration is 200 seconds a warning should be raised when duration is over 260 seconds. For the long running job steps (over 30-45 minutes) level of awareness may be lowered to 20% (sometime even 10-15 percent).
Let’s see the code for the logic. First, create a table to keep all running job steps historical information in convenient format with some additional flags.
Create table DBA_JOB_STEPS_HISTORY ( history_id int identity (1,1) Primary key, Insert_DT datetime default (getdate()), instance_id int, -- system job step history unique identifier job_id uniqueidentifier, -- Unique ID of the job job_step_id int, -- step number in the job job_nm sysname, -- Name of the job step_nm varchar(128), -- name of the step last_run_date datetime, -- Date and time the job last started execution duration int, -- time job run in seconds long_running_flag char(1) -- flag to identify if long running this time )
Before inserting any information into the table, verification process will be conducted and based on the result outcome an email will be sent to the recipients.
CREATE procedure dbo.P_CHECK_JOBS_RUN_TIME @recipients varchar(100) as BEGIN SET NOCOUNT ON declare @message varchar(4000), @instance_id INT, @flag_exists int, @subject varchar(80), @long_running_flag char(1), @minid int, @maxid int, @job_id uniqueidentifier, @job_nm varchar(128),@step_id int, @avgduration int, @duration int, @step_nm varchar(128), @last_run_date datetime DECLARE @allactivejobs table (instance_id int, job_id uniqueidentifier, job_step_id int, job_nm varchar(128), step_nm varchar(128), last_run_date datetime,duration int, long_running_flag char(1), tid int identity(1,1) ) SELECT @instance_id = max(instance_id) from DBA_JOB_STEPS_HISTORY set @instance_id = IsNull(@instance_id,0) set @message = 'Unusually Long Run Steps: ' set @flag_exists = 0 set @subject = 'Unusually Long Run Steps on server ' + @@servername set @long_running_flag = 'N' insert into @allactivejobs ( instance_id, job_id , job_step_id, job_nm, step_nm, last_run_date, duration, long_running_flag) select soh.instance_id,sj.job_id, sos.step_id, sj.name ,sos.step_name, CAST ( Left(cast(run_date as varchar(8)),4) + '/' + substring(cast(run_date as varchar(8)), 5,2) + '/' + Right(cast(run_date as varchar(8)), 2) + ' ' + cast( ((run_time/10000) %100) as varchar ) + ':' + cast( ((run_time/100) %100) as varchar ) + ':' + cast( (run_time %100) as varchar ) as datetime), ( run_duration % 100 )+ -- seconds (((run_duration/100) % 100 ) * 60) + -- minutes in seconds (((run_duration/10000) % 100 ) * 3600), -- hours in seconds 'N' from msdb..sysjobs sj inner join msdb..sysjobsteps sos on sos.job_id = sj.job_id inner join msdb..sysjobhistory soh on soh.job_id = sos.job_id and soh.step_id = sos.step_id where soh.instance_id > @instance_id and soh.run_status = 1 -- successful jobs order by sj.name,sos.step_name select @minid = min(tid),@maxid = max(tid) from @allactivejobs while (@minid <= @maxid) begin set @long_running_flag = 'N' select @job_id = job_id, @job_nm = job_nm, @step_id = job_step_id, @step_nm = step_nm, @duration = duration, @last_run_date = last_run_date from @allactivejobs where tid = @minid -- exclude checks the first 2 times IF ( (select count(*) from DBA_JOB_STEPS_HISTORY where job_id = @job_id and job_step_id = @step_id) >= 2 ) BEGIN -- define average run time excluding unusually long running instances select @avgduration = avg(duration) from DBA_JOB_STEPS_HISTORY where job_id = @job_id and job_step_id = @step_id and @long_running_flag = 'N' set @avgduration = IsNull(@avgduration,1) IF (@avgduration < 60) begin IF (@duration > (@avgduration * 2) ) begin select @message = @message + ' ' + @job_nm + ': ' + @step_nm set @flag_exists = 1 set @long_running_flag = 'Y' end end IF (@avgduration > 60 and @avgduration < 1800) begin IF (@duration > (@avgduration * 1.3) ) begin select @message = @message + ' ' + @job_nm + ': ' + @step_nm set @flag_exists = 1 set @long_running_flag = 'Y' end end IF (@avgduration > 1800) begin IF (@duration > (@avgduration * 1.2) ) begin select @message = @message + ' ' + @job_nm + ': ' + @step_nm set @flag_exists = 1 set @long_running_flag = 'Y' end end IF (@long_running_flag = 'Y') begin update @allactivejobs set long_running_flag = 'Y' where tid = @minid end end select @minid = @minid + 1 end IF (@flag_exists = 1) begin exec master..xp_sendmail @recipients = @recipients, @subject = @subject, @message = @message end insert into DBA_JOB_STEPS_HISTORY ( instance_id, job_id , job_step_id, job_nm, step_nm, last_run_date, duration, long_running_flag) select instance_id, job_id , job_step_id, job_nm, step_nm, last_run_date, duration, long_running_flag from @allactivejobs SET NOCOUNT OFF END
I did setup clean up of the table DBA_JOB_STEPS_HISTORY as one of the steps in my weekly job based on 30 days of historical information.
Delete from DBA_JOB_STEPS_HISTORY Where datediff(dd, Insert_DT, getdate()) > 30
Additional check may need to be established, especially if job is running unusually long time during 2 first runs. Usually, I am manually controlling any new job created on a SQL Server within the first week or first 2-3 times the job runs (whatever happened first). In many other cases table DBA_JOB_STEPS_HISTORY may help with statistical information about jobs scheduled on the SQL Server. Some other issues may arise when job steps are rearranged or some steps are deleted because the column job_step_id in table msdb..sysjobsteps will change the its value. It is not very often jobs are changed in production server but a change may create some confusion. The stored procedure above can be modified to utilize a step name instead of a step id, but there is a possibility to get the same problem if a step name is changed. The approach of using a step name, probably safer because we less likely changing job step names but more often rearranging the steps.
In addition, changes can be done to help analyze those job steps which are still running but duration of the step is already unusually long. It can be achieved by adding additional criteria to the select statement for the field run_status.
Simply exchange line
and soh.run_status = 1 -- successful jobs
with line
and ( soh.run_status = 1 or soh.run_status = 4 ) -- successful and in progress jobs
In this case you may have an issues with average duration, and may need to exclude the rows for the steps with run_status = 4 (in progress) from insertion to the table DBA_JOB_STEPS_HISTORY
Conclusion
This article shows you the way to identify and automate a control for an unusually long running job steps. You can summarize all the steps and create a procedure for a whole job or add some additional columns to the table DBA_JOB_STEPS_HISTORY and make the code more sophisticated. Same process can be extended to the degree where control for the long running processes will be provided from one centralized location. But it will require dynamic SQL or use of VBS code.