SQL Server comes with a default SQL agent job installed (for most installations) to help manage the collection of system health data. I would dare say this job is ignored by most people and few probably even know it exists.
This topic is not new to me. You may recall a previous article I wrote entailing one type of failure and how to resolve that failure. That article can be found here.
I recently ran into a variant of the problem outline in that previous article that requires just a bit of a different approach. The errors turn out to be similar on the surface but really are different upon closer inspection.
Phantom Health Records
If you are unfamiliar with the topic, I recommend reading the previous article. Then after reading that article, maybe brush up a little bit on the SQL Agent. The failures we will be looking at are within the SQL Agent and come from the job called: syspolicy_purge_history.
For this latest bout of failure, I will start basically where I left off in the the last article. The job fails on a server and I have access to other servers of the same SQL version where the job works.
Before diving any further into this problem, let’s look at what the error is:
A job step received an error at line 1 in a PowerShell script.
The corresponding line is ‘set-executionpolicy RemoteSigned -scope process -Force’.
Correct the script and reschedule the job. The error information returned by PowerShell is:
‘Security error. ‘. Process Exit Code -1. The step failed.
Having the error in hand, and knowing that the job works elsewhere, my next logical step (again based on experience from the last article with this job) is to script the job from another server and use it to replace the job on the server where it fails. In principle this is an AWESOME idea.
Sadly, that idea was met with initial failure. As it turns out, the error remained exactly the same. This is good and unfortunate at the same time. Good in that I was able to confirm that the job was correctly configured with the following script in the job:
if ('$(ESCAPE_SQUOTE(INST))' -eq 'MSSQLSERVER') {$a = 'DEFAULT'} ELSE {$a = ''}; (Get-Item SQLSERVER:SQLPolicy$(ESCAPE_NONE(SRVR))$a).EraseSystemHealthPhantomRecords()
Since the step fails from SQL Server let’s see what else we can do to make it run. Let’s take that code and try it from a powershell ise. So, for giggles, let’s cram that script into powershell and see what blows up!
SRVR : The term 'SRVR' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again. At line:2 char:46
Now isn’t that a charming result! This result should be somewhat expected since the code I just threw into the ISE is not entirely powershell. If you look closer at the code, you will notice that it is using sqlcmd like conventions to execute a parameterized powershell script. Now, that makes perfect sense, right? So let’s clean it up to look like a standard PoSH script. We need to replace some parameters and then try again.
(Get-Item sqlserver:SQLPolicyMyServerDefault).EraseSystemHealthPhantomRecords()
This will result in the following (resume reading after you scratch your head for a moment):
Get-Item : Cannot find drive. A drive with the name 'sqlserver' does not exist. At line:1 char:2 + (Get-Item sqlserver:SQLPolicySPMS1Default).EraseSystemHealthPhantomRecords()
The key in this failure happens to be in the sqlserver. PoSH thinks we are trying to pass a drive letter when we are just trying to access the SQLServer stuff. Depending on your version of server, SQL Server, and PoSH you may need to do one of a couple different things. For this particular client/issue, this is what I had to try to get the script to work.
Import-Module “sqlps” -DisableNameChecking; (Get-Item sqlserver:SQLPolicymyserverDefault).EraseSystemHealthPhantomRecords()
If you read the previous article, you may notice this command looks very much like the command that was causing the problems detailed in that previous article. Yes, we have just concluded our 180 return to where we started a few years back. Suffice it to say, this is expected to be a temporary fix until we are able to update the system to PoSH 5 and are able to install the updated sqlserver module.
As is, this script is not quite enough to make the job succeed now. To finish that off, I created a ps1 file to house this script. Then from a new step (defined as a sqlcmd step type) in the syspolicy purge job, I execute that powershell script as follows:
powershell.exe "c:dbasyshealth.ps1"
Tada, nuisance job failure alert is resolved and the system is functioning again.
Conclusion
I dare say the quickest resolution to this job is to probably just disable it. I have seen numerous servers with this job disabled entirely for the simple reason that it fails frequently and just creates noise alerts when it fails. Too many fixes abound for this particular job and too few resolve the failures permanently.
I would generally err on the side of fixing the job. Worst case, you learn 1000 ways of what not to do to fix it.
Given this job is tightly related to the system_health black box sessions (sp_server_diagnostics and system_health xe session), I recommend fixing the job. In addition, I also recommend reading the following series about XE and some of those black box recorder sessions – here.