One evening I started receiving a flood of alerts from almost every single database server for one of my clients on a SCOM console. The error message appeared as follows:
"MSSQL 2016: SQL Server cannot authenticate using Kerberos because the Service Principal Name (SPN) is missing, misplaced, or duplicated."
Before I started the technical investigation, first I went to the SCOM team and asked about the reason for these numerous alerts. I assumed it could be due to some SCOM maintenance activity that happened the previous day. However, this was not the cause.
Now I had to find and fix the issue. As per the error message, it clearly says that the SPN is missing, misplaced or duplicated on the servers. So I started looking at the SPN's on each instance with the below command:
SetSPN -L ZZZZ\SRV001 <domain\service account>
The service account is the domain account where the SQL services are running.
If the SPN is missing that means either SPN's are not registered or the SQL services are not running with a domain account. So, in the former case. you only need to register the SPN. For the latter case, you first need to change the service account with a domain account and then register the SPN with the below commands. To execute these commands, you should be a member of the Domain Admin group. If not, I would suggest acting the Active Directory administrator for help.
Setspn -A MSSQLSvc/<servername>.zz.group.com:1433 ZZZZ\SRV11111 Setspn -A MSSQLSvc/<servername>.zz.group.com ZZZZ\SRV11111 Setspn -A MSSQLSvc/<servername>:1433 ZZZZ\SRV11111 Setspn -A MSSQLSvc/<servername> ZZZZ\SRV11111
I was able to fix 90 percent of my errors messages by registering the SPN. But here comes the true problem. On few servers, I was still getting alerts where the SPN was already registered and even the service account was running with domain account, below is an example. Before I started any investigation, as a normal human tendency I compared the below result with the working server. I found some difference in both results, TCP port 1433 seemed missing with host name. So I thought to add that missing value or completely re-register the SPNs to be same as working server.
MSSQLSvc/BRRXXXXX.zz.group.com MSSQLSvc/BRRXXXXX MSSQLSvc/BRRXXXXX.zz.group.com:BRRXXXXX MSSQLSvc/BRRXXXXX.zz.group.com:1433 ---------------------------------------- MSSQLSvc/BRRXXXXX [1433?] It should be as: MSSQLSvc/BRRXXXXX:1433
On the other hand, I looked at the application logs and found below repeatedly error messages. Its still pointing some issue with failure of SPN registration.
The SQL Server Network Interface library could not register the Service Principal Name (SPN)[MSSQLSvc/BRRXXXXX.zz.group.com:1433 ] for the SQL Server service. Windows return code: 0x2098, state: 15. Failure to register a SPN might cause integrated authentication to use NTLM instead of Kerberos. This is an informational message. Further action is only required if Kerberos authentication is required by authentication policies and if the SPN has not been manually registered.
I requested AD team to de-register and provided the following scripts to execute it on Server BRRXXXXX:
Setspn -D MSSQLSvc/BRRXXXXX.zz.group.com:1433 ZZZZ\SRV11111 Setspn -D MSSQLSvc/BRRXXXXX ZZZZ\SRV11111 Setspn -D MSSQLSvc/BRRXXXXX.zz.group.com ZZZZ\SRV11111 Setspn -D MSSQLSvc/BRRXXXXX.zz.group.com:BRRXXXXX ZZZZ\SRV11111
Also, provided the following scripts to re-register SPN to AD team
Setspn -A MSSQLSvc/BRRXXXXX.zz.group.com:1433 ZZZZ\SRV11111 Setspn -A MSSQLSvc/BRRXXXXX.zz.group.com ZZZZ\SRV11111 Setspn -A MSSQLSvc/BRRXXXXX:1433 ZZZZ\SRV11111 Setspn -A MSSQLSvc/BRRXXXXX ZZZZ\SRV11111
But all in vain it made me stand at the same place from where I took start. Even after re-registering SPNs as same as the working server, I was still having the same issue.
To investigate further, I install the tool ProcMon [download] which shows real-time file system, registry and process/thread activity. Now, I need to analyze the data by looking at each single process and identify the parameter/script causing an issue. This part is a bit tricky.
Fortunately, I'm able to found that the below parameter which was causing issue when workflow runs.
"C:\Windows\system32\cscript.exe" /nologo "GetSQL2016SPNState.vbs" BRRXXXXX.zz.group.com BRRXXXXX BRRXXXXX " ZZZZ\SRV11111" MSSQL$BRRXXXXX LDAP
I try to run the script manually via CMD using the domain account, <ZZZZ\SRV11111>, or my ID as I am a sysadmin on the machine. the command gives me a positive result and appears to be working fine. I am relieved and can sense it could be a permission issue. Apparently, SCOM uses "nt authority\system" to access database.
Then I try executing the file with "nt authority\system" and this time the output is different. It is returning "Access is Denied". That means there is some permission issue with this account. I gave the sysadmin permission to "nt authority\system", the same as the service account, and guess what, it starts working fine. Well, that's a separate topic whether to give SYSADMIN permission to "nt authority\system" or not. I grant the sysadmin permission and alerts stop appearing at the console. Later we make the policy to have sysadmin role for "nt authority\system" on each machine.
My conclusion: NT Authority\system account should have sysadmin permission to SCOM monitoring work properly.