Advanced Troubleshooting Week at SQL University, Lesson 1
Welcome to Advanced Troubleshooting Week at SQL University. Once again, I’m your guest professor for this course and today I will be talking about Advanced Troubleshooting. Being a Harry Potter fan, I'm subtitling this week's lessons as @SQLSoldier's Defense Against the Dark Arts. In this case, the dark arts would be poor performance. Unfortunately, I won't have any magical tricks to teach you. Just solid techniques for finding performance problems and fixing them. For this week's lessons, I'm going to try to adopt the Agile development method. Instead of doing 3 large posts, I'm planning to submit shorter, daily posts. So check back everyday from today through Saturday for a new post. I may even have something for Sunday if I wake up early enough to get it done before my wife wakes up. We'll call that one extra credit.
|
Doing Battle With Nargles
Detecting a Bad SPN
Okay, we all know the error that this causes. Even saying SSPI context is bad juju, and I feel dirty when I talk about it, but it has to be said:
Cannot generate SSPI Context
This is a generic error that can be caused by 1000 different things, but experience has shown me that 98% of the time, it is caused by 1 of 2 problems. Either the server is not able to connect to a domain controller to establish the SSPI context or there is an invlaid SPN for the SQL Server service in AD. To verify if the domain controller conenction may be a problem, I check teh System event log. There may be errors stating that SQL Server was not able to connect to a domain controller or kErberos errors. The error logged may have ocurred several days before you see the SSPI context errors. 99% of the time, this error is resolved by rebooting the server.
I want to talk about the second cause. An invlaid SPN will cause this error. No SPN at all will NOT cause this error. That bears repeating: No SPN at all will NOT cause this error. Adding an SPN to AD will NOT fix this error. Setting up kerberos or constrained delegation when tehre is an invalid SPN will NOT fix this error. Starting to see a pattern?
To summarize the above, there area lot of things that people will recommend trying that do not fix this error. If anything, they complicate trying to fix it. So what does fix this error? Well, there is only one way to fix an SSPI context error caused by an invalid SPN. Delete the invalid SPN.
It is actually quite simple to detect that your SPN is invalid as long as you are using a domain account for your SQL Server service (which you should be). Simply check for SPN's that are registered for your service account and for the server. If you do not find matching SQL Server SPN's for both, then there is an invlaid SPN. If the SPN's match, but the server returns extra SPN's that are not in the service account list, there is an invlaid SPN. And lastly, if everything matches, checking the spelling carefully as I have been bitten by misspelled manually created SPN's several times.
The tool I use for checking for SPN's is setspn from a command window. Let me demonstrate:
Example 1:
Let's say that our service account is MyDomain\SQLSrvrSvc and the name of my SQL Server is ProdSQL02. My check of the SPN's might look like this:
C:\users\sqlsoldier>setspn -l MyDomain\SQLSrvrSvc
Registered ServicePrincipalNames for CN=SQL Server Service Acct,OU=UserAccounts,DC=MyDomain,DC=com:
C:\users\sqlsoldier>setspn -l ProdSQL02
Registered ServicePrincipalNames for CN=ProdSQL02,OU=Workstations,OU=Machines,DC=MyDomain,DC=com:
TERMSRV/ProdSQL02
TERMSRV/ProdSQL02.MyDomain.com
HOST/ProdSQL02
HOST/ProdSQL02.MyDomain.com
MSSQLSvc/ProdSQL02:1433
MSSQLSvc/ProdSQL02.MyDomain.com:1433
Do you see the problem? No? When I checked for SPN's for the server, 2 SPN's for SQL Server show up. When I checked for SPN's for the service account, no SPN's for that server show up. This means that the SPN's configured for SQL Server are not owned by that service account. Most likely, the service account was changed recently, and these SPN's were left over from teh previous service account. This happens frequently when changing the service account from Local Service to a domain account (more on this later). This will generate the SSPI context error, and connections will fail. Delete the SPN's.
Example 2:
Let's say that our service account is MyDomain\SQLSrvrSvc and the name of my SQL Server is ProdSQL02. My check of the SPN's might look like this:
C:\users\sqlsoldier>setspn -l MyDomain\SQLSrvrSvc
Registered ServicePrincipalNames for CN=SQL Server Service Acct,OU=UserAccounts,DC=MyDomain,DC=com:
MSSQLSvc/ProdSQL02:1433
C:\users\sqlsoldier>setspn -l ProdSQL02
Registered ServicePrincipalNames for CN=ProdSQL02,OU=Workstations,OU=Machines,DC=MyDomain,DC=com:
TERMSRV/ProdSQL02
TERMSRV/ProdSQL02.MyDomain.com
HOST/ProdSQL02
HOST/ProdSQL02.MyDomain.com
MSSQLSvc/ProdSQL02:1433
MSSQLSvc/ProdSQL02.MyDomain.com:1433
Do you see it this time? There are matching SPN's for the account without the fully qualified domain name (FQDN), but the SPN for the non-FQDN does not have a match. In this scenario, some users will connect fine, and some may not. Specifically, anyone using the FQDN will fail with the SSPI Context error, but users not using the FQDN will connect successfully. Delete the SPN that does not match or just delete all of them if you are not using them for establishing Kerberos conenctions.
Example 3:
Let's say that our service account is MyDomain\SQLSrvrSvc and the name of my SQL Server is ProdSQL02. My check of the SPN's might look like this:
C:\users\sqlsoldier>setspn -l MyDomain\SQLSrvrSvc
Registered ServicePrincipalNames for CN=SQL Server Service Acct,OU=UserAccounts,DC=MyDomain,DC=com:
MSSQLSvc/ProdSQL02:1433
MSSQLSvc/ProdSQL02.MyDoman.com:1433
C:\users\sqlsoldier>setspn -l ProdSQL02
Registered ServicePrincipalNames for CN=ProdSQL02,OU=Workstations,OU=Machines,DC=MyDomain,DC=com:
TERMSRV/ProdSQL02
TERMSRV/ProdSQL02.MyDomain.com
HOST/ProdSQL02
HOST/ProdSQL02.MyDomain.com
MSSQLSvc/ProdSQL02:1433
MSSQLSvc/ProdSQL02.MyDoman.com:1433
Do you see it this time? This one is a little trickier. I already mentioned it, so it's not that tricky. There are matching SPN's for both the FQDN and the non-FQDN, but there is still an invalid SPN. I'll give you a hint, users using the non-FQDN can connect successfully, but users using the FQDN get the SSPI Context error. Ah, now you go it!! Yes, the domain name is misspelled and therefore the SPN is invalid. User's are connecting via ProdSQL02.MyDomain.com.
What Causes a Bad SPN
Nargles of course, I already covered that. Other than Nargles, most invalid SPN's are caused by one of 3 things. Either the service account was changed from Local Service to a domain account incorrectly; the SPN was created manually, and the SPN was not deleted and regenerated; or it was manually created incorrectly (human error). I want to take a closer look at the first option because this seems to be the most common cause I deal with.
There is what I consider to be a bug in SQL Server Configuration Manager. If you change the service account of the SQL Server service and then restart the service so that the change takes affect, the existing SPN is not deleted. Local Service is a privileged account, and is allowed to create and delete its own SPN's in AD. As a result, if the service sars as Local Sysm, and SPN will be successfully created. If you then stop the service, the SPN will be deleted. If you change the service account from Local System without stopping the service first, the SPN may not be deleted. Now this is not a 100% scenario. Sometimes it will be deleted and sometimes not. If not, then the SPN will be invalid when the service restarts with a different account and users will get the SSPI Context error.
To fix this problem, you need someone who has sufficient permissions in AD to delete the account. In most production domains, this is not going to be you or anyone on your team. This means contacting whoever manages your domain or its domain entities. This of course means delays while you wait for them to make the change for you. Luckily, we know someone that has privileges to delete this SPN. He's right in front of us.
As I said above, Local System has permissions to create and delete its own SPN's. This SPN is owned by Local System. Here is a fix that is generally much faster than waiting for someone else to fix it for you. And if nobody can connect to the server anyway, restarting the service is not an issue as you are effetively down anyway. In one of the cases above where some users can connect and some can't, then you may be better of waiting for the deletion.
- To delete the SPN manually:
- Stop the SQL Server service
- Change the SQL Server service account back to the old account (Local System)
- Start the SQL Server service
- Stop the SQL Server service
- Change the service accounts back to the domain account
- Start the SQL Server service
- Start the SQL Server Agent service if it was originally running
That's it. The SPN should be deleted and connections should work now. To ward of Nargles (and prevent this error from reocurring), always follow these steps when changing the service account of the SQL Server service.
Looking forward to Lesson 2
Tomorrow, i want to talk about those terrible water demons known as Grindylows. You know, the ones that force SQL Server to perform a table/clustered index scan even though there are indexes that it could use. We'll dig into the scenarios where SQL Server may choose to perform a full scan instead of an index seek/scan and how to improve queries suffering from a grindylow attack.