Introduction
In an earlier post I walked through the diagnosis of a problem with TokenAndPermUserStore bloat in SQL 2005. In this post I will go through what I did to work through the issues and the results of those efforts.
Triage
Once I understood that the issue was TokenAndPermUserStore bloat, the first step was to stop the bleeding. The quick fix I chose was to create a job that ran every 30 minutes to issue DBCC FREESYSTEMCACHE('TokenAndPermUserStore'). Once the job was in and running, the random server slowdowns immediately stopped. At this point I knew I was on the right track even if all I had was a band-aid.
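For anyone who wants to set up the same band-aid, here is a rough sketch of what such a SQL Agent job could look like in T-SQL. The job, step and schedule names are made up for illustration, and this is not the exact job I ran; the msdb procedures themselves are the standard SQL Agent ones.

```sql
USE msdb;
GO

-- Hypothetical job name; pick whatever fits your naming standards.
EXEC dbo.sp_add_job
    @job_name = N'Free TokenAndPermUserStore';

-- Single T-SQL step that clears the security token cache.
EXEC dbo.sp_add_jobstep
    @job_name      = N'Free TokenAndPermUserStore',
    @step_name     = N'Free cache',
    @subsystem     = N'TSQL',
    @database_name = N'master',
    @command       = N'DBCC FREESYSTEMCACHE(''TokenAndPermUserStore'');';

-- Daily schedule that repeats every 30 minutes.
EXEC dbo.sp_add_schedule
    @schedule_name        = N'Every 30 minutes',
    @freq_type            = 4,   -- daily
    @freq_interval        = 1,   -- every 1 day
    @freq_subday_type     = 4,   -- subday unit: minutes
    @freq_subday_interval = 30;  -- every 30 minutes

EXEC dbo.sp_attach_schedule
    @job_name      = N'Free TokenAndPermUserStore',
    @schedule_name = N'Every 30 minutes';

-- Target the local server so the Agent actually runs the job.
EXEC dbo.sp_add_jobserver
    @job_name = N'Free TokenAndPermUserStore';
GO
```

Note the straight single quotes around the cache name; curly quotes pasted from a web page will cause a syntax error.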
Research
There is not a lot of good documentation out there on TokenAndPermUserStore bloat. I am guessing not too many people are running into it. This post is the best resource I have found. I had already installed all of the suggested service packs and cumulative updates so that left me with trace flags and fixing the underlying code.
While fixing the code seemed like the most obvious fix, some of the code that is causing the problem is optimal for the purpose. Of all the alternatives that have been tried, the current code is the best way to get the job done. Going back and asking for resources to change proven code to deal with what I see as a SQL Server problem just did not seem right.
Ruling out a code fix left me with two options: let the cache clearing job keep running every 30 minutes and hope nobody ever turns it off, or add a trace flag to control the behavior of the server. I ruled out leaving the job running and set out to learn everything I could about -T4618. What I quickly learned is that there really is not that much information out there on the subject. What I was able to find through various sources is that -T4618 limits the number of entries in the TokenAndPermUserStore to 1024. I know from watching the counts in the cache that every time I free the cache it quickly shoots up to about 5000 entries then grows slowly from there, so -T4618 on its own would cause a huge amount of churn and CPU use. I then found that to get around the 1024 entry limitation I could also apply -T4610, which together with -T4618 limits the cache to 8192 entries. 8192 sounded pretty good given the behavior I was seeing.
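If you want to watch the entry counts yourself, a query along these lines against the memory cache DMV should do it. This is a sketch rather than the exact query I used, and the single_pages_kb / multi_pages_kb column names are the SQL 2005 / 2008 ones, so expect to adjust them on later versions.

```sql
-- Current number of entries and approximate size of the security token cache.
SELECT name,
       type,
       entries_count,
       single_pages_kb + multi_pages_kb AS cache_kb
FROM sys.dm_os_memory_cache_counters
WHERE name = 'TokenAndPermUserStore';
```

Running it right after freeing the cache and then again every few minutes gives a feel for how fast the store refills and where it levels off, which is the number to compare against the 1024 and 8192 limits.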
I did also run across this article on using -T4621 to customize the size of the Security Cache, but decided it was not for me due to the complexities of picking a good number, converting it to binary and editing the registry on the cluster. In SQL 2008 the value can be changed with an sp_configure call (no trace flag), making it a much more viable option. For my SQL 2005 server I saw -T4621 as something to try if -T4618 and -T4610 did not work the way I expected.
Turning on the Trace Flags
I found a number of articles that said you can turn -T4618 and -T4610 on while the server is running. Not true. You have to add them as startup parameters using SQL Server Configuration Manager and then restart your server, which raised the stakes somewhat if I guessed wrong on the setting.
I decided I was comfortable enough with my estimates, so I pushed all of the paperwork through and, during a slack time on the system, added -T4618;-T4610 to the list of startup parameters on both cluster nodes. I then took the cluster resource offline and brought it back online. It all happened so fast that it was almost hard to tell I did anything. The only real evidence, besides the log rolling over at an odd time, is an entry in the SQL log that says “-T4618”, another that says “-T4610” and finally one that says “TokenAndPermUserStore Memory Quota set by -T4618 as 8192.”
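Besides reading the log, a quick way to double check that both flags took effect after the restart is DBCC TRACESTATUS. This is a small sketch I would suggest, not something from the original change procedure.

```sql
-- Report the status of both trace flags; -1 asks for global status.
DBCC TRACESTATUS (4618, 4610, -1);
```

Each flag should come back with Status = 1 and Global = 1 if it was picked up from the startup parameters.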
The Results
The primary application that uses this SQL Server is a web site. It does pretty decent volume, so over the years the business has invested in some pretty cool monitoring tools. The tools were instrumental in helping identify and monitor the problem, telling me when I needed to spring into action to prevent lost sales. I had sort of a love / hate relationship with these tools because every time I said the problem was fixed they disagreed.
The relationship I have with those tools has now changed somewhat. I now love them. The monitors show that the web site is now running 20% faster than it was even when I thought it was performing well. To say I am happy with the results is an understatement.
I am blown away by how much fixing the TokenAndPermUserStore bloat helped overall application performance.