Sometimes being a DBA really stinks. You try to follow best practices in everything you do. You’re not shrinking databases on a daily basis. You’re not using the auto-close option on your databases. You know you’re not supposed to flush the procedure cache on a live server. Sometimes though, in spite of your best efforts, you find yourself elbow-deep someplace you don’t want to be. Grab a shovel, and I’ll tell you about a couple of those times in my recent past.
The topic for this month’s T-SQL Tuesday is automation. The assignment is to write about some aspect of automation. What options are there for automating tasks, when should you consider automation, share an example of something you’ve automated, etc… Just write something about automation and SQL Server. I’ve chosen to share a couple of examples of where I’ve used automation to break one of the sacred DBA commandments. And I’m OK with it.
As I’ve mentioned in a few earlier posts, the developers in my company have adopted nHibernate. Our applications have always leaned heavily toward an ad hoc workload, and with the spread of nHibernate, that lean is becoming more pronounced. In fact, we’re leaning so far that on SQL Server 2012 we tipped right over, forcing us to resort to forced parameterization. All these years of working with SQL Server and I’ve never had to use this option until now. Three cheers for nHibernate!
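For anyone who hasn’t had the pleasure, forced parameterization is set per database. A minimal sketch, with YourDatabase standing in for the real database name:

-- Hypothetical database name; forced parameterization is enabled per-database
ALTER DATABASE YourDatabase SET PARAMETERIZATION FORCED;

-- And to put things back the way they were:
ALTER DATABASE YourDatabase SET PARAMETERIZATION SIMPLE;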
Due to the use of forced parameterization, and the volume of ad hoc queries that hit our servers, we’ve seen an increase in bad query plans resulting from bad parameter sniffing. There are several techniques available for combating bad parameter sniffing, but when you can’t change the queries that are being submitted, your options are limited as a DBA. Bottom line is this: when your server is on its knees pleading for mercy because of a bad query plan, you need to eliminate that plan. Assuming your statistics are in good shape and the bad plan is truly caused by bad parameter sniffing, you need to get that plan out of the cache in order to make the server happy again.
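Before you blame parameter sniffing, it’s worth a quick sanity check on your statistics. A minimal sketch, with dbo.YourTable as a placeholder for the table behind the suspect query:

-- Check when each statistic on the table was last updated
-- (dbo.YourTable is a placeholder)
SELECT name,
       STATS_DATE(object_id, stats_id) AS last_updated
FROM sys.stats
WHERE object_id = OBJECT_ID('dbo.YourTable');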
That’s exactly what I and my fellow DBAs found ourselves doing frequently, sometimes in the middle of the night or on the weekend. This was fun for about a week, but then it got old, so I decided to automate it. Making things even more interesting is the fact that we’re running a mix of SQL versions, ranging from 2005 through 2012.
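With that mix of versions, any automation has to know what it’s running against. One way to branch on the engine version, sketched here with PRINT statements standing in for the real work:

-- ProductVersion looks like '11.0.3000.0'; the first part is the major version
-- (separate DECLARE/SET so this also runs on SQL 2005)
DECLARE @MajorVersion INT;
SET @MajorVersion = CAST(PARSENAME(CAST(SERVERPROPERTY('ProductVersion') AS VARCHAR(20)), 4) AS INT);

IF @MajorVersion >= 10
    PRINT 'SQL 2008 or later: can flush a single plan by plan_handle';
ELSE
    PRINT 'SQL 2005 or earlier: flushing means dumping the whole cache';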
Part Artform, Part Brain Surgery, Part Brute Force
So how does one go about getting a bad query plan out of the plan cache? Depending on the version of SQL you’re running, you have some options. All of them involve the DBCC FREEPROCCACHE command, with some subtle differences. On a modern version of SQL, 2008 or later, you can pass a specific plan handle to DBCC FREEPROCCACHE and remove just that single plan. On anything older than SQL 2008, it gets a little muddy: there’s no way to target a single plan, so your only option is to dump the ENTIRE cache. We’re told time and time again to never do this on a live production server because of the immediate performance hit that you’re going to suffer. It’s good advice, but if you weigh that performance hit against the hit caused by a really bad plan, sometimes flushing is the lesser of two evils.
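Here’s what the surgical approach looks like on SQL 2008 and later. This is a sketch, with the LIKE pattern ('%YourSuspectQuery%') as a placeholder for whatever text identifies your problem query:

DECLARE @PlanHandle VARBINARY(64);

-- Grab the handle of the most expensive plan matching the suspect query text
-- (run ad hoc, this query can match itself; refine the pattern as needed)
SELECT TOP (1) @PlanHandle = qs.plan_handle
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%YourSuspectQuery%'   -- placeholder search pattern
ORDER BY qs.total_worker_time DESC;

-- Remove just that one plan, leaving the rest of the cache intact
IF @PlanHandle IS NOT NULL
    DBCC FREEPROCCACHE(@PlanHandle);

On SQL 2005, the equivalent move is DBCC FREEPROCCACHE with no argument, which throws out everything.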
Enough talk, let’s get down to details. I’ll show you the less ugly of the two examples first, focusing on one specific query that would frequently get a bad plan and utterly bring the server to a halt. There were other queries as well, but this was the most visible. Literally visible, in that it shows up nicely on the Ignite graphs, making it easy to prove to my boss how awesome my solution was. Here’s the Ignite report – those big red bars are bad, that’s our problem query racking up wait time.
As you can see, every few days this query would generate a ton of wait time, all because of bad parameter sniffing. It got really annoying there in mid-November, until I started running the following as a SQL Agent job every five minutes:
DECLARE @PlanHandle VARBINARY(64);
DECLARE @TopQueries TABLE (plan_handle VARBINARY(64));

-- Grab the plan handles of the 20 most CPU-hungry queries
INSERT INTO @TopQueries
SELECT TOP 20 plan_handle
FROM sys.dm_exec_query_stats
ORDER BY total_worker_time DESC;

DECLARE TopQueryCursor CURSOR FOR
    SELECT plan_handle FROM @TopQueries;

OPEN TopQueryCursor;
FETCH NEXT FROM TopQueryCursor INTO @PlanHandle;

-- Flush each of those plans individually, leaving the rest of the cache alone
WHILE @@FETCH_STATUS = 0
BEGIN
    DBCC FREEPROCCACHE(@PlanHandle);
    FETCH NEXT FROM TopQueryCursor INTO @PlanHandle;
END

CLOSE TopQueryCursor;
DEALLOCATE TopQueryCursor;
Worked like magic. This particular query no longer causes us performance problems, and system performance is much more stable.
Now For The Ugly
That solution worked pretty well for us on SQL 2012. Unfortunately, like every family, we have a black sheep. That would be SQL 2005. Not only does it smell bad and look funny, but it also won’t allow us to flush a single query plan. Fortunately, our parameter sniffing problem isn’t as severe on that platform. There’s really only one specific query that we’ve had trouble with. Like the previous example, it’s a framework-generated query, and like the previous example, when it goes bad, it goes really, really bad.
Turning again to Ignite (have I mentioned that I LOVE Ignite? Awesome product), we can see the recent history of this problem query. As before, the red bars are bad. Also as before, the query in question was normally well-behaved, but would occasionally throw a bad plan and bring the server to a crawl.
This latest occurrence happened on a weekend, and nobody was paying attention. The server cried for mercy for almost three days before somebody noticed. At that time, I decided to implement a similar auto-flush mechanism (like a plumber, with less butt cleavage). This one’s a little different though. I can’t flush individual plans, so I don’t care about plan handles. I’m also really only looking for one bad query. The following chunk of code, run every 5 minutes, has brought this sometimes smelly query under control.
DECLARE @Command VARCHAR(MAX);

-- Build a list of KILL commands for any user session that has been running
-- the problem query for more than 5 minutes
-- (note: KILL rolls back any open transaction for the session)
SELECT @Command = STUFF((
    SELECT ';KILL ' + CAST(dm_exec_sessions.session_id AS VARCHAR(25))
    FROM sys.dm_exec_sessions
    INNER JOIN sys.dm_exec_connections
        ON dm_exec_connections.session_id = dm_exec_sessions.session_id
    INNER JOIN sys.dm_exec_requests
        ON dm_exec_sessions.session_id = dm_exec_requests.session_id
    CROSS APPLY sys.dm_exec_sql_text(dm_exec_requests.sql_handle) AS query_text
    WHERE dm_exec_sessions.session_id > 50   -- skip system sessions
        AND query_text.text LIKE '%SELECT DISTINCT TOP 201%'
        AND dm_exec_sessions.last_request_start_time < DATEADD(MINUTE, -5, GETDATE())
    FOR XML PATH('')), 1, 1, '');

-- If bad queries were detected, flush the cache before killing the processes
IF @Command IS NOT NULL
    SET @Command = 'DBCC FREEPROCCACHE;' + @Command;

-- Execute the dynamic SQL statement
EXECUTE (@Command);
Better Than Aspirin
There you have it. The assignment was to write about automation, so I wrote about automation. It may not be pretty, it may not be a best practice, but it’s automation. Should you do this in your environment? That’s your call, not mine. You need to understand the impact of flushing the procedure cache (hint: it involves compilations). You need to pursue the correct ways of solving the problem first (fixing the code, parameterizing, tuning, etc.), but if you’re in an environment where you don’t control the code, and development is slow to respond, this might be an answer for you. It’s a bit cowboy, but it’s worked for me and has eliminated some headaches. Want to eliminate your own? Then here’s my final word of advice: fix the root cause where you can, and automate the workaround where you can’t.