August 2, 2006 at 7:32 pm
We've had a 64-bit SQL Server 2005 Standard instance running on Windows 2003 x64 with 8 GB of RAM and two dual-core 2.8 GHz procs for some time now (since January or so). Things have generally been going pretty well. We have a Veritas tape backup process which regularly retrieves .BAK files from this server during off hours. Occasionally, that backup process lapses into normal hours, and the fun starts.
That process seems to drain SQL Server of its memory (prior to yesterday, when because of this issue we finally set the MIN memory to something very high). When that happens, the performance of the database understandably slows to a crawl. This has happened once before and we didn't learn our lesson (cutting off the tape job if it runs too long; setting the MIN memory). However, when the backup was stopped, things gradually returned to normal over the course of an hour or so.
This time, however, things didn't go so smoothly. Although SQL eventually reclaimed its memory, it never really acted properly until a reboot (the next morning; it had issues even after re-indexing and stats were run as part of the nightly maintenance). Unfortunately, although the reboot made the issues go away for the majority of the day, they recurred in the early evening with seemingly no provocation.
The symptoms are as follows. The amount of blocking suddenly and dramatically increases, from almost nothing to many processes. Strangely, the blocking and blocked processes are often pure SELECTs, some of which even have the WITH (NOLOCK) hint. The CPU usage also goes through the roof (from an average of about 5% to an average of about 95%).
The fact that things which shouldn't acquire or respect many locks were causing problems led us to believe they weren't normal data locks but perhaps compile or schema locks of some kind. Our suspicions seemed to be accurate as the stored procedure cache was nearly empty and new procedures were not being cached. This was confirmed by a number of queries as detailed below. Clearing the data cache (CHECKPOINT; DBCC DROPCLEANBUFFERS) seemed to do some good, but sometimes not, and usually not for long. The problem also occasionally goes away by itself for a while.
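(For reference, the exact statements we run to clear the data cache:)
CHECKPOINT                -- write dirty pages to disk so clean buffers can be dropped
DBCC DROPCLEANBUFFERS     -- empty the data (buffer) cache, leaving the plan cache alone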
Mind you, this is all on the same server which had been running fine for several months straight. Also, although new development is implemented every so often and of course the data changes all the time, there has been nothing new or unusual in the past few days (except for that backup issue) which we can imagine could cause this. Also, it's no longer a pure memory issue, as SQL Server now consistently keeps nearly the full 8 GB of RAM.
Needless to say, this is quite frustrating and painful to our operations staff. We are interested in resolving it quickly in any way short of rebooting the server every so often.
We have reviewed the recompilations in Profiler and found a few procs causing a large number of compilations. They all involve (low rowcount) temp tables. Many have been "fixed" by using table variables, but some are nested, referring to a temp table created in a parent proc. Those will require a redesign using some sort of permanent table, assuming that ends up performing better than the mandatory recompiles which happen for all procs that refer to temp tables created outside of the proc itself.
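As a purely hypothetical illustration of the kind of change we made (these are not our real procs or tables):
-- before: a temp table, which triggers recompilation in procs that reference it
CREATE TABLE #RecentOrders (OrderID INT, OrderTotal MONEY)
INSERT INTO #RecentOrders (OrderID, OrderTotal)
SELECT OrderID, OrderTotal FROM dbo.Orders WHERE OrderDate >= DATEADD(day, -1, GETDATE())
-- after: a table variable, which avoids those recompilations (fine for our low rowcounts)
DECLARE @RecentOrders TABLE (OrderID INT, OrderTotal MONEY)
INSERT INTO @RecentOrders (OrderID, OrderTotal)
SELECT OrderID, OrderTotal FROM dbo.Orders WHERE OrderDate >= DATEADD(day, -1, GETDATE())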
While these frequent recompilations are troublesome and we desire to eventually eradicate them (the temp table data is always very similar in structure so there is no real world need for the recompilations), none of these frequently-run procedures have changed in weeks or perhaps months (except for today during the attempts to fix them). Therefore, even though the performance monitor often shows a number of recompilations per second which exceeds the number of batches per second (remember '05 does statement-level recompilation so that's possible), we believe this has been going on for some time and is not the issue.
What seems to be the issue now is that occasionally the stored procedure cache empties out and is resistant to being refilled. Does anyone have any ideas as to why this might be, aside from someone running DBCC FREEPROCCACHE every so often? I understand that SQL Server dynamically allocates cache memory between data and procs. Is it possible that some large amount of data is being retrieved which pushes the procs out of the cache? Is there any way to monitor that situation? We've tried DBCC MEMORYSTATUS but it produces output which we don't understand.
I hesitate to mention this next bit of information, as I can't imagine it's relevant, but eerily around the time of one of the server's descents into madness, we were pulling up a job history using the management studio. We have a lot of history data, and it occasionally times out, as it did this time. We refreshed it a few times and it eventually retrieved it. Is it possible that pulling all of this data back caused the server's data cache to fill up and push out the procedure cache? Mind you, not all of the performance issues correspond with this sort of management studio activity.
To monitor the procedure cache, we run DBCC PROCCACHE. When things are running well, its "num proc buffs" is over 100,000 and is steady with each run of the DBCC command. When these issues are happening, it's less than 5,000 and fluctuates rapidly with each run of the DBCC command. The "proc cache size" is also lower-- down from over 2500 to less than 10 or so-- which is confirmed by querying syscacheobjects.
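The monitoring itself is nothing fancy; roughly this (the syscacheobjects grouping below is just one of the confirmation queries we use):
DBCC PROCCACHE   -- we watch "num proc buffs" and "proc cache size"
SELECT cacheobjtype, objtype, COUNT(*) AS cached_plans, SUM(pagesused) AS pages_used
FROM sys.syscacheobjects
GROUP BY cacheobjtype, objtype
ORDER BY pages_used DESC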
To confirm the lack of procedure cache, running the undocumented DBCC CACHESTATS produces very low values for "SQL Plans" (< 10%), as opposed to anywhere between 40% and 80% normally.
Sorry for the long-winded post, but I wanted to be sure to impart as much relevant info as I could think of.
Does anyone have any idea as to what's flushing our procedure cache?
Thanks for any insight any of you can provide.
August 3, 2006 at 3:10 am
OK, so it's a long post and I'm not too sure what is actually causing the problem for you.
If you rebuild indexes and/or stats then you will get blocking on your database during this time - also this action will probably mark all your procs for recompilation.
You mention a Veritas tape backup - I assume this is collecting .bak files created by native SQL Server backups, or I hope so. If your backup drive is shared with any of your data or log drives then expect performance issues whilst the .bak files are read - I/O will probably peak, which will slow down your server and increase blocking.
Some backup software locks drives and folders exclusively; I won't mention names, but one big player just about destroys a SQL Server if it gets to run its agent. (God how I hate tape backup software!!!) Without doubt the best way to achieve SQL backups is to use native backup and use robocopy to move your backups to a repository, then allow the tape backup to take the files from there - believe me, this method is just so much more efficient. Of course all these agents need memory to run and they may spawn multiple threads, which will devour memory.
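Roughly what I mean - the paths are examples only, adjust the retry switches to taste:
robocopy D:\SQLBackups \\backupserver\sql_bak_repository *.bak /R:3 /W:30 /NP /LOG+:D:\Logs\bak_copy.log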
I suspect your problem isn't sql server but your backup process.
I'd really suggest you set your SQL Server memory to be fixed; I never use any form of dynamic memory above the standard 2 GB.
To check your disks for issues, use Perfmon to monitor disk I/O completion time and disk usage.
PS. If you re-read your post you answer yourself in the first paragraph: "Occasionally, that backup process lapses into normal hours, and the fun starts."
[font="Comic Sans MS"]The GrumpyOldDBA[/font]
www.grumpyolddba.co.uk
http://sqlblogcasts.com/blogs/grumpyolddba/
August 3, 2006 at 7:55 am
I guess I should have stated "and then the fun started". The backups no longer run during business hours yet the issues still happen. (They only ran during business hours twice. The first time it was problematic but then improved once it was stopped. This last time it was problematic but hasn't fully recovered.)
Yes, you are correct in that they're native BAK files being picked up by the tape backup software, but no agent is involved.
We plan to move to a file-copy-then-backup to tape methodology, but we haven't yet.
I suppose we could do fixed memory, but the problem is no longer that the memory itself is removed from SQL Server-- just that SQL Server decides to use it for something other than procedure caching.
August 6, 2006 at 6:02 pm
This is still happening. Reboots and service restarts have not permanently solved the problem, though they generally help a lot for a while.
The backups no longer run during business hours.
We're at a loss as to what's going on. Every so often, the data cache shoves everything out of the procedure cache and we get cascading compile locks.
Any other ideas?
August 6, 2006 at 11:15 pm
Not sure what's going on with your server.
A few things I would check:
1. Do you have Lock Pages in Memory enabled for SQL Server?
2. When you are seeing heavy blocking, check the head blocker and look at its waittype/waittime (a sketch of the kind of query I mean follows this list).
3. I would check for a query which fetches a large amount of data (check its execution plan) and is flushing the procedure cache out of memory.
4. I am not sure whether the backup process was anything more than a coincidence with "the fun".
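A minimal sketch of the check I mean in point 2, using the old sysprocesses view (sys.dm_os_waiting_tasks in 2005 gives similar information):
-- sessions that block others but are not blocked themselves (head blockers)
SELECT spid, blocked, lastwaittype, waitresource, waittime, cmd, status
FROM sys.sysprocesses
WHERE blocked = 0
  AND spid IN (SELECT blocked FROM sys.sysprocesses WHERE blocked <> 0)
-- for compile blocking, the waitresource on the blocked sessions typically ends in [[COMPILE]]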
If possible, open a case with Microsoft SQL Server Support as they have better tools to narrow down the problem.
HTH
-B
August 7, 2006 at 3:37 am
The symptoms you describe don't really match the conclusions. How do you establish that your procedure cache is being cleared? Are you using a maintenance plan for your backups which includes index rebuilds and stats updates? Either of these can cause objects to be marked for recompilation; note that auto stats can also cause objects to be recompiled. I'm not too sure that this would cause blocking, though - it's only the first call of the proc which triggers the recompile, and thereafter the plan is resident.
It sounds more likely your data cache is being flushed, which is very easy to do with a poor (high I/O) query. Just to illustrate this point, which many DBAs appear to fail to grasp: 750k I/O in a query - and don't laugh, I've seen quite simple-looking reports hit 2 or 3 million I/O - will flush nearly 5.75 GB of data from the cache. Flushing your data cache will increase physical I/O, and as the disks are the slowest part of your system, this might be your problem. I suggest you profile for high-I/O queries on your system.
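If you want a quick look without setting up a trace, something along these lines against the 2005 DMVs will show the heavy readers - a sketch only, and bear in mind it only covers plans still in cache, which may be misleading given your symptoms:
SELECT TOP 20
    qs.total_logical_reads,
    qs.total_physical_reads,
    qs.execution_count,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC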
You might like to try/buy sql diagnostic manager ( http://www.idera.com ) this will collect details of such queries in the background for you, easier than setting up and maintaining profiler traces.
As before, I'd recommend locking down the server by setting the min and max memory for SQL Server to be the same - you don't need dynamic memory on a dedicated box. Remove or disable any unwanted/unneeded services and suchlike running on your box. Review your maintenance plans and procedures to make sure out-of-date stats aren't causing scans - you can use Perfmon to monitor full scans and see if that counter rises at your problem times.
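Something like this, with illustrative values only - leave a couple of GB for the OS:
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'min server memory (MB)', 6144
EXEC sp_configure 'max server memory (MB)', 6144
RECONFIGURE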
[font="Comic Sans MS"]The GrumpyOldDBA[/font]
www.grumpyolddba.co.uk
http://sqlblogcasts.com/blogs/grumpyolddba/
August 7, 2006 at 8:52 am
We establish the procedure cache is being cleared by querying sys.dm_exec_cached_plans, which when running well has a large number of records but when running poorly has nearly no records (we're talking several thousand versus under 5). As far as we can tell, that shows what's in the procedure cache, and it's effectively being "cleared" insofar as it goes from full to empty.
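Roughly, the query we keep re-running is along these lines (the exact grouping varies):
SELECT objtype, COUNT(*) AS cached_plans,
       SUM(CAST(size_in_bytes AS BIGINT)) / 1024 / 1024 AS size_mb
FROM sys.dm_exec_cached_plans
GROUP BY objtype
ORDER BY size_mb DESC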
Yes, we do nightly maintenance which includes indexes and stats, which will cause the cache to be effectively cleared, as will restarting the server. However, in both of those cases, performance does not suffer as plans stay in memory after the event. It's only at various times during the day (when no stats/index job is running, though autostats might be) when things get crazy. So rebooting doesn't slow things down due to a need for recompilation; rather, it forces everything to recompile once and then it holds onto the plans, which causes a net increase in performance. We can make our problems go away by restarting the SQL Service on a regular basis, but that's obviously ridiculous.
SQL Server's memory is locked at 7936 MB (insofar as that's the min & max). We've seen no issues (after the first one almost a week ago) with respect to the SQL server's overall memory use changing. We've seen it stick at that 8 gig consistently throughout the day. It's just the cache internals which seem to be at issue.
Basically nothing else is running on this box. It's likely not a box performance issue. Nothing is occupying any CPU time except for SQL Server itself (since last reboot, CPU time idle = 858:22:20; sqlserver.exe 24:00:00; MsDtsSrvr 2:12:44; System 1:41:30; services.exe 1:24:32; lsass.exe 1:23:10; everything else below 31:00).
We don't have any specialized performance analyzing tools. We'll probably end up paying for a consultant to help us out and learning from that process.
August 7, 2006 at 9:26 am
I'll do my quick plug for sql diagnostic manager, go to http://www.idera.com and get a demo/buy a copy and attach to your server. This will trend your box over the day(s) and you can examine such things as cache memory and so on. Might be cheaper than a consultant.
[font="Comic Sans MS"]The GrumpyOldDBA[/font]
www.grumpyolddba.co.uk
http://sqlblogcasts.com/blogs/grumpyolddba/
August 7, 2006 at 9:42 am
1. Since we originally implemented AWE (but no longer rely on that as far as a SQL Server setting is concerned) we did enable the SQL Server service to Lock Pages in Memory. We left that setting even after disabling AWE in general, which seems consistent with the reasoning in this Microsoft blog entry:
http://blogs.msdn.com/slavao/archive/2005/08/31/458545.aspx
2. Will do, but I think I checked before generally and they were all compile locks. That's why things with "WITH (NOLOCK)" were still blocking.
3. Unfortunately on this system there are several queries which could be returning a large amount of data. In the past, it hasn't been an issue. Something has changed, but it's not a setting and I don't think it's any database code.
4. It seemed to set off this rash of issues, but I don't think it's still the root cause.
We might have to try Microsoft support, but that's such a lengthy, laborious process.
August 7, 2006 at 10:24 am
2. The lead blocker typically has no wait time. These are "rolling blocks"-- that is, they aren't blocked for very long. Compiling is a serialized deal, I think, and since it isn't saving the plans, it will just keep putting everyone in a long line.
Here's something extremely frustrating:
We're trying to get an idea of WHY things are being removed from the cache, so we're tracing SP:CacheInsert and SP:CacheRemove. Well, despite the cache clearing out, there are NO CacheRemove events! We know the trace is configured correctly, because right after the nightly maintenance, it logged CacheRemoves for the cache flush that it does (automatically) after the indexes & stats are re-run. We also saw a smattering of removals later in the early morning. But when the crap hits the fan, there are NO removals, despite the cache nearly clearing out. Very, very frustrating.
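For anyone trying to reproduce this, the server-side equivalent of our Profiler trace looks roughly like the following (the file path is made up; as I understand it, trace event IDs 35 and 36 are SP:CacheInsert and SP:CacheRemove, and columns 1/12/14/34 are TextData/SPID/StartTime/ObjectName):
DECLARE @traceid INT, @on BIT
SET @on = 1
-- create a stopped trace writing to a 50 MB file (path is hypothetical)
EXEC sp_trace_create @traceid OUTPUT, 0, N'D:\Traces\cache_events', 50, NULL
-- SP:CacheInsert (35) and SP:CacheRemove (36): capture text, SPID, start time, object name
EXEC sp_trace_setevent @traceid, 35, 1, @on
EXEC sp_trace_setevent @traceid, 35, 12, @on
EXEC sp_trace_setevent @traceid, 35, 14, @on
EXEC sp_trace_setevent @traceid, 35, 34, @on
EXEC sp_trace_setevent @traceid, 36, 1, @on
EXEC sp_trace_setevent @traceid, 36, 12, @on
EXEC sp_trace_setevent @traceid, 36, 14, @on
EXEC sp_trace_setevent @traceid, 36, 34, @on
-- start it, and remember the trace id so it can be stopped (status 0) and closed (status 2) later
EXEC sp_trace_setstatus @traceid, 1
SELECT @traceid AS traceid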
August 7, 2006 at 11:14 am
Here's an example of this happening. Here are the results from DBCC CACHESTATS and DBCC PROCCACHE, about 3 minutes apart. There was no blocking and low CPU usage before, and blocking and high CPU usage after. And not a single CacheRemove event in the profiler! What the heck is going on here?
Before:
Object Type Hit Ratio Use Count Pages Used
---------------------------------- ----------------------------------------------------- ----------- -----------
Object Plans 1.0 8 67134
SQL Plans 0.8359375 9 13695
Bound Trees 0.67153284671532842 0 834
Extended Stored Procedures 0.86940298507462688 0 8
Temporary Tables & Table Variables 0.89073634204275531 5 8
Summary 0.86508633304067895 22 81679
num proc buffs num proc buffs used num proc buffs active proc cache size proc cache used proc cache active
-------------------- -------------------- --------------------- -------------------- -------------------- --------------------
83935 950 950 2567 16 16
After:
Object Type Hit Ratio Use Count Pages Used
---------------------------------- ----------------------------------------------------- ----------- -----------
Object Plans 0.49100257069408743 14 1723
SQL Plans 0.17615658362989323 1 6
Bound Trees 0.59138187221396732 0 1354
Extended Stored Procedures 0.84988452655889146 0 2
Temporary Tables & Table Variables 0.89807162534435259 9 3
Summary 0.62884882108183082 24 3088
(6 row(s) affected)
num proc buffs num proc buffs used num proc buffs active proc cache size proc cache used proc cache active
-------------------- -------------------- --------------------- -------------------- -------------------- --------------------
3618 1204 1204 15 15 15
August 7, 2006 at 12:22 pm
In case this helps, these are what appear to be the relevant (i.e. different) results from DBCC MEMORYSTATUS, from yet another one of these incidents which led to us restarting the service.
Before:
Buffer Counts Buffers
------------------------------ --------------------
Committed 1015808
Target 1015808
Hashed 783828
Stolen Potential 933476
External Reservation 2344
Min Free 256
Visible 1015808
Available Paging File 350727
(8 row(s) affected)
Procedure Cache Value
------------------------------ -----------
TotalProcs 306
TotalPages 22644
InUsePages 1514
MEMORYBROKER_FOR_CACHE Value
-------------------------------- --------------------
Allocations 25792
Rate 1121
Target Allocations 774577
Future Allocations 0
Last Notification 1
(5 row(s) affected)
MEMORYBROKER_FOR_STEAL Value
-------------------------------- --------------------
Allocations 3269
Rate -459
Target Allocations 750474
Future Allocations 0
Last Notification 1
After:
Buffer Counts Buffers
------------------------------ --------------------
Committed 1015808
Target 1015808
Hashed 786686
Stolen Potential 955724
External Reservation 0
Min Free 256
Visible 1015808
Available Paging File 357014
(8 row(s) affected)
Procedure Cache Value
------------------------------ -----------
TotalProcs 18
TotalPages 3223
InUsePages 1730
MEMORYBROKER_FOR_CACHE Value
-------------------------------- --------------------
Allocations 4764
Rate -8858
Target Allocations 776878
Future Allocations 0
Last Notification 1
(5 row(s) affected)
MEMORYBROKER_FOR_STEAL Value
-------------------------------- --------------------
Allocations 4554
Rate -4045
Target Allocations 777387
Future Allocations 0
Last Notification 1
(5 row(s) affected)
MEMORYBROKER_FOR_RESERVE Value
-------------------------------- --------------------
Allocations 0
Rate -1952
Target Allocations 814212
Future Allocations 187630
Last Notification 1
August 11, 2006 at 10:54 am
We ended up contacting Microsoft (phone) support about this issue.
We chatted with the technical router who took some rudimentary notes about our issue and determined that it was of a high priority-- such that Microsoft would work "24 hours per day until it's resolved", and they expected the same from us. She then checked the queue and noticed that it wasn't very active and thought we would get right through to someone. After arranging for our $250 credit card payment, she put us in touch with a tech.
The tech ("SQL Server Support Engineer") read through the notes and then verified and established contact information by emailing us his info. He then asked for more detail about the issue, including exactly which Server & OS version we were using. We told him how things had been working fine since January on this platform and we didn't think there were any configuration changes which would have caused this. We have him our hunch about the procedure cache being emptied out due to some sort of memory pressure, but noted that we had set the minimum memory for the Server so it should have plenty of breathing room.
He seemed to think setting that high of a minimum wasn't a good idea and shouldn't have helped. He seemed most intrigued by the CPU usage greatly increasing during our issues. He asked us if we had tired the DBCC FREEPROCCACHE command, and we said we had, but that didn't do anything (since the effect of that-- clearing the proc cache-- was what we were witnessing which itself seemed to be an artifact of the problem). We told him that the DROPCLEANBUFFERS seemed to help occasionally, but not reliably.
He asked a lot of questions about what applications were accessing the Server and whether those applications were throwing any errors. Because we write the applications, they'll throw whatever errors we tell them to, so I didn't see how this was very relevant. I asked him whether he meant to ask if SQL Server was throwing any errors, and told him that it was not. I told him our application was only throwing occasional access or timeout errors depending on whether the issue was happening or we were restarting the services. He also, of course, asked about what changes would have been made recently. I told him all of the significant changes (such as changing temp tables to table variables) were made since the issues started (in attempts to fix the issues). He also had me show him the applications we were using to access the server, and had me perform some operations in them to get a feel for the speed.
He also asked-- several times, seeming not to remember between each time he asked-- how many CPUs we had and of what kind. I explained that we had two physical processors, each with two cores, each of which had HyperThreading enabled. This results in 8 virtual processors. I tried to explain that a few times. (Interestingly, SQL Server reported our processors as AMD, but I'm fairly certain they're Intels, what with the option of HyperThreading and all.)
On his request, we initiated a LiveMeeting session with my desktop. (I have 3 monitors so I was surprised that it even worked, but it seemed to work well.) I of course had Management Studio and various performance monitors running on my machine. I showed him the procedure cache and other numbers in the performance monitor we had running for SQL Server. Luckily (?), the issue occurred while he was on the phone, and he was able to see the high CPU and low procedure cache numbers. (Our SQL Plan cache hit ratio seems to always be low-- under 10, which we know to be bad and want to investigate, but the cache size when there were no issues was high, and the ratio was still above 5 or so. During the issues, all of the numbers were pretty much 0.) I also demonstrated how the applications become basically unresponsive during these times.
While this was happening (and I was getting nervous, what with our applications being basically unresponsive), he had us methodically walk through all of our server settings (properties in the Management Studio). He made the following changes (a rough T-SQL equivalent is sketched after the list):
Server Memory Min From 7936 to 5120
Server Memory Max From 7936 to 6144
Remove Checkbox on Boost SQL Server Priority
Max Degree of Parallelism from 0 to 4
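For the record, we made these changes through the Management Studio GUI; the T-SQL equivalent would be roughly this:
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'min server memory (MB)', 5120
EXEC sp_configure 'max server memory (MB)', 6144
EXEC sp_configure 'priority boost', 0              -- this one only takes effect after a service restart
EXEC sp_configure 'max degree of parallelism', 4
RECONFIGURE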
After we applied our changes, the issue seemed to go away. Only rarely has the issue gone away by itself (without us restarting the services), and only after a while. It had been several minutes since the issue had started, but not long enough that a self-stoppage would have been consistent with past behavior. This led us to believe that our changes had a positive effect on the issue.
He seemed to think that the parallelism change was likely the key. (A setting of "0" means use all available processors.) He said that it's recommended that the max be set to the number of physical processors. He was somewhat muddy, especially with the AMD processors which he thought we had, as to whether it was cores or slots/sockets which mattered, but it seems that cores is sufficient.
I tried to get an answer from him as to how this setting would result in the behaviors we witnessed. He tried to explain it, something about thread overhead and such, but we really didn't understand what he was saying. I did ask if we should disable HyperThreading, as I've read here and there might be a good idea, and he said it should not be necessary.
I'm still a little confused, but I can see how there could be overhead to spreading out processes across multiple processors, and that maybe with "virtual" processors, it's not worth the overhead. I just don't see how that slightly sub-optimal work distribution could cause such a drastic utilization spike, and what it has to do with the procedure cache. I can see how this can just crop up, when it had been working fine with the original setting for so long, since changes in data table size (which happens, of course, over time) can cause different plans to be created, some of which might take more advantage of parallelism, especially on a large query which might be the acute trigger of these episodes. I still find it strange, though, that an arbitrary number like the number of physical processors fixes the load distribution-- after all, it doesn't know which are real and which are virtual, so it might still pick 4 virtual processors which are all on the same physical CPU.
I guess, though, we shouldn't look a gift horse in the mouth. For now, the issue appears to be solved, and it only cost us $250. I just wish we had called sooner. I'm somewhat surprised that it ended up being such a quick investigation.
Thanks for those who provided insight for us on this forum, and hopefully the information in this thread will be helpful for others.
August 11, 2006 at 4:38 pm
I was going to post that you were not allowing any memory for the OS - then I read your post about the Microsoft call.
I'll be blunt: you'd not configured memory correctly, and you had checked the boost priority option (which you hadn't mentioned), which is known to cause problems. Learning how to configure a server is important, and you do have to learn somewhere, I agree, but you've caused your own problems by "playing" with the config.
I struggle to understand that you don't know the config of your server - do you have Intel or AMD procs, and exactly how many?
Just out of interest, the compile locks you mention are often caused by non-qualification of objects, and usually by naming user stored procs with an sp_ prefix.
Check out the KB; this practice causes serialisation of procs - I've fought to put this right previously on client sites.
Performance tuning is an art, I guess, and I don't really know where to point you to assist in your understanding. I would say that you need to understand your server and how to configure it first; not allowing the OS any memory would have screwed your box big time, as you found out to your cost. Sorry for the rant, but this would never happen with Oracle or DB2.
[font="Comic Sans MS"]The GrumpyOldDBA[/font]
www.grumpyolddba.co.uk
http://sqlblogcasts.com/blogs/grumpyolddba/
August 14, 2006 at 9:51 am
We were allowing some physical memory for the OS, just not much. And the OS can always use the spare virtual memory-- we have something like a 4 gig swap file. OS memory pressure or other OS tomfoolery would explain general performance issues, but this was a spike in the CPU usage of the SQL server process itself. Wouldn't we see those sorts of issues with kernel processes, not application processes?
As far as the settings, if you read the post again, you'll note that we did not change any settings UNTIL this started happening. We'd been running for 6 months with "boost priority" with no issues. It's not like we were fiddling around one day and then scratching our heads when things stopped working.
When we spec'd out the server, we knew what kind of procs we were getting. That was 6 months ago. I remember them being 2.6 or 2.8 Ghz Intel Xeon Dual Cores with HyperThreading. I didn't remember exactly, nor for sure, and since I didn't have the spec's in front of me when talking with Microsoft nor writing this post, I didn't want to insist we had something that we didn't. I'd never noticed that SQL Server thought them to be AMDs, so that shook my confidence in my memory.
But as far as how many procs, that I know, and that I conveyed several times to the Microsoft support engineer and in this thread. We have two physical slots occupied each by a dual-core processor which has HyperThreading enabled, for an effective 8 processors. As to exactly how many "processors" we have, that's somewhat a matter of interpretation/definition.
Yes, we've read the kb articles about "sp_" prefixing and object qualification. We do not preface any procs with "sp_". However, we also do not specify the database owner ("dbo" for most of them) in all cases, which we understand will lead to potentially decreased cache usage and/or unnecessary compile locking. However, again, we've been running this way for 6 months-- and 2 years before that on SQL 7-- with no issues of this sort. There had to be something else going on besides the style of calling our procedures.
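(For what it's worth, the difference in question is just owner-qualifying the call; the proc name here is made up:)
EXEC usp_GetOrders @CustomerID = 42        -- unqualified: can take a compile lock while the owner is resolved
EXEC dbo.usp_GetOrders @CustomerID = 42    -- owner-qualified: avoids that extra lookup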
We didn't ask you to point us anywhere to assist our understanding of performance tuning. Performance tuning is about incremental (though perhaps significant) improvements in resource utilization or runtime. This was firefighting, not performance tuning.
You say that we "found out to [our] cost" that misconfiguration can cause problems. First, recall that this configuration change was made in response to what we perceived as a memory starvation issue. SQL Server at the time it started to misbehave was using only 2 gig of memory-- its minimum. This corresponded to significantly decreased performance. It seemed a logical-- if cavalier-- step to force feed SQL Server more memory. We didn't make the problem any worse, so there wasn't any "cost" to that. Furthermore, the Microsoft tech didn't say that the memory settings were the cause of our issue, so we didn't "[find] out" anything either. Second, if you thought these configuration settings were so important and relevant to our issue, why didn't you ask to hear more about them before making guesses as to what's going on? And when you recommended changing the memory settings in the way that we had (setting min=max), why didn't you mention approximately how much should be left over for the OS?
Your comment about Oracle/DB2 seems like a troll. If not, it's an unsubstantiated boast. Mixed with the condescension in the rest of your post, it almost seems as if you wish to imply that DB2 and Oracle, being more sophisticated platforms, require commensurately more sophisticated DBAs, which of course never have any misconfigurations nor have any need for community nor vendor assistance. Maybe, then, you're in the wrong forum.