Slowdown After Moving to New Virtual Machine

  • Hi there! I'm having a bit of a conundrum with one of our servers currently. Last night, we moved our SQL box from its previous virtual machine to a new one, and provisioned some extra resources for it (more CPU, slightly faster disks).

    However, for some reason, everything's moving really sloooooowly on the server now. The SQL instance has grabbed up all the resources it usually does (28GB of RAM), and the CPU is spinning at about 25-40% usage, with occasional spikes up to 100% for a few seconds if things get heavy. All settings and so on are as they were previously, and these readings match what I usually see. However, one of our heavier procedures that usually takes about 5 minutes has now been running for around 40 minutes without finishing.

    Because of this, I'm not really sure what's slowing the server down; I'm not seeing any memory pressure from DMV queries, and I've checked to make sure the power-saving CPU limiter is off, so I don't have much of an idea as to what else I could tune. Is there something special that needs to be done when moving to a new virtual machine? I wasn't involved with the move, so I'm not entirely certain what transpired during it, but I don't think it should've shredded our performance quite this badly.

    Thank you for any assistance you can provide!

    - 😀

  • Personally, I would start by looking at the waits on the system and trying to identify whether any particular wait type is causing the problem; this will usually give you an indication of whether you are waiting on a particular resource such as the disks, CPU, etc.
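
    Something along these lines against sys.dm_os_wait_stats will show the top waits (just a rough sketch; the excluded wait types below are only examples of common benign system waits):

    ```sql
    -- Top waits since the last restart (or since the stats were last cleared).
    SELECT TOP (10)
           wait_type,
           wait_time_ms / 1000.0        AS wait_time_sec,
           signal_wait_time_ms / 1000.0 AS signal_wait_sec,
           waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TASK_STOP',
                            N'XE_TIMER_EVENT', N'REQUEST_FOR_DEADLOCK_SEARCH',
                            N'LOGMGR_QUEUE', N'CHECKPOINT_QUEUE')
    ORDER BY wait_time_ms DESC;
    ```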

    I would also run a server-side trace for a few minutes during the performance issues and start looking at the queries that are taking up the most resources / running the longest.
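
    If a trace isn't practical, a rough alternative is to pull the heaviest cached statements from the plan cache (sketch only; pick whichever columns matter to you):

    ```sql
    -- Heaviest cached statements by total elapsed time, with CPU and reads alongside.
    SELECT TOP (20)
           qs.execution_count,
           qs.total_elapsed_time / 1000000.0 AS total_elapsed_sec,
           qs.total_worker_time  / 1000000.0 AS total_cpu_sec,
           qs.total_logical_reads,
           SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                     ((CASE qs.statement_end_offset
                         WHEN -1 THEN DATALENGTH(st.text)
                         ELSE qs.statement_end_offset
                       END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_elapsed_time DESC;
    ```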

    From there you can check the execution plans to see if there are any scans, etc., taking place.

    It may just be a coincidence that the problem has occurred since moving servers. Do you have any baseline performance metrics from before the move that you can compare against?

    MCITP SQL 2005, MCSA SQL 2012

  • Did you reindex the db and update stats before going live?
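
    If not, something along these lines is the usual approach (MyDatabase and dbo.MyTable are placeholder names; adjust to your own maintenance routine):

    ```sql
    USE MyDatabase;   -- placeholder database name
    GO
    -- Rebuild all indexes on a placeholder table.
    ALTER INDEX ALL ON dbo.MyTable REBUILD;
    GO
    -- Refresh statistics on that table with a full scan (can be expensive on large tables)...
    UPDATE STATISTICS dbo.MyTable WITH FULLSCAN;
    GO
    -- ...or refresh statistics database-wide with default sampling.
    EXEC sp_updatestats;
    GO
    ```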

  • RTaylor2208 (10/9/2013)


    Personally, I would start by looking at the waits on the system and trying to identify whether any particular wait type is causing the problem; this will usually give you an indication of whether you are waiting on a particular resource such as the disks, CPU, etc.

    I would also run a server-side trace for a few minutes during the performance issues and start looking at the queries that are taking up the most resources / running the longest.

    From there you can check the execution plans to see if there are any scans, etc., taking place.

    It may just be a coincidence that the problem has occurred since moving servers. Do you have any baseline performance metrics from before the move that you can compare against?

    Wait stats seem to be about the same as usual; CXPACKET is our highest wait, with reasonable amounts of time spent waiting, and various lock and I/O waits scattered around it. Granted, this could very well be a coincidence, as you're suggesting; the team that uses the procedure that experienced the slowdown said a larger-than-usual workload was being fed to it this time. Even then, the longest the routine has ever run was 25 minutes on an unusual workload, so this time was quite odd (the procedure finished after two hours... Painful!). Other procedures running on the server are taking quite some time, too; I use sp_whoisactive to monitor things here and there, and on occasion its execution takes a few seconds, as opposed to the usual sub-second runs, though I don't know if that's a reliable measure of server pain.

    Is a statistics update necessary after moving to a new virtual machine? I wasn't aware that it was, but if so, that would definitely explain a few things. I'll speak with the team when they get in today; they'll have to run the procedure again, and our nightly statistics update has gone through, so if that was the issue, there should be a noticeable difference in how everything runs today.

    Thanks for the suggestions! Back to hammering this out 🙂

    - 😀

  • In theory, the statistics and indexes should be fine at a logical level, as they did not change when the VM was moved. The physical disk allocation may be different, though, depending on what happened during the move.

    Out-of-date statistics would be pretty obvious in the execution plan. You mention that there is a change in the data; this could very well be out-of-date statistics, as the distribution of your data may have changed.
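
    A quick way to see when statistics were last updated is STATS_DATE; a rough sketch:

    ```sql
    -- Last update date for each statistics object on user tables, oldest first.
    SELECT OBJECT_NAME(s.object_id)            AS table_name,
           s.name                              AS stats_name,
           STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats AS s
    WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
    ORDER BY last_updated;
    ```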

    MCITP SQL 2005, MCSA SQL 2012

  • Huh... Seems like this might have expanded into a more general VM issue than a SQL Server issue. I hammered out the procedure that was running too long; it actually had pretty horrible RBAR in it (WHERE clause that concatenated two columns together to compare against another table that also concatenated two columns together for comparisons, whee!), and fixing that made everything run much better.
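
    For anyone who hits the same pattern, the problem and fix looked roughly like this (table and column names made up for illustration):

    ```sql
    -- Before: concatenating columns on both sides is non-sargable, so indexes can't be used.
    SELECT o.*
    FROM dbo.Orders AS o
    WHERE o.CustomerCode + o.RegionCode IN
          (SELECT c.CustomerCode + c.RegionCode FROM dbo.Customers AS c);

    -- After: compare the columns individually so existing indexes can be used.
    SELECT o.*
    FROM dbo.Orders AS o
    WHERE EXISTS (SELECT 1
                  FROM dbo.Customers AS c
                  WHERE c.CustomerCode = o.CustomerCode
                    AND c.RegionCode   = o.RegionCode);
    ```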

    However, the server's still running abysmally slowly on most queries, and other activities on the server itself are quite a bit slower, too. It seems to be a global issue for that particular virtual server; I'll have the systems team see what they can find out about it, but is there something that could be making an entire virtual machine with (supposedly) better hardware function much more slowly than it did on weaker hardware?

    - 😀

  • You brought up an interesting point. I have seen badly written SQL masked by the never-ending upgrading of hardware: faster processors, more memory, and faster disks.

    I've seen this in a number of workplaces where hardware was always the first line of defense in resolving performance issues.

    Then introduce a hardware performance issue and the crap hits the fan. Typically, my course of action when taking over new systems is to evaluate what is running poorly on the box. I look at average disk I/Os and average CPU as a start, and I also look at execution time. These simple indicators will ferret out poorly designed code.
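
    For the disk side specifically, something like this gives average I/O latency per database file (just a sketch against sys.dm_io_virtual_file_stats; what counts as "high" latency is a judgment call):

    ```sql
    -- Average read/write latency per database file since startup; consistently high
    -- latencies point at the storage layer rather than at individual queries.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.num_of_writes,
           CASE WHEN vfs.num_of_reads  = 0 THEN 0
                ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_latency_ms,
           CASE WHEN vfs.num_of_writes = 0 THEN 0
                ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id     = vfs.file_id
    ORDER BY avg_read_latency_ms DESC;
    ```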

    Back to your present problem. Where else have you looked in your VM environment and your SQL installation?

    I've seen major issues when NetApp storage was improperly configured and/or provisioned. I've seen improper allocation of drives to a VM sharing the same LUN. The list goes on. I've also seen network latency become a major performance issue.

    Bottom line, there are so many moving parts in a VM environment to get it working correctly. When it works well, all is great. However, when issues arise, it can take quite a bit of time to drill into the problem and resolve it.

    My view, from the DBA standpoint, is to make sure that SQL Server has been set up properly and that everything that could be done to maintain performance has been done (index and statistics management, memory allocation, configuration of data, log, and tempdb files done to SOP, etc.). If there are still performance issues, then I'd be the first one to pass the buck back to the Operations boys to have them dig into the VM environment and take it to the next step.

    Kurt

    Kurt W. Zimmerman
    SR DBA
    Lefrak Organization
    New York, NY

    http://www.linkedin.com/in/kurtwzimmerman

  • Admittedly, I haven't looked at the VM setup much at all; I don't have much sway in that area, so I'm more or less blocked from examining the VM itself for issues. I wasn't consulted when the move was done, so I pretty much walked in the next day to a host of problems without warning :-D.

    The monitoring I usually do for unusual wait amounts or queries being blocked isn't showing anything overly out of the ordinary, though I do note that IO_COMPLETION has become a top 10 wait, which is unusual. I may be able to have a more detailed meeting with the systems administrator tomorrow; if I can get clearance to look at the VM settings more closely then, I'll do some digging to make sure things are functioning properly. Thanks for the pointers!

    - 😀

  • Did you find a resolution? Curious, as we are facing a very similar issue.

  • So far, we haven't found a resolution. I've been able to isolate the problem to the CPU, but I'm not certain why the CPU would be more strained now than previously. I've reduced MAXDOP by a bit, and increased the cost threshold for parallelism, but that hasn't helped. Power-saving CPU output reduction is disabled, and our workload (aside from the original RBAR problem that was fixed) has remained as it usually is. No word from the sysadmins, either... Still hoping to speak with them about it eventually. In the meantime, it's kinda rough :hehe:
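
    For reference, here's a sketch of the standard way to adjust those two settings; the values shown are placeholders, not the ones actually in use here:

    ```sql
    -- Parallelism settings are advanced options, so expose them first.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Placeholder values: cap parallel queries at 4 cores and raise the cost threshold.
    EXEC sp_configure 'max degree of parallelism', 4;
    EXEC sp_configure 'cost threshold for parallelism', 50;
    RECONFIGURE;
    ```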

    - 😀

  • I'm with Kurt - with a VM, you are sharing resources, so someone needs to be able to look at the whole machine to understand whether there is some issue there that might be causing yours.

    They should be able to prove what resources are reserved (or not) for your SQL instance.

    And see if there is a higher level bottleneck you are unable to see.

  • Well, I was able to talk with one of the sysadmins this morning, and I got a look at the VM specs. The new physical box it's on has the same hardware as our old box, and it's hosting less resource-hungry applications than our old box (the old box held our mail server, intranet web server, and application frontend; the new box just has a low-usage domain controller on it).

    The CPU is still the sticking point in performance, but no cores are being reserved for anything on the box, and the domain controller is using less than 1% of the CPU. Performance is still rather horrid; sp_whoisactive is taking up to 30 seconds to produce a result set, but previously the longest it would take (during a heavy loading process) was maybe 3 seconds. Various reports, along with the front-end app, are seeing consistently longer execution times, up to five times longer than they were previously.

    Apparently, the move to this machine was done shortly after I started here, too; however, everyone saw the same performance issues that we're currently experiencing, so the server was moved back to the old box, where everything sped back up to a good pace again. I'm fairly baffled at this point; there doesn't seem to be a logical explanation for the terrible performance from what I can see :crazy:

    - 😀

  • You mention 28 GB of RAM for SQL. Out of how much total?

    I take it they have checked that the OS can see all the RAM (Standard vs. Enterprise edition for the OS), and that the memory is of the correct configuration. Some machines need very specific configurations for optimal performance (rank, bank, etc.).

    Also make sure that paging is not occurring, and that there is 'free' memory.

    Make sure you check the basics; it could be that some simple setup step was overlooked, e.g. memory not capped, or Lock Pages in Memory not set.

    28 GB might be a bit high if the machine is set up with 32 GB.
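
    For checking and capping the settings mentioned above, something along these lines works (the 28 GB figure is just the one from this thread, not a recommendation; Lock Pages in Memory is granted separately via Windows security policy):

    ```sql
    -- Max server memory is an advanced option; show the current value (in MB).
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)';

    -- Cap SQL Server at 28 GB (28672 MB), a placeholder figure from this thread.
    EXEC sp_configure 'max server memory (MB)', 28672;
    RECONFIGURE;

    -- Quick check of whether the OS is signalling memory pressure.
    SELECT total_physical_memory_kb, available_physical_memory_kb, system_memory_state_desc
    FROM sys.dm_os_sys_memory;
    ```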

  • Huh, interesting! Since you mentioned the server memory settings, I took a look again. For whatever reason, the max memory usage was dropped from 28GB to 24GB. I adjusted it up to 26GB; if it can hold stable at that, I'll try to increase it further, back to its original 28GB. I'm not quite sure why this change was made, but I'll speak with the head sysadmin about it.

    The server has 40GB of memory assigned to it, but the OS says it has 32GB available to it on the server instance. Originally, when we were on the old box, I was seeing RAM usage of around 30-31GB with the 28GB allocation for SQL Server. There didn't seem to be any problems with memory usage back then, so I'd say restoring it to that amount shouldn't be an issue, but I'll take the gradual change-and-observe approach.

    Thanks for the suggestions! I don't know if this will fix the problems, but it definitely gives me something to go on. And I'll politely ask that the sysadmin tells me when core SQL Server operating values are changed from now on, and why that's happening :w00t:

    - 😀

  • hisakimatama (10/17/2013)


    Huh, interesting! Since you mentioned the server memory settings, I took a look again. For whatever reason, the max memory usage was dropped from 28GB to 24GB. I adjusted it up to 26GB; if it can hold stable at that, I'll try to increase it further, back to its original 28GB. I'm not quite sure why this change was made, but I'll speak with the head sysadmin about it.

    The server has 40GB of memory assigned to it, but the OS says it has 32GB available to it on the server instance. Originally, when we were on the old box, I was seeing RAM usage of around 30-31GB with the 28GB allocation for SQL Server. There didn't seem to be any problems with memory usage back then, so I'd say restoring it to that amount shouldn't be an issue, but I'll take the gradual change-and-observe approach.

    Thanks for the suggestions! I don't know if this will fix the problems, but it definitely gives me something to go on. And I'll politely ask that the sysadmin tells me when core SQL Server operating values are changed from now on, and why that's happening :w00t:

    Enterprise OS vs. Std OS is likely the memory issue. Std can only 'see' 32 GB.

    Consider upgrading, as over time you will likely need it, even if you can get along today.

    Being able to add memory quickly is a good option to have. SQL likes it, and it can buy you some critical time to solve other issues as they come up.
