I/O requests taking more than 15 seconds

  • Hi Sandra

    What disk speeds are on your SAN: 7.2K or 15K disks? How many virtual machines are you running? Are you using VMware or Windows for the virtual environment?

  • Are you seeing any errors in the Application or System event logs at the same time as the errors in the SQL Server log? I am seeing the same errors in my SQL Server log, and at the same times I sometimes see an error from the vmscsi driver or a COM+ error from one of the apps. These errors all occur only during high-I/O operations like CHECKDB or reindexing.

    The issue may be with virtual hardware/drivers and not physical hardware or configuration.

    Quick questions:

    Do you have the VMware Tools installed on the server?

    Which VMware software are you using: ESX, ESXi, or Server?

    Bob
    -----------------------------------------------------------------------------
    How to post to get the best help

  • Thanks all for reviewing my perfmon numbers.

    I talked with the SAN team some more today. There are no errors to correlate. The network backups run from 6PM to 4AM, which is when many of the I/O errors occur (the spikes). However, the virtual server in question is not being backed up by this network job.

    Here is more info on the SAN:

    Disk speed (most likely 15K)

    Drive type: SAS

    20 drives with dual parity, so 18 usable for data

    Separate Volume for this server, 500GB

    We are using VMware, and yes, the VMware Tools are installed on the server.

    The VM software we're using is ESX version 3.5 Update 3

    The underlying configuration is a RAID 6 for this volume

    There are no other virtual machines sharing this volume

    Unfortunately there are no error messages on either the database or application servers 🙁

    Thank you all for your help with my questions. I am trying very hard to uncover this problem because I need to build this server out to support many more databases due to a consolidation project, and I want to ensure we have a stable and productive environment.

    Bless you all!

    Sandy

  • Gail and Colin have you going down the right path. I'd continue to focus on the I/Os at the SAN level. You say that your DB instance has a dedicated volume, not shared with any other process, but contention happens at the physical disk layer, not the logical volume. Many SAN admins don't understand that database files should have their own physical disks. They think that the SAN's caching abilities make up for striping their LUNs across physical disks and then sharing those physical disks with other applications.

    If indeed your DB has dedicated physical disks, maybe check at the disk controller level. I had an experience where our client was using a SAN and they were experiencing high queue lengths and I/O wait times. I checked out the specs on their SAN drives and found that they were more than enough to handle the I/O throughput that my DB was pushing to it. It turned out that the disk controller was being overloaded by a combination of my DB's traffic and traffic from other applications. While each app had its own disks, the disks shared the same controller. Worth noting.

    John Rowan

    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • Thanks John, I will follow up on that question. I have to laugh a little here because you are right about the dedicated disks 🙂

    More to come - and of course, Gail and Colin I REALLY appreciate your help!

  • Hence my question earlier about the volume group (set of disks) being dedicated. Hopefully that is the configuration. If not, you get phantom I/O issues to chase down and the fun of trying to correlate them across multiple instances, etc.

    Sandy - it is great watching you dig through this as the posts continue. 🙂

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

  • Sandra Skaar (1/15/2009)


    I talked with the SAN team some more today. There are no errors to correlate. The network backups run from 6PM to 4AM, which is when many of the I/O errors occur (the spikes). However, the virtual server in question is not being backed up by this network job.

    Perhaps not, but that the backups and the IO errors correlate seems to indicate that something along the IO path is shared with servers that are backed up.

    HBAs, fibre, switch (I've seen that one before), SAN controllers, cache, the disks themselves

    You're going to need to sit with the SAN admins and ensure that the disks, the controllers, the fibre and the switches are dedicated (not shared). As long as you're sharing some (any) part of the IO path with another server, you risk having unexpected and unpredictable slowdowns.

    As a friend is fond of saying, "There's no such thing as magic SAN dust". A SAN has to be configured and laid out with the same (or more) care as direct attached storage.

    (http://www.sqldownunder.com/SDU34FullShow.mp3)
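
    One rough way to check the correlation described above is to split the timestamps of the slow-I/O warnings by whether they fall inside the backup window. The sketch below assumes the 6PM to 4AM window mentioned in the thread; the sample timestamps and the function names are hypothetical, and pulling the real timestamps out of the SQL Server error log is left out.

```python
from datetime import datetime

# Backup window taken from the thread: 6PM to 4AM (wraps past midnight).
BACKUP_START_HOUR = 18
BACKUP_END_HOUR = 4


def in_backup_window(ts: datetime) -> bool:
    """True if ts falls between 18:00 (inclusive) and 04:00 (exclusive)."""
    return ts.hour >= BACKUP_START_HOUR or ts.hour < BACKUP_END_HOUR


def split_by_window(timestamps):
    """Return (inside, outside) counts for a list of warning timestamps."""
    inside = sum(1 for ts in timestamps if in_backup_window(ts))
    return inside, len(timestamps) - inside


if __name__ == "__main__":
    # Hypothetical warning times, for illustration only (not real log data).
    sample = [
        datetime(2009, 1, 14, 19, 5),
        datetime(2009, 1, 15, 2, 40),
        datetime(2009, 1, 15, 11, 15),
    ]
    inside, outside = split_by_window(sample)
    print(f"inside window: {inside}, outside window: {outside}")
```

    If most warnings land inside the window even though this server is not part of the backup job, that supports the idea that some piece of the I/O path is shared.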

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • You also mentioned RAID 6. I don't think you could possibly have a worse RAID level for writes: two parity disks. You might as well use a floppy, and I'd still expect it to be quicker!

    Transaction logs must be on RAID 1/10, not on RAID 5 or 6. Sadly my latest blog post on SAN contention didn't format very well - such are the joys of blogging! However, I've finished trying to benchmark a SAN but essentially have failed, because despite the hype it's pretty obvious from the wildly differing run times of the tests that there are problems. One point with a fibre channel network is that the backups tend to be pulled back across the same network and switches that you're using from SQL Server, and this will show as increased latency. I'd say the figures you have clearly show why you shouldn't use RAID 6! Essentially, if your log writes are slow or erratic in performance/latency, then your SQL Server will suffer. This is my blog post on my web site, where it stays almost correctly formatted - don't you just love HTML! http://www.grumpyolddba.co.uk/infrastructure/TrackingSANContention.htm

    I'm hoping to summarize my SAN testing on my blog this weekend.
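
    The RAID 6 write cost mentioned above can be roughed out with the usual rule-of-thumb write penalties (2 back-end I/Os per small random write for RAID 10, 4 for RAID 5, 6 for RAID 6). This is only a sketch: the 175 IOPS per 15K disk figure is an assumed ballpark, not a measurement from this SAN, the function name is hypothetical, and controller cache and full-stripe writes are ignored.

```python
# Rule-of-thumb back-end I/Os generated per small random host write.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}


def effective_write_iops(disks: int, iops_per_disk: float, level: str) -> float:
    """Estimate host-visible random write IOPS for a RAID group."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


if __name__ == "__main__":
    DISKS = 20            # drive count given in the thread
    IOPS_PER_DISK = 175   # ASSUMPTION: ballpark for a 15K drive
    for level in ("RAID10", "RAID5", "RAID6"):
        estimate = effective_write_iops(DISKS, IOPS_PER_DISK, level)
        print(f"{level}: ~{estimate:.0f} random write IOPS")
```

    Under these assumptions the same 20 spindles deliver about a third of the random write rate on RAID 6 that they would on RAID 10, which is the gap that hurts a transaction log.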

    The GrumpyOldDBA
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • .... you might as well use a floppy and I'd still expect it to be quicker!

    Now that's FUNNY :D:D:D:D


    * Noel

  • RAID 6?! I missed that.

    That, by itself, is going to give you horrendous write times, especially for a transaction log.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Only if you haven't been using the SAN I've been testing!

    The GrumpyOldDBA
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • I have to admit that the SANs I have worked with in the past were managed by people really trained for it, and they were from top vendors (HP, Hitachi).


    * Noel

  • colin Leversuch-Roberts (1/16/2009)


    Only if you haven't been using the SAN I've been testing!

    There are worse ways to kill a SAN's performance than RAID 6.

    Share the OLTP system's fibre switch with a data warehouse

    Slice the disks so that SQL's log is sharing drives with the Exchange server

    Misconfigure the HBAs so that they're running 1/4 the speed they're capable of

    Do synchronous mirroring of the SAN across a 128kb line to a secondary data centre 17 km away (minimum latency 200ms)

    noeld (1/16/2009)


    I have to admit that the SANs I have worked with in the past were managed by people really trained for it, and they were from top vendors (HP, Hitachi).

    You're lucky. I've worked with a couple of 'storage engineers' who didn't really know what they were doing, but knew that they knew more than anyone else. Net result: the list above.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • The link Colin provided above didn't work for me but this one did. Thanks for sharing - looking forward to the read.

    http://sqlblogcasts.com/blogs/grumpyolddba/archive/2009/01/14/tracking-contention-on-the-san-testing-times.aspx

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

  • Try dropping down from this menu:

    http://www.grumpyolddba.co.uk/Infrastructure/Infrastructure.htm

    It has details of the tests I've been using. Sadly I cannot name the SAN I've been testing. Shared LUNs are a problem; however, many SANs now use virtual LUNs which don't map to drives, so technically every LUN is shared.

    The GrumpyOldDBA
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

Viewing 15 posts - 16 through 30 (of 36 total)
