Disk Config issues

  • I found the following errors in our SQL Server.

    I/O error 1450(Insufficient system resources exist to complete the requested service.) detected during read at offset 0x0000017e358000 in file 'E:\Shared\Microsoft SQL Server\MSSQL\Data\2004Data.NDF'..

    I/O error 1450(Insufficient system resources exist to complete the requested service.) detected during read at offset 0x000001ec8ce000 in file 'F:\Shared\Microsoft SQL Server\MSSQL\Data\2004Q1Data2.NDF'..

    The SQL Server is a Dell 4-processor system with 4 GB of RAM, SAN attached (~600 GB of storage on a 15-drive RAID 5). The server hosts around 15 databases averaging 15 GB each. These errors show up when there is heavy load on the server; so far no data has been corrupted. These databases have a history of pounding the server they were hosted on before with similar errors, which is what prompted us to move to the SAN. But our vendor insisted that RAID 5 on 76 GB drives was good enough (I wanted RAID 10 and more, smaller drives). Now, 6 months after migrating to the new server, I find myself facing the same errors. One obvious solution is to ask for more drives, and maybe RAID 10. Can you think of anything else?
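
    In case it helps to confirm the "no corruption so far" point, a plain consistency check against the affected databases is the usual way to verify it; the database name below is only an example:

      -- Consistency check after one of the 1450 episodes; substitute the
      -- real database name. NO_INFOMSGS limits the output to actual errors.
      DBCC CHECKDB ('2004Data') WITH NO_INFOMSGS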

  • I'd look to increase the cache size on the SAN as well.



    Shameless self promotion - read my blog http://sirsql.net

  • I'd get MS to work with the SAN vendor on this. SQL is reporting a lower-level OS error.

  • You might want to make sure that the disks are on different controllers. Also, how many HBAs do you have? We have HBAs with a throughput of 100 MB/s, but according to SAN experts, the best throughput is achieved at around 50 MB/s.

  • We have exactly the same problems, running SQL 2000 with 2 GB of memory, Windows 2000 Advanced Server, and 4 CPUs. We are using SAN storage connected by 2 fibre cards. The databases range from 10 GB to 400 GB in size, for decision-support applications.

    Looking at:

    http://support.microsoft.com/default.aspx?scid=kb;en-us;828339

    it appears to be caused by a lower-level OS-related problem, possibly to do with

    http://support.microsoft.com/default.aspx?scid=kb;en-us;555068

    Cause

    Having the /3GB and /PAE switches enabled in boot.ini on a Windows 2000 Advanced Server with over 4 GB of RAM, and running SQL Server in a cluster with AWE enabled.

    Resolution

    Use only /PAE and the AWE setting by removing the /3GB switch from boot.ini. As an additional precaution, make sure to set the maximum memory available to SQL Server to prevent it from starving the operating system.

    Please also refer to

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;247904

    which describes which resources (System Page Table Entries, SystemPTE) fall below a certain threshold.

    You might want to monitor your SystemPTE levels while the box is under heavy load. A sketch of capping SQL Server's memory, as the resolution above suggests, follows below.
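
    On the resolution's point about capping SQL Server's memory, the knob is the 'max server memory' option; the 2560 MB value below is only an illustration, size it to leave your OS enough headroom:

      -- Cap the buffer pool so SQL Server cannot starve the OS; 2560 MB is
      -- just an example value. 'max server memory' is an advanced option,
      -- so it has to be exposed first.
      EXEC sp_configure 'show advanced options', 1
      RECONFIGURE
      EXEC sp_configure 'max server memory (MB)', 2560
      RECONFIGURE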

  • If the server has 2 GB of memory, why would you have /3GB, /PAE and AWE enabled? You should only use /3GB when the server has 4 GB of memory, and /PAE and AWE once you get above that.



    Shameless self promotion - read my blog http://sirsql.net

  • I suppose my quote was misleading. I was trying to point to areas that influence the System PTEs available to the box. With 4 GB of memory, the /3GB switch impacts the number of System PTEs available.

    Right now, we don't have the /3GB switch enabled.

  • Thanks to all who replied for your valuable input. Here is some additional info on this system configuration:

    I am using the /3GB switch only, since I have 3.5 GB of RAM and the system is a dedicated SQL Server. I am not using the /PAE switch since I am not over 4 GB. I ran perfmon: SQL Server is using ~2.7 GB, and there is ~500 MB of available memory, so I am discounting the possibility of the OS starving for memory. Nevertheless, I will remove the /3GB switch and see if that makes any difference.

    Last week I added 6 additional volumes to the system. Each volume is 65 GB carved from a 7-drive RAID array, so I effectively have 40 drives. Since this is a SAN, I am not the only user on those RAID arrays; in fact each of those RAID volumes is about 420 GB (72 GB drives) and I am allocated 65 GB. The SAN is an IBM Shark. It has 6 GB of read cache and 2 GB of write cache. Our SAN administrator has checked the SAN and drives and found no issues on his side.

    I started collecting some additional disk stats, and here they are. I ran fn_Virtualfilestats against the database while the index recreation job that generates those insufficient-resources errors was running, then took the delta between the disk stats from before and after the indexing job (roughly the kind of before/after query sketched at the end of this post).

    The read volume was 15,788,867,584 bytes over a 15-minute period, which translates into an average of only 17 MB/sec. The write volume was 18,536,996,352 bytes over the same period, an average of only 20 MB/sec.

    Then I worked with our SAN admin, who checked the stats on the fibre channels. We have two 2 Gbit fibre channels; a 2 Gbit fibre channel can handle a maximum transfer rate of about 120 MB per second. Each fibre channel shows ~21 MB/sec in and 41 MB/sec out. The Shark has 2 fibre adapters, shared by three servers, and shows about 22 MB/sec in and 43 MB/sec out. These are less than 20% of the capacity (120 MB/sec), I am told. Hence the finger points back at SQL. We also did throughput testing on those RAID volumes using IOMeter and the Microsoft disk-stressing tool, and it does ~60 MB/sec for reads using 64 K blocks. So we seem to be well within the capacity of all the subsystems.
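
    For anyone who wants to repeat the measurement, the before/after capture is roughly the following (SQL 2000 syntax, run from one session left open across the job; the database id of 7 and file id of 1 are placeholders):

      -- Snapshot the per-file I/O counters before the job. Look up your own
      -- database id and file ids with sp_helpdb and sysfiles.
      SELECT BytesRead, BytesWritten, GETDATE() AS SampleTime
      INTO   #before
      FROM   ::fn_virtualfilestats(7, 1)

      -- ... run the index recreation job, then, from the same session ...

      SELECT BytesRead, BytesWritten, GETDATE() AS SampleTime
      INTO   #after
      FROM   ::fn_virtualfilestats(7, 1)

      -- Delta expressed as MB/sec over the sample window.
      SELECT (a.BytesRead    - b.BytesRead)    / 1048576.0 / DATEDIFF(ss, b.SampleTime, a.SampleTime) AS ReadMBPerSec,
             (a.BytesWritten - b.BytesWritten) / 1048576.0 / DATEDIFF(ss, b.SampleTime, a.SampleTime) AS WriteMBPerSec
      FROM   #before AS b CROSS JOIN #after AS a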

  • I would be very surprised if SQL is the bottleneck. We have an EMC SAN connected to a 4-way, 2 GHz machine with 8 GB of RAM, Windows 2000, and 2 x 2 Gbit fibre channel with load-balancing software (PowerPath). We get a continuous throughput of 143 MB/sec and I have seen 153 MB/sec.

    To get this throughput we use SQL Litespeed to back up a large database to another disk on the SAN. When Litespeed backs up the database it compresses the file to 25% of the original size, which keeps us from being constrained by the writes back to disk. On an HP disk array we have had a throughput of 166 MB/sec on an 8-way machine. If I back up a smaller database twice in a row, the first one's throughput is 143 MB/sec and the second is over 300 MB/sec. We back up a 100 GB database in a little over 10 minutes, so I can confirm SQL Server is more than capable of high throughput. (A native-backup version of the same throughput check is sketched after the links below.)

    When we originally set up the SAN we were getting very similar performance to yours. Our configuration was 4 x 8 GB stripes to make up a 32 GB LUN. We now have 8 x 4 GB stripes to make up a 32 GB LUN, and we then stripe either 4 or 5 LUNs together at the OS level. You need to make your disk stripe smaller and spread it across more disks on the SAN.

    Get your SAN administrator to check the performance of the individual disks on the SAN which your database uses.  You will probably find you are exceeding the throughput capacity of the individual disks.

    See some other posts.

    http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=65&messageid=107712#bm111783

    http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=5&messageid=111959&p=2
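
    If Litespeed is not to hand, a plain native backup gives a comparable read-throughput figure, because the backup reads the data files more or less sequentially; backing up to the NUL device discards the output so only the read side is measured (database name is just an example):

      -- Rough sequential-read throughput check using a native backup.
      -- Writing to 'NUL' throws the backup away, so the data-file reads
      -- dominate; note the MB/sec figure reported when it completes.
      BACKUP DATABASE [2004Data] TO DISK = 'NUL' WITH STATS = 10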

  • Can you check the perfmon counter Disk Queue Length and see whether it exceeds 2 x the number of processors? Regarding the Microsoft stress-test utility, below is what Microsoft has to say:

    One thing it is definitely good for is determining whether there are problems with a caching controller. If the controller loses any reads or writes in its cache, we have problems.


  • Something you might want to check with the SAN team is whether you have striped or concatenated volumes. Also see if they see any hotspots in the SAN.

  • I agree that the SQL Server software cannot be the bottleneck. We have another server with a low-end IBM FAStT with 40 x 17 GB drives in two RAID 5 volumes (each with 20 drives, effectively 320 GB). It hosts a 250 GB database with 50 GB tables, which I reorganize every week; each 50 GB table takes about 3 hours and never times out. That SAN is dedicated to that server only. Ironically, that SAN is over 4 years old and has older drives, and probably other older components too. The drives on the Shark are 10K RPM, 76 GB drives. The disk queue gets to about 20 during high load (for a RAID volume with 6 or 7 drives), so it is not terrible. Another thing to note is that the MS I/O testing tool reported 60 MB/sec of I/O. As I have already mentioned, my indexing job is pushing about 37 MB/sec (17 MB read and 20 MB write), once again below the capacity. I would be happy if it took longer or ran a bit slower; what puzzles me is the timing out with the insufficient-resources errors. I am cautious not to hold the SAN responsible, since the other systems sharing the SAN are not reporting any of these problems.

    Mark: you mentioned that you do striping at the OS level of the volumes presented by the SAN. I did not do that, since our OS admin and I thought that might impose some additional load (software striping). So I am presenting each LUN to the database with the data files spread equally across those volumes (roughly the layout sketched at the end of this post). Since ours is a DW server, I can spread the data fairly evenly.

    Next week I am probably going to call MS (reaching the point where $250/hour may be worth it); it does seem to be the last resort.
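
    For completeness, the file layout described above is roughly the following; the database name, drive letters and sizes are made up for illustration, with one equally sized data file per SAN volume so SQL Server's proportional fill spreads the I/O without any OS-level striping:

      -- Hypothetical layout: one data file per LUN, equal sizes.
      ALTER DATABASE Sales2004 ADD FILEGROUP DATA_SPREAD
      GO
      ALTER DATABASE Sales2004
      ADD FILE
          (NAME = Sales2004_G, FILENAME = 'G:\MSSQL\Data\Sales2004_G.ndf', SIZE = 10240MB, FILEGROWTH = 1024MB),
          (NAME = Sales2004_H, FILENAME = 'H:\MSSQL\Data\Sales2004_H.ndf', SIZE = 10240MB, FILEGROWTH = 1024MB),
          (NAME = Sales2004_I, FILENAME = 'I:\MSSQL\Data\Sales2004_I.ndf', SIZE = 10240MB, FILEGROWTH = 1024MB)
      TO FILEGROUP DATA_SPREAD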
