February 19, 2003 at 10:54 pm
We had a similar scenario on our SAN. The SAN support folks said 'The array is nearly idle,' yet Win2K clearly seemed to be waiting on the SAN. It turned out there was a bottleneck in sending all I/Os to a single LUN. The SAN was presenting one large LUN to which Windows assigned a single drive letter. We could only get around 260 I/Os per second against a large RAID 10 array before queuing started, and we often saw more than 1,000 queued I/Os. We had 16 GB of cache and two FC adapters on the server (8-way, 8 GB), but we weren't seeing anything close to the performance we would expect from a similar number of ordinary SCSI drives.
When we did nothing more than separate the array into two LUNs (either by presenting two LUNs from the SAN and assigning two drive letters, or by using dynamic disks to combine them under a single drive letter), we doubled our throughput. All of our vendors just pointed fingers at each other. We were fortunate enough to be able to drastically decrease the load the application was putting on the SAN, so the issue went away. As we left it with our vendors, the best guess was some limitation in the Win2K SCSI port drivers or in how the FC adapters interacted with them. If you have any luck with MS, please let us know!
A couple of thoughts that may not mean much: generally, we like to see queued I/O stay below 2 per physical drive for any sustained period of time, so your numbers wouldn't seem that far out of line. You may also want to check '% Idle Time' instead of '% Disk Time', as the latter is usually unreliable in a SAN environment. However, if your end users are seeing slow response times and there appears to be no other bottleneck, then I would guess you are seeing the same issue we saw. Good luck!
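To make that rule of thumb concrete, here is a minimal sketch (Python, purely illustrative; the counter values, spindle count, and the 10% idle-time cutoff are assumptions you would replace with your own Performance Monitor data):

```python
# Rule-of-thumb check from the post above: sustained queued I/O should stay
# below ~2 per physical drive, and on a SAN a low '% Idle Time' is a better
# saturation hint than '% Disk Time'.

def lun_looks_saturated(avg_disk_queue_length: float,
                        spindles_behind_lun: int,
                        pct_idle_time: float) -> bool:
    """Return True if the LUN appears saturated under the rule of thumb.

    avg_disk_queue_length : sustained 'Avg. Disk Queue Length' for the LUN
    spindles_behind_lun   : physical drives backing the LUN (ask the SAN team)
    pct_idle_time         : sustained '% Idle Time' for the LUN
    """
    queue_per_spindle = avg_disk_queue_length / spindles_behind_lun
    return queue_per_spindle > 2.0 or pct_idle_time < 10.0  # 10% is an assumed cutoff

# Hypothetical numbers, loosely similar to the situation described above:
print(lun_looks_saturated(avg_disk_queue_length=30.0,
                          spindles_behind_lun=12,
                          pct_idle_time=5.0))   # True -> worth splitting into more LUNs
```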
February 20, 2003 at 1:32 am
I ran into a problem where one drive in our RAID 5 array was bad, but not in a way that would cause the controller to fail the drive. The drive was responding very slowly to write requests and kept telling the controller that it needed more time to service a request, so the controller never threw a timeout error. In fact, it never threw an error at all; it was just slow. During troubleshooting, we became suspicious after disbanding the array, configuring the drives as just a bunch of disks, and writing 2 GB files to each of them. One drive was noticeably slower. So we ran the diagnostic from the manufacturer and it *failed*. The lesson for me was that sometimes only the manufacturer's tests can detect a bad drive. For my Compaq, that meant running Seagate's SeaTools: http://www.seagate.com/support/seatools/index.html
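For anyone who wants to repeat that kind of test, here is a rough sketch (Python, my own illustration rather than the tool used above) that writes a large file to each candidate drive and times it, so an abnormally slow drive stands out. The drive letters and file size are assumptions; adjust them to your setup, and remember controller caching can skew small tests:

```python
import os
import time

# Write a large file sequentially to each drive and time it, so a drive that
# is "slow but not failed" stands out. Paths and size are only examples.
DRIVES = ["D:\\", "E:\\", "F:\\"]      # hypothetical JBOD drive letters
FILE_SIZE_MB = 2048                    # ~2 GB, as in the test described above
CHUNK = b"\0" * (1024 * 1024)          # 1 MB write chunk

for drive in DRIVES:
    path = os.path.join(drive, "write_test.tmp")
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(FILE_SIZE_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())           # make sure the data actually reaches the disk
    elapsed = time.time() - start
    print(f"{drive}: {FILE_SIZE_MB / elapsed:.1f} MB/s")
    os.remove(path)
```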
February 20, 2003 at 7:53 am
Thanks everyone for the insight on this issue.
I have spoken to the SAN Administrator(s) and we have decided to try the multiple LUN approach and see if that alleviates some of our issues.
I will post the results here.
February 22, 2003 at 10:34 am
I am using similar technology but on a smaller scale: we are running HP/Compaq's (whatever) MSA1000. A couple of things I watch very carefully are that the disk queue never exceeds 2x the physical disk count and that my average I/O rate never exceeds 80% of the physical disks' capacity. Compaq 10K drives are rated at about 130 I/Os per second. The other issue we faced when designing our disk subsystem was the difference between RAID 5 and RAID 10. With RAID 5 it was often better to have multiple smaller physical arrays for performance and fault tolerance, but with RAID 10, massive drive counts in a single physical array blew the RAID 5 arrays off the map in terms of performance. This might be a consideration for you.
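As a quick illustration of that arithmetic (my own sketch; the 130 I/Os per second rating and 80% ceiling are the numbers quoted above, while the RAID write penalties are the usual rule-of-thumb values, not something measured here):

```python
# Rough I/O ceiling for an array, using the rules of thumb above:
#   ~130 I/Os per second per 10K drive, keep sustained load under 80%,
#   and keep the disk queue under 2x the physical disk count.
# The write penalties are the standard rule-of-thumb values
# (RAID 10 = 2 physical writes per logical write, RAID 5 = 4).

def array_write_ceiling(disks: int, ios_per_disk: float = 130.0,
                        utilization_cap: float = 0.8,
                        write_penalty: int = 2) -> float:
    """Approximate sustainable write I/Os per second for the array."""
    return disks * ios_per_disk * utilization_cap / write_penalty

print("RAID 10, 14 disks:", array_write_ceiling(14, write_penalty=2))  # ~728 writes/sec
print("RAID 5,  14 disks:", array_write_ceiling(14, write_penalty=4))  # ~364 writes/sec
print("Queue length ceiling for 14 disks:", 2 * 14)
```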
February 24, 2003 at 2:17 am
We had nearly the same issue with a Compaq MA8000 and ProLiant servers running W2K and SQL 2000 Enterprise.
It turned out that W2K SP3 was buggy in environments with Compaq disk arrays (some hotfixes that were issued for SP2 are missing from SP3).
After we rolled back to SP2 (on W2K), the problem disappeared.
I would suggest checking whether one of your recent upgrades could have caused this issue.
As a test, I would roll back the upgrades step by step to find out which one was buggy.
Bye,
Gabor
February 25, 2003 at 12:06 pm
After reviewing the previous suggestions to add more LUNs to the environment, I decided to test spreading one database across two LUNs, then three, then four.
The results were very definitive. The Performance Monitor metrics for each phase are posted below so you can see the differences (a short sketch for summarizing such logs follows the tables):
Phase 1: One LUN for Data (as represented in the production environment)

Averages (PhysicalDisk counters)       Drive D
Avg. Disk Queue Length                 1.548
Avg. Disk Read Queue Length            0.145
Avg. Disk Write Queue Length           1.403
Current Disk Queue Length              1.581

Maximums (PhysicalDisk counters)       Drive D
Avg. Disk Queue Length                 30.988
Avg. Disk Read Queue Length            3.327
Avg. Disk Write Queue Length           30.988
Current Disk Queue Length              97.000

Phase 2: Two LUNs for Data

Averages (PhysicalDisk counters)       Drive D    Drive J
Avg. Disk Queue Length                 1.000      0.986
Avg. Disk Read Queue Length            0.000      0.000
Avg. Disk Write Queue Length           1.000      0.986
Current Disk Queue Length              0.974      0.795

Maximums (PhysicalDisk counters)       Drive D    Drive J
Avg. Disk Queue Length                 20.574     20.610
Avg. Disk Read Queue Length            0.000      0.000
Avg. Disk Write Queue Length           20.574     20.610
Current Disk Queue Length              53.000     52.000

Phase 3: Three LUNs for Data

Averages (PhysicalDisk counters)       Drive D    Drive J    Drive K
Avg. Disk Queue Length                 0.424      0.421      0.421
Avg. Disk Read Queue Length            0.000      0.000      0.000
Avg. Disk Write Queue Length           0.424      0.421      0.421
Current Disk Queue Length              0.323      0.213      0.210

Maximums (PhysicalDisk counters)       Drive D    Drive J    Drive K
Avg. Disk Queue Length                 8.567      8.856      8.844
Avg. Disk Read Queue Length            0.000      0.000      0.000
Avg. Disk Write Queue Length           8.567      8.856      8.844
Current Disk Queue Length              25.000     29.000     27.000

Phase 4: Four LUNs for Data

Averages (PhysicalDisk counters)       Drive D    Drive J    Drive K    Drive L
Avg. Disk Queue Length                 0.426      0.423      0.423      0.422
Avg. Disk Read Queue Length            0.000      0.000      0.000      0.000
Avg. Disk Write Queue Length           0.426      0.423      0.423      0.422
Current Disk Queue Length              0.443      0.429      0.451      0.463

Maximums (PhysicalDisk counters)       Drive D    Drive J    Drive K    Drive L
Avg. Disk Queue Length                 8.681      8.446      8.862      8.697
Avg. Disk Read Queue Length            0.000      0.000      0.000      0.000
Avg. Disk Write Queue Length           8.681      8.446      8.862      8.697
Current Disk Queue Length              19.000     23.000     21.000     21.000
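For anyone repeating this test: the averages and maximums above can be pulled out of a Performance Monitor CSV log with a short script. Here is a rough sketch (Python; the log file name and the exact counter-column headers are assumptions about how the log was exported and will differ on your system):

```python
import csv

# Summarize per-drive queue-length columns from a Performance Monitor CSV log.
# Column headers in a CSV export look roughly like
#   \\SERVER\PhysicalDisk(1 D:)\Avg. Disk Queue Length
# but the exact form depends on how the log was captured/exported.
LOG_FILE = "san_test_phase1.csv"            # hypothetical export
COUNTER = "Avg. Disk Queue Length"

with open(LOG_FILE, newline="") as f:
    rows = list(csv.reader(f))

header, samples = rows[0], rows[1:]
for col, name in enumerate(header):
    if COUNTER not in name:
        continue
    values = [float(r[col]) for r in samples if r[col].strip()]
    if values:
        print(f"{name}: avg={sum(values)/len(values):.3f}  max={max(values):.3f}")
```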