February 19, 2003 at 10:54 pm
We had a similar scenario on our SAN. The SAN support folks said 'The array is nearly idle,' yet Win2K clearly seemed to be waiting on the SAN. It turned out there was a bottleneck in sending all I/Os to a single LUN. The SAN was presenting one large LUN to which Windows assigned a single drive letter. We could only get around 260 I/Os per second against a large RAID 10 array before queuing started, and we often saw more than 1,000 queued I/Os. We had 16 GB of cache and two FC adapters on the server (8-way, 8 GB), but we weren't seeing anything close to the performance we would expect from a similar number of ordinary SCSI drives.
When we did nothing more than separate the array into two LUNs (either by presenting two LUNs from the SAN and assigning two drive letters, or by using dynamic disks to combine them under a single drive letter), we doubled our throughput. All of our vendors just pointed fingers at each other. We were fortunate enough to be able to drastically decrease the load the application was putting on the SAN, so the issue went away. As we left it with our vendors, the best guess was some limitation in the Win2K SCSI port drivers or in how the FC adapters interacted with them. If you have any luck with MS, please let us know!
A couple of thoughts that may not mean much: generally, we like to see queued I/O stay below 2 per physical drive for any sustained period of time, so your numbers wouldn't seem that far out of line. You may also want to check '% Idle Time' instead of '% Disk Time', as the latter is usually unreliable in a SAN environment. However, if your end users are seeing slow response times and there appears to be no other bottleneck, then I would guess you are seeing the same issue we saw. Good luck!
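To make that rule of thumb concrete, here is a minimal sketch (Python, purely illustrative; the counter values, spindle count, and the 10% idle-time cutoff are assumptions you would replace with your own Performance Monitor data):

```python
# Rule-of-thumb check from the post above: sustained queued I/O should stay
# below ~2 per physical drive, and on a SAN a low '% Idle Time' is a better
# saturation hint than '% Disk Time'.

def lun_looks_saturated(avg_disk_queue_length: float,
                        spindles_behind_lun: int,
                        pct_idle_time: float) -> bool:
    """Return True if the LUN appears saturated under the rule of thumb.

    avg_disk_queue_length : sustained 'Avg. Disk Queue Length' for the LUN
    spindles_behind_lun   : physical drives backing the LUN (ask the SAN team)
    pct_idle_time         : sustained '% Idle Time' for the LUN
    """
    queue_per_spindle = avg_disk_queue_length / spindles_behind_lun
    return queue_per_spindle > 2.0 or pct_idle_time < 10.0  # 10% is an assumed cutoff

# Hypothetical numbers, loosely similar to the situation described above:
print(lun_looks_saturated(avg_disk_queue_length=30.0,
                          spindles_behind_lun=12,
                          pct_idle_time=5.0))   # True -> worth splitting into more LUNs
```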
February 20, 2003 at 1:32 am
I ran into a problem where one drive in our RAID 5 array was bad, but not in a way that would cause the controller to fail the drive. The drive was responding very slowly to write requests and kept telling the controller that it needed more time to service a request, so the controller never threw a timeout error. In fact, it never threw an error at all; it was just slow. During troubleshooting, we became suspicious after disbanding the array, configuring the drives as just a bunch of disks, and writing 2 GB files to each of them. One drive was noticeably slower. So we ran the diagnostic from the manufacturer and it *failed*. The lesson for me was that sometimes only the manufacturer's tests can detect a bad drive. For my Compaq, that meant running Seagate's SeaTools: http://www.seagate.com/support/seatools/index.html
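For anyone who wants to repeat that kind of test, here is a rough sketch (Python, my own illustration rather than the tool used above) that writes a large file to each candidate drive and times it, so an abnormally slow drive stands out. The drive letters and file size are assumptions; adjust them to your setup, and remember controller caching can skew small tests:

```python
import os
import time

# Write a large file sequentially to each drive and time it, so a drive that
# is "slow but not failed" stands out. Paths and size are only examples.
DRIVES = ["D:\\", "E:\\", "F:\\"]      # hypothetical JBOD drive letters
FILE_SIZE_MB = 2048                    # ~2 GB, as in the test described above
CHUNK = b"\0" * (1024 * 1024)          # 1 MB write chunk

for drive in DRIVES:
    path = os.path.join(drive, "write_test.tmp")
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(FILE_SIZE_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())           # make sure the data actually reaches the disk
    elapsed = time.time() - start
    print(f"{drive}: {FILE_SIZE_MB / elapsed:.1f} MB/s")
    os.remove(path)
```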
February 20, 2003 at 7:53 am
Thanks everyone for the insight on this issue.
I have spoken to the SAN Administrator(s) and we have decided to try the multiple LUN approach and see if that alleviates some of our issues.
I will post the results here.
February 22, 2003 at 10:34 am
I am using similar technology but on a smaller scale: we are running HP/Compaq's (whatever) MSA1000. A couple of things I watch very carefully are that the disk queue never exceeds 2x the physical disk count and that my average I/O rate never exceeds 80% of the physical disks' capacity. Compaq 10K drives are rated at about 130 I/Os per second. The other issue we faced when designing our disk subsystem was the difference between RAID 5 and RAID 10. With RAID 5 it was often better to have multiple smaller physical arrays for performance and fault tolerance, but with RAID 10, massive drive counts in a single physical array blew the RAID 5 arrays off the map in terms of performance. This might be a consideration for you.
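As a quick illustration of that arithmetic (my own sketch; the 130 I/Os per second rating and 80% ceiling are the numbers quoted above, while the RAID write penalties are the usual rule-of-thumb values, not something measured here):

```python
# Rough I/O ceiling for an array, using the rules of thumb above:
#   ~130 I/Os per second per 10K drive, keep sustained load under 80%,
#   and keep the disk queue under 2x the physical disk count.
# The write penalties are the standard rule-of-thumb values
# (RAID 10 = 2 physical writes per logical write, RAID 5 = 4).

def array_write_ceiling(disks: int, ios_per_disk: float = 130.0,
                        utilization_cap: float = 0.8,
                        write_penalty: int = 2) -> float:
    """Approximate sustainable write I/Os per second for the array."""
    return disks * ios_per_disk * utilization_cap / write_penalty

print("RAID 10, 14 disks:", array_write_ceiling(14, write_penalty=2))  # ~728 writes/sec
print("RAID 5,  14 disks:", array_write_ceiling(14, write_penalty=4))  # ~364 writes/sec
print("Queue length ceiling for 14 disks:", 2 * 14)
```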
February 24, 2003 at 2:17 am
We had nearly the same issue with a Compaq MA8000 and ProLiant servers running W2K and SQL 2000 Enterprise.
It turned out that W2K SP3 was buggy in environments with Compaq disk arrays (some hotfixes that were issued for SP2 are missing from SP3).
After we rolled back to SP2 (on W2K), the problem disappeared.
I would suggest checking whether one of your recent upgrades could have caused this issue.
As a test, I would roll back the upgrades step by step to find out which one was buggy.
Bye,
Gabor
February 25, 2003 at 12:06 pm
After reviewing the previous suggestions to add more LUNs to the environment, I decided to test spreading one database across two LUNs, then three, then four.
The results were very definitive. The Performance Monitor metrics for each phase are posted below so you can see the differences (a short sketch for summarizing such logs follows the tables):
Phase 1: One LUN for Data (as represented in the production environment)

Averages (PhysicalDisk counters)       Drive D
Avg. Disk Queue Length                 1.548
Avg. Disk Read Queue Length            0.145
Avg. Disk Write Queue Length           1.403
Current Disk Queue Length              1.581

Maximums (PhysicalDisk counters)       Drive D
Avg. Disk Queue Length                 30.988
Avg. Disk Read Queue Length            3.327
Avg. Disk Write Queue Length           30.988
Current Disk Queue Length              97.000

Phase 2: Two LUNs for Data

Averages (PhysicalDisk counters)       Drive D    Drive J
Avg. Disk Queue Length                 1.000      0.986
Avg. Disk Read Queue Length            0.000      0.000
Avg. Disk Write Queue Length           1.000      0.986
Current Disk Queue Length              0.974      0.795

Maximums (PhysicalDisk counters)       Drive D    Drive J
Avg. Disk Queue Length                 20.574     20.610
Avg. Disk Read Queue Length            0.000      0.000
Avg. Disk Write Queue Length           20.574     20.610
Current Disk Queue Length              53.000     52.000

Phase 3: Three LUNs for Data

Averages (PhysicalDisk counters)       Drive D    Drive J    Drive K
Avg. Disk Queue Length                 0.424      0.421      0.421
Avg. Disk Read Queue Length            0.000      0.000      0.000
Avg. Disk Write Queue Length           0.424      0.421      0.421
Current Disk Queue Length              0.323      0.213      0.210

Maximums (PhysicalDisk counters)       Drive D    Drive J    Drive K
Avg. Disk Queue Length                 8.567      8.856      8.844
Avg. Disk Read Queue Length            0.000      0.000      0.000
Avg. Disk Write Queue Length           8.567      8.856      8.844
Current Disk Queue Length              25.000     29.000     27.000

Phase 4: Four LUNs for Data

Averages (PhysicalDisk counters)       Drive D    Drive J    Drive K    Drive L
Avg. Disk Queue Length                 0.426      0.423      0.423      0.422
Avg. Disk Read Queue Length            0.000      0.000      0.000      0.000
Avg. Disk Write Queue Length           0.426      0.423      0.423      0.422
Current Disk Queue Length              0.443      0.429      0.451      0.463

Maximums (PhysicalDisk counters)       Drive D    Drive J    Drive K    Drive L
Avg. Disk Queue Length                 8.681      8.446      8.862      8.697
Avg. Disk Read Queue Length            0.000      0.000      0.000      0.000
Avg. Disk Write Queue Length           8.681      8.446      8.862      8.697
Current Disk Queue Length              19.000     23.000     21.000     21.000
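For anyone repeating this test: the averages and maximums above can be pulled out of a Performance Monitor CSV log with a short script. Here is a rough sketch (Python; the log file name and the exact counter-column headers are assumptions about how the log was exported and will differ on your system):

```python
import csv

# Summarize per-drive queue-length columns from a Performance Monitor CSV log.
# Column headers in a CSV export look roughly like
#   \\SERVER\PhysicalDisk(1 D:)\Avg. Disk Queue Length
# but the exact form depends on how the log was captured/exported.
LOG_FILE = "san_test_phase1.csv"            # hypothetical export
COUNTER = "Avg. Disk Queue Length"

with open(LOG_FILE, newline="") as f:
    rows = list(csv.reader(f))

header, samples = rows[0], rows[1:]
for col, name in enumerate(header):
    if COUNTER not in name:
        continue
    values = [float(r[col]) for r in samples if r[col].strip()]
    if values:
        print(f"{name}: avg={sum(values)/len(values):.3f}  max={max(values):.3f}")
```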