June 24, 2011 at 4:44 am
My hardware team and SAN vendor are telling me that our SAN controllers are saturated and adding spindles won't help. Our vendor is proposing we purchase a second SAN. We are running about 3 TB of databases against the current SAN, as well as Exchange.
So now I need to dig in and see this with my own eyes.
I am running two DB servers, one on Windows Server 2003 and one on 2008. I set up identical PerfMon counters on each box, per database drive and log drive:
- Avg. Disk sec/Read
- Avg. Disk sec/Write
- Avg. Disk Queue Length
Anything else that I should be looking at? (A sketch of logging these counters with typeperf is below.) I considered using SQLIO, but from what I understand it's not recommended to run on a production system.
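For reference, a minimal sketch of logging those same counters to CSV with typeperf so the samples can be lined up against the vendor's reports later. The sample interval, sample count, output path and the extra Disk Transfers/sec counter (added to capture IOPS alongside latency) are assumptions, not anything from the thread:

REM Hypothetical example - adjust the interval, sample count and output path to suit.
REM Samples every 15 seconds, 240 samples (about an hour), written to CSV for later comparison.
typeperf "\LogicalDisk(*)\Avg. Disk sec/Read" ^
         "\LogicalDisk(*)\Avg. Disk sec/Write" ^
         "\LogicalDisk(*)\Avg. Disk Queue Length" ^
         "\LogicalDisk(*)\Disk Transfers/sec" ^
         -si 15 -sc 240 -o C:\perflogs\disk_baseline.csv

Run the same command on both servers so the two boxes can be compared over the same window.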
June 24, 2011 at 4:51 am
If the SAN controllers are maxed, you need to be doing this at the SAN level. That way it should be able to tell you which RAID groups and LUNs are pushing the most IO through the controllers, which you can then trace back to a particular host and from there to individual databases. If you have split LUNs per database file, you can pinpoint it back to the database without having to monitor the host at all.
The SAN vendor is correct: adding more spindles will only push more IO through the controllers, so the bottleneck is not the disks, it is the SAN controllers themselves.
June 24, 2011 at 5:01 am
anthony.green (6/24/2011)
If the SAN controllers are maxed, you need to be doing this at the SAN level. That way it should be able to tell you which RAID groups and LUNs are pushing the most IO through the controllers, which you can then trace back to a particular host and from there to individual databases. If you have split LUNs per database file, you can pinpoint it back to the database without having to monitor the host at all. The SAN vendor is correct: adding more spindles will only push more IO through the controllers, so the bottleneck is not the disks, it is the SAN controllers themselves.
Got ya. My team doesn't have a SAN-level monitoring utility. We sent the vendor encrypted trace data from one of their utilities, and they formatted it into a packet of reports based on a three-day sample.
Can you recommend an inexpensive or possibly free utility for monitoring a SAN?
June 24, 2011 at 6:30 am
Unfortunately no, each SAN is different. My background with SANs as a DBA is with an EMC CX4-960, which has its own monitoring tools built in: you go to a website running on the SAN which gives you the information on it.
Could you not ask the vendor to give you an IO breakdown from the data you send them? They should be able to tell you whereabouts the issue is arising.
They should be able to say that this RAID group / this LUN / this host accounts for, say, 90% of the total IO, etc.
June 24, 2011 at 8:53 am
Seeing that kind of problem is not easy, first of all because you need a baseline to compare against and a load that can be reproduced.
When monitoring the IO of SQL you have too many variables; that's why it's easier to just find some time to use SQLIO.
SQLIO can be used in a production environment as long as you keep in mind it'll hit the LUN as hard as it can, so either use an unused LUN (a backup disk, for instance) or have the SAN team lend you an extra device.
A single SQLIO run on a single LUN should not impact the SAN enough to be noticed.
Make a batch with different SQLIO runs and save all the info (PerfMon and the SQLIO summary). Run it at different times of the day (when everyone logs in, at backup time, during the big jobs, at noon, etc.), with different configurations (different tiers, different numbers of spindles, etc.) and start comparing.
Make sure you compare sequential and random reads and writes, and that the SQLIO data file is big enough to avoid being served from cache (although in this case it doesn't matter much). A sketch of such a batch follows.
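For illustration, a minimal sketch of such a batch, assuming the test file is defined in a param.txt parameter file (path, thread count, affinity mask, size in MB per line). The file name, drive letter, durations and queue depths below are placeholders, not values from this thread:

REM Hypothetical batch - T:\sqliotest.dat, durations and queue depths are placeholders.
REM param.txt contains one line per test file, e.g.:  T:\sqliotest.dat 4 0x0 20480
REM (a 20 GB test file here, to stay out of the array cache)

REM Random 8 KB reads and writes (OLTP-style pattern)
sqlio -kR -s120 -frandom -o8 -b8 -LS -Fparam.txt >> sqlio_results.txt
sqlio -kW -s120 -frandom -o8 -b8 -LS -Fparam.txt >> sqlio_results.txt

REM Sequential 64 KB reads and writes (scan/backup-style pattern)
sqlio -kR -s120 -fsequential -o8 -b64 -LS -Fparam.txt >> sqlio_results.txt
sqlio -kW -s120 -fsequential -o8 -b64 -LS -Fparam.txt >> sqlio_results.txt

The -LS switch adds latency statistics to the summary, which is the number to watch alongside IOs/sec when comparing runs.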
Focus on the number of IOs/sec and the latency; in PerfMon they should flat-line or at least be very regular. If they are jumping all over the place then you've already found a problem (or something else is using your device).
If the FA (front-end adapter) on the SAN is saturated, you should see higher latency for a given IO/sec, and increasing the number of LUNs involved (or spindles behind your LUN) doesn't help.
On a heavily saturated FA you'll get horrible latency at low IO/sec, like 200 ms for fewer than 30 IOs/sec.
If you have no baseline to compare your numbers to, you can just run your batch at a time when you have as few things as possible running (most likely during the night or at noon).
Whatever happens, keep your baseline and your results; that's always useful.
June 24, 2011 at 9:04 am
While I agree with Oliiii that using SQLIO is a good way to get a baseline, I doubt you will want to do that at this moment in time, as it won't make a difference in this case: it's not a particular server which is having IO issues, it's the storage controllers in the SAN which aren't up to scratch.
It could be a mixture of a number of things. Say you have three servers and they each use a third of the IO capacity across the SAN; adding a fourth server will max out the IO and kill the SAN.
I think it is a case of the use of the SAN growing quicker than expected, with more and more stuff put on it without thinking about the impact on the SAN.
I would ask the vendor to give you a breakdown of IO by LUN, tied back to its RAID group and then to its host. That way you can see which host is throwing the most IO at the SAN, so you can then look for alternative places to store that particular host's data, etc.
June 24, 2011 at 10:58 am
anthony.green (6/24/2011)
I think it is a case of the use of the SAN growing quicker than expected, with more and more stuff put on it without thinking about the impact on the SAN.
I would ask the vendor to give you a breakdown of IO by LUN, tied back to its RAID group and then to its host. That way you can see which host is throwing the most IO at the SAN, so you can then look for alternative places to store that particular host's data, etc.
That's exactly how it happened, and very common I suspect. I am waiting on more reports from the vendor; from the three-day sample report, everything points back to one host. We're trying to tone down as much activity as possible on those databases, as well as doing heavy index maintenance this weekend.
We also changed the file allocation unit size from 16 KB to 64 KB on the SAN a couple of nights ago. If none of this helps, then we are looking into DAS for the host that's overloading the system.
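Assuming the allocation-size change refers to the NTFS allocation unit size on the data volumes (the thread doesn't say which layer was changed), a quick way to confirm it took effect; the drive letter is a placeholder:

REM Hypothetical check - replace T: with the data volume in question.
REM "Bytes Per Cluster" in the output is the allocation unit size (65536 = 64 KB).
fsutil fsinfo ntfsinfo T: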