SAN Performance Issue??

Question

SAN Performance Issue??

leonp

SSCommitted

Points: 1962
More actions
May 7, 2008 at 4:28 pm

#123720

Hi,
We are experiencing database write slowdowns in our SQL Server 2000 (Standard) / Win2K3 server environment in Production. There is a corresponding Test instance which seems to be doing much better. So the problem seems to be specific to Production. In looking at the perfmon logs, it seems that the disk write latencies for the log files disk and the database file disk have increased from 6 ms to 24 ms. I am told that perfmon is not very accurate when it comes to reporting SAN utilization, but I am inclined to going by these numbers since that is the only way I can explain the progressive degradation in processing time for my application from 30 hours to 65 hours to 90+ hours now. There have been no application changes since the progressive degradation set in.
Of course IBM SAN support is all for disproving my claim, but honestly, since I administer the application (Informatica) on a Solaris server and the SQL 2000 Database on the Windows server, i can safely attest that there were no changes made to the application during this time. Is there something I could do to conclusively prove that SAN is not performing up to expectation?
Your help/input is greatly appreciated!
Leon

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

RBarryYoung SSC Guru Points: 143329 More actions · Answer 1

What kind of SAN is it? Are there other servers using it also?

If so, then your slowdown could be caused by contention with those other servers for the same physical resources. Alternatively, you could be getting clobbered by some maintenance task (like backup?) on the SAN itself.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

leonp SSCommitted Points: 1962 More actions · Answer 2

Barry

Thanks! Its a Fast T. I guess it is being used by other servers too. So you are thinking contention could potentially cause this. I will bring this up with the IBM team tomorrow.

Leon

leonp SSCommitted Points: 1962 More actions · Answer 3

I dont understand something with these perfmon numbers here .....

During the actual production scheduled run, disk sec/write were around 27 ms for the database and the write throughput was 180 KBps. When I tried to recreate this for the IBM SAN team to investigate, the throughput remained around 220 however, the disk write latency was back to 6 ms. How is that possible?

I was planning to take this back to the SAN support telling them that I could still see disk latencies in the range of 20+ ms. The disk latencies seem to have returned to normal values, but not the throughput. Any insights???

Leon

leonp SSCommitted Points: 1962 More actions · Answer 4

I guess I overlooked something .. the number of write operations per second .. they have increased from 8 (when this ran per schedule) to 20. does look like contention? doesn't it?

RBYoung Right there with Babe Points: 742 More actions · Answer 5

leonp (5/7/2008)
I dont understand something with these perfmon numbers here .....
During the actual production scheduled run, disk sec/write were around 27 ms for the database and the write throughput was 180 KBps. When I tried to recreate this for the IBM SAN team to investigate, the throughput remained around 220 however, the disk write latency was back to 6 ms. How is that possible?
Leon

Throughput(KB/s) is probably less important that IO Rate(IO/s).

It seems to me that there may be a contending task/job/workload on another server that is scheduled to run at the same time as your production run's schedule.

leonp SSCommitted Points: 1962 More actions · Answer 6

I am inclined to believe so too .. Is there any reason why the sql server will be responsible for this slowdown? I want to make sure we (SQL team) are not missing on anything before we start pursuing this with the SAN support

RBarryYoung SSC Guru Points: 143329 More actions · Answer 7

leonp (5/8/2008)
I am inclined to believe so too .. Is there any reason why the sql server will be responsible for this slowdown? I want to make sure we (SQL team) are not missing on anything before we start pursuing this with the SAN support

The easiest way to differentiate between in-box vs. off-box likely causes is to compare the system performance profiles wrt the Disk for the good runs vs. the bad runs. This should include: total-run-time, total-io to that disk(reads & writes), io/s to that disk(reads & writes), throughput(R & W), Disks Idle%, and IO response time.

The purpose of this is to discriminate between two possibilites: am I(meaning the machine that this executes on) sending more IO's to the disk or is the disk just slower with the same amount of IO? The former indicates an in-box problem, the latter indicates an off-box problem.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 8

leonp (5/8/2008)
I am inclined to believe so too .. Is there any reason why the sql server will be responsible for this slowdown? I want to make sure we (SQL team) are not missing on anything before we start pursuing this with the SAN support

In addition to Barry's comments - most HBA's have their own way to monitor traffic (for example - on the ones we use - you can see the # of queued IO's and see how long they take to clear out). In an ideal scenario, you'd have NO queued IO's, but most of the time, they queue and should dissipate just about instantaneously. Sustained periods with high queued IO's can often mean perf issue @ the SAN.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

leonp SSCommitted Points: 1962 More actions · Answer 9

One of the findings IBM reported was that the Production server was connected to the SAN through a single attached switch .... (not sure if I am saying this right) ...

The Test server which is performing to expectation (and actually much better than Production) we is dual attached to the SAN

We are requesting a change/correction to see the amount of improvement it yields

Leon

Harry G. Newbie Points: 5 More actions · Answer 10

Leon,

What the solution for your problem?

We have a similar case where our application/DB performance has degraged after new SAN was installed which has 8X450GB drives vs 16X75GB drives.

Thanks

Harry