Do I have a disk bottleneck?

  • I'm trying to improve the performance of our SQL Server environment - for some reason, the disk throughput we see in production is not what I'm expecting. I'm not hugely experienced in this kind of analysis... can anyone help me please 🙂

    Background

    I've used SQLIO to test the throughput capability of our SAN storage. The results are pretty good (see attached) - For sequential I/O I get over 400 MB/s. The strange thing is that we hardly ever get this rate of throughput in production.

    If I run a test query (see attached perfmon stats) which reads data from a large table and then writes it to another table, I only see 30-40 MB/s during the operation. However, when there is a checkpoint I see good throughput of about 400 MB/s. NB: the table I'm writing to is in a freshly created database with pre-sized MDF/LDF files, and it has no indexes.
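
    For illustration, the operation is roughly this shape (the table and database names here are made up, not our real ones):

        -- Read from a large table and write the rows to an empty heap
        -- in a freshly created, pre-sized database (no indexes on the target)
        INSERT INTO NewDB.dbo.TargetHeap
        SELECT *
        FROM   ProdDB.dbo.LargeSourceTable;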

    I've collected SQL wait stats for the duration of the operation (see the top stats by wait_time_ms below). From what I've read, these indicate a disk bottleneck, but I know the disks are not being fully utilised!?
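
    In case it helps, this is roughly how I captured them - I reset the cumulative counters just before the run (note the reset is server-wide), ran the query, then read back the top waits:

        -- Reset the cumulative wait stats before the test run
        DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

        -- ... run the test query ...

        -- Afterwards, read back the top waits by total wait time
        SELECT TOP (15)
               wait_type,
               waiting_tasks_count,
               wait_time_ms,
               max_wait_time_ms,
               signal_wait_time_ms
        FROM   sys.dm_os_wait_stats
        ORDER BY wait_time_ms DESC;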

    My thoughts

    I wondered about a possible CPU bottleneck, but the wait stats don't indicate this and I wouldn't expect this to be a CPU-intensive operation. I don't suspect memory either, as there is no paging and Page Life Expectancy is very high.
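
    (For what it's worth, I checked PLE from the standard counter DMV rather than perfmon - just a quick sanity check:)

        -- Page Life Expectancy, in seconds, from the Buffer Manager object
        SELECT object_name, counter_name, cntr_value AS ple_seconds
        FROM   sys.dm_os_performance_counters
        WHERE  counter_name LIKE N'Page life expectancy%'
        AND    object_name  LIKE N'%Buffer Manager%';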

    We have separate LUNs presented for Data (vRAID5), Logs (vRAID1) and TempDB (vRAID1). All these LUNs reside on a single pool of disks on our SAN (i.e. they all share the same spindles), so I wondered about contention. The thing is, the behaviour is consistent and the log file writes are so uniform. During the log write, there is no other disk I/O going on... it makes me think the bottleneck is elsewhere.
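
    Something I can check from SQL Server's side is the per-file latency it records itself, to see whether any of the three LUNs is actually reporting high stalls (figures are cumulative since instance startup):

        -- Average read/write latency per database file
        SELECT DB_NAME(vfs.database_id) AS database_name,
               mf.physical_name,
               vfs.num_of_reads,
               vfs.num_of_writes,
               vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
               vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
        FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN   sys.master_files AS mf
               ON  mf.database_id = vfs.database_id
               AND mf.file_id     = vfs.file_id
        ORDER BY avg_write_ms DESC;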

    Does anyone have any suggestions? Why aren't my reads and log writes faster?

    Top wait stats taken during the query

    [font="Courier New"]wait_type_______________________waiting_tasks_count__wait_time_ms___max_wait_time_ms___signal_wait_time_ms

    WRITELOG________________________2385_________________7530___________85_________________37

    PREEMPTIVE_OS_WRITEFILEGATHER___789__________________3280___________85_________________0

    PREEMPTIVE_OS_FILEOPS___________789__________________2770___________213________________0

    SLEEP_BPOOL_FLUSH_______________1246_________________926____________30_________________20

    PAGEIOLATCH_SH__________________45___________________430____________84_________________1

    SOS_SCHEDULER_YIELD_____________13413________________331____________3__________________326

    PAGEIOLATCH_UP__________________52___________________242____________66_________________0

    PAGEIOLATCH_EX__________________35___________________219____________26_________________0

    PREEMPTIVE_OS_GETDISKFREESPACE__789__________________127____________1__________________0

    PREEMPTIVE_OS_FLUSHFILEBUFFERS__789__________________92_____________6__________________0

    WRITE_COMPLETION________________12___________________63_____________14_________________0

    LATCH_SH________________________1____________________24_____________24_________________0

    PAGELATCH_UP____________________2____________________9______________7__________________0[/font]

  • I would strongly recommend having a look at How to prove it's a SAN problem by Brent Ozar.



    Ole Kristian Velstadbråten Bangås - Virinco - Facebook - Twitter

    Concatenating Row Values in Transact-SQL

  • When you run SQLIO tests, it is important to make the test file large enough to give a true result - especially for random read and random write - so that you overcome the impact of the SAN cache.

    For example, if your database will be 50 GB, use a 50 GB SQLIO test file.
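
    As a rough sketch (the path, size and duration here are placeholders - adjust for your environment), the param file tells SQLIO where the test file lives and how big it should be, in MB. A 50 GB file is 51200 MB:

        E:\sqlio_test.dat 4 0x0 51200

    Then, for example, a 5-minute random write test with 8 KB blocks and 8 outstanding I/Os:

        sqlio -kW -s300 -frandom -o8 -b8 -LS -Fparam.txt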
