Do I have a disk bottleneck?

  • I'm trying to improve the performance of our SQL Server environment - for some reason, the disk throughput we see in production is not what I'm expecting. I'm not hugely experienced in this kind of analysis... can anyone help me please 🙂

    Background

    I've used SQLIO to test the throughput capability of our SAN storage. The results are pretty good (see attached) - For sequential I/O I get over 400 MB/s. The strange thing is that we hardly ever get this rate of throughput in production.

    If I run a test query (see attached perfmon stats) which reads data from a large table and then writes it to another table, I only see 30-40 MB/s during the operation. However, when there is a checkpoint I see good throughput of about 400 MB/s. NB: the table I'm writing to is in a freshly created database with pre-sized MDF/LDF files, and it has no indexes.
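
    For illustration, the operation is roughly this shape (the table and database names here are made up, not our real ones):

        -- Read from a large table and write the rows to an empty heap
        -- in a freshly created, pre-sized database (no indexes on the target)
        INSERT INTO NewDB.dbo.TargetHeap
        SELECT *
        FROM   ProdDB.dbo.LargeSourceTable;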

    I've collected SQL wait stats for the duration of the operation (see the top stats by wait_time_ms below). From what I've read, these indicate a disk bottleneck, but I know the disks are not being fully utilised!?
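
    In case it helps, this is roughly how I captured them - I reset the cumulative counters just before the run (note the reset is server-wide), ran the query, then read back the top waits:

        -- Reset the cumulative wait stats before the test run
        DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

        -- ... run the test query ...

        -- Afterwards, read back the top waits by total wait time
        SELECT TOP (15)
               wait_type,
               waiting_tasks_count,
               wait_time_ms,
               max_wait_time_ms,
               signal_wait_time_ms
        FROM   sys.dm_os_wait_stats
        ORDER BY wait_time_ms DESC;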

    My thoughts

    I wondered about a possible CPU bottleneck, but the wait stats don't indicate this and I wouldn't expect this to be a CPU-intensive operation. I don't suspect memory either, as there is no paging and Page Life Expectancy is very high.
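
    (For what it's worth, I checked PLE from the standard counter DMV rather than perfmon - just a quick sanity check:)

        -- Page Life Expectancy, in seconds, from the Buffer Manager object
        SELECT object_name, counter_name, cntr_value AS ple_seconds
        FROM   sys.dm_os_performance_counters
        WHERE  counter_name LIKE N'Page life expectancy%'
        AND    object_name  LIKE N'%Buffer Manager%';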

    We have separate LUNs presented for Data (vRAID5), Logs (vRAID1) and TempDB (vRAID1). All these LUNs reside on a single pool of disks on our SAN (i.e. they all share the same spindles), so I wondered about contention. The thing is, the behaviour is consistent and the log file writes are so uniform. During the log write, there is no other disk I/O going on... it makes me think the bottleneck is elsewhere.
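
    Something I can check from SQL Server's side is the per-file latency it records itself, to see whether any of the three LUNs is actually reporting high stalls (figures are cumulative since instance startup):

        -- Average read/write latency per database file
        SELECT DB_NAME(vfs.database_id) AS database_name,
               mf.physical_name,
               vfs.num_of_reads,
               vfs.num_of_writes,
               vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
               vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
        FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN   sys.master_files AS mf
               ON  mf.database_id = vfs.database_id
               AND mf.file_id     = vfs.file_id
        ORDER BY avg_write_ms DESC;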

    Does anyone have any suggestions? Why aren't my reads and log writes faster?

    Top wait stats taken during the query

    [font="Courier New"]wait_type_______________________waiting_tasks_count__wait_time_ms___max_wait_time_ms___signal_wait_time_ms

    WRITELOG________________________2385_________________7530___________85_________________37

    PREEMPTIVE_OS_WRITEFILEGATHER___789__________________3280___________85_________________0

    PREEMPTIVE_OS_FILEOPS___________789__________________2770___________213________________0

    SLEEP_BPOOL_FLUSH_______________1246_________________926____________30_________________20

    PAGEIOLATCH_SH__________________45___________________430____________84_________________1

    SOS_SCHEDULER_YIELD_____________13413________________331____________3__________________326

    PAGEIOLATCH_UP__________________52___________________242____________66_________________0

    PAGEIOLATCH_EX__________________35___________________219____________26_________________0

    PREEMPTIVE_OS_GETDISKFREESPACE__789__________________127____________1__________________0

    PREEMPTIVE_OS_FLUSHFILEBUFFERS__789__________________92_____________6__________________0

    WRITE_COMPLETION________________12___________________63_____________14_________________0

    LATCH_SH________________________1____________________24_____________24_________________0

    PAGELATCH_UP____________________2____________________9______________7__________________0[/font]

  • I would strongly recommend having a look at How to prove it's a SAN problem by Brent Ozar.



    Ole Kristian Velstadbråten Bangås - Virinco - Facebook - Twitter

    Concatenating Row Values in Transact-SQL

  • When you run SQLIO tests, it is important to make the test file large enough to give a true result - especially for random read and random write - so that you overcome the impact of the SAN cache.

    For example, if your database will be 50 GB, use a 50 GB SQLIO test file.
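
    As a rough sketch (the path, size and duration here are placeholders - adjust for your environment), the param file tells SQLIO where the test file lives and how big it should be, in MB. A 50 GB file is 51200 MB:

        E:\sqlio_test.dat 4 0x0 51200

    Then, for example, a 5-minute random write test with 8 KB blocks and 8 outstanding I/Os:

        sqlio -kW -s300 -frandom -o8 -b8 -LS -Fparam.txt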
