Disk IO Performance Degredation after Server Upgrade

  • We recently upgrade our SQL server to new hardware, and our Storage to a new SAN

    Server: HP 580 with 4 hex-core CPU's and 128 GB RAM (112 GB Available).

    SAN: HP LeftHand SAN 3-node cluster

    The Server and SAN communicate via iSCSI.

    There are 2 databases hosted on the server.

    DB1: 650GB

    DB2: 700GB

    Each database has multiple data files in multiple filegroups on multple SAN volumes.

    Both databases are using SQL Transparent Data Encryption. The issues existed BEFORE TDE was implemented.

    We are seeing very long IO times when performing disk operations, such as filegrowth. In one instance it took 5 minutes to add 10 GB to a data file. This caused applicaiton timeouts and downtime until I could manually extend the file. (In actuality I extended the file, disabled autogrowth, and added a new file)

    Today we are seeing disk opperation alerts ( 12 ms Reads, 13 ms Writes, etc.) According to my NetOps team, there is nothing wrong with the storage, server, or network configuration.

    Admitedly, there is high memory utilization on the server right now. > 95% OS memory. > 90% SQL Memory. A reboot is in order.

    I need to tap into the configuration expertise in the group. Where in SQL Server, Windows, Network, or Storage configuration can I look for potential issues?

    Thanks,

    Greg


    Greg Roberts

  • what upgrade strategy you use, side-by-side or in-place?


    [font="Times New Roman"]rfr.ferrari[/font]
    DBA - SQL Server 2008
    MCITP | MCTS

    remember is live or suffer twice!
    the period you fastest growing is the most difficult period of your life!
  • side by side.

    we installed new hardware with a clean os and SQL server install. Restored a previous backup to the server and ran load tests. Everything appeared OK. After the Load tests we stopped SQL Server on the original server and moved the database files to the new hardware. Same volume and disk configuration as we conducted load tests on.

    Original Server: HP 380 Performance

    2x 4 Core Xeon CPU

    64 GB RAM

    2x 10/100/1000 NIC (Teamed @ 1GB each)

    New Server: HP 580 G6 Performance

    4x 6 Core Xeon CPU

    128 GB RAM (112 usable)

    6 x 10/100/1000 NICs

    - 2x @ 1 GB public netowkr

    - 4x @ 1 GB storage network


    Greg Roberts

  • what has changed on the new SAN....are you running the same number of spindles.....more hopefully?

    what about cache on the SAN, is set to favor READ or WRITE? and that number diff then before

    and did the CACHE on the new SAN increase or decrease

    and what about disk speeds increase or decrease

    and last but not least did you run SQLIO on the new SAN before you turned it on? this should have told you what the SAN was capable off.

  • 456789psw (6/20/2011)


    what has changed on the new SAN....are you running the same number of spindles.....more hopefully?

    Went from 2-node to 3-node cluster

    what about cache on the SAN, is set to favor READ or WRITE? and that number diff then before

    don't know

    and did the CACHE on the new SAN increase or decrease

    increase

    and what about disk speeds increase or decrease

    increase

    and last but not least did you run SQLIO on the new SAN before you turned it on? this should have told you what the SAN was capable off.

    no. did not have time to run SQLIO


    Greg Roberts

  • Greg,

    sorry I missed the 12 and 13 ms part. net team is right should have any issues there.

    what was the auto growth set to grow at? if it was a low number and you asked for a big growth then that means it had to stop each time before it grew again..makes things very slow!!!

    are you having any other issues that are slow? if so, Have you check your wait types best place to start.

    as far as memory, not sure what you mean both are using in the 90 percent tile? I would check your SQL Server: Buffer Manager: Page Life Expectancy make sure you are at least above 300....which is 5 min

  • What OS are you on ? ( + sp/hf please )

    What's the version info of your sqlinstance ?

    Can you elaborate on the topology of your lefthand solution ?

    Are the luns of your sqlinstance placed on disks that also host other luns ?

    Was there a change in raid level definitions of your luns ?

    By enabling DTE on a system that already suffered IO, chances are you've made it worse.

    Did you enable instant file initialisation ( windows level grant ! )

    You jumped into a dark pit by not running SQLIO.

    That thing is produced to predict and prevent IO problems before it hurts you !

    IMO Lefthand is more advisable for file system solutions.

    Top performance db systems shouldn't be placed on it, but it is a sales argument !

    Of course it can host db systems, but how can you control it and make it performing well enough.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • First of all check the SAN Cache and then RAID Groups,If All disk on the same RAID Group that might be a problem

    Page-Life-Expect..

    as asked by the @Old Hand

    For Disk Delays

    ===============

    select DB_NAME(database_id) DB_NAME, di.file_id,df.name,io_stall_read_ms ,num_of_reads

    ,cast(io_stall_read_ms/(1.0+num_of_reads) as numeric(10,1)) as 'avg_read_stall_ms'

    ,io_stall_write_ms,num_of_writes

    ,cast(io_stall_write_ms/(1.0+num_of_writes) as numeric(10,1)) as 'avg_write_stall_ms'

    ,io_stall_read_ms + io_stall_write_ms as io_stalls

    ,num_of_reads + num_of_writes as total_io

    ,cast((io_stall_read_ms+io_stall_write_ms)/(1.0+num_of_reads + num_of_writes) as numeric(10,1)) as 'avg_io_stall_ms'

    from sys.dm_io_virtual_file_stats(null,null) di inner join sys.database_files df on df.file_id = di.file_id

    where DB_NAME(database_id) = 'your database name'

    order by avg_io_stall_ms desc

    IO pendings

    ===========

    select

    database_id,

    file_id,

    io_stall,

    io_pending_ms_ticks,

    scheduler_address

    from sys.dm_io_virtual_file_stats(NULL, NULL)t1,

    sys.dm_io_pending_io_requests as t2

    where t1.file_handle = t2.io_handle

    For Memory Presure

    ==================

    select * from sys.dm_os_performance_counters

    where counter_name like 'page life%'

    For Queries Utilizations

    ========================

    SELECT ST.TEXT,SP.SPID,WAITTIME,LASTWAITTYPE,CPU,PHYSICAL_IO,STATUS,HOSTNAME,PROGRAM_NAME,CMD,LOGINAME FROM SYS.SYSPROCESSES SP

    CROSS APPLY SYS.DM_EXEC_SQL_TEXT(SP.SQL_HANDLE) ST

    WHERE STATUS !='SLEEPING'

    ORDER BY CPU DESC

    For CPU Usage

    =============

    select

    scheduler_id,

    current_tasks_count,

    runnable_tasks_count

    from

    sys.dm_os_schedulers

    where

    scheduler_id < 255

    For System Info

    =============

    Select * from sys.dm_os_sys_info

    provide these queries result

    Regards,
    Syed Jahanzaib Bin Hassan
    BSCS | MCTS | MCITP | OCA | OCP | OCE | SCJP | IBMCDBA

    My Blog
    www.aureus-salah.com

  • The Server and SAN communicate via iSCSI.

    Greg,

    before change of servers, you communicate using iSCSI?


    [font="Times New Roman"]rfr.ferrari[/font]
    DBA - SQL Server 2008
    MCITP | MCTS

    remember is live or suffer twice!
    the period you fastest growing is the most difficult period of your life!
  • please supply more details of the iSCSI setup, hardware or software?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I agree about not running SQLIO. Unfortunately the SAN did not arrive until the day before deployment, and the business refused to let us delay to properly configure and test.


    Greg Roberts

  • Greg, is it just the file growth that is giving you issue?

  • The SAN is dedicated to this SQL Server. There are no other servers accessing LUNs on the SAN.

    As far as SAN config is concerned, I'm a bit in the dark there. I know that it is a 3-node cluster. I don't recall how the array(s) are configured that are hosting the LUNs. My SAN Admin is on vacation,so I can't find out right now.

    Another issue I am seeing, but not sure it really is an issue, is the Average disk milliseconds per Read/Write/ or Transfer is hovering in the 10 - 20 range on all SAN volumes.

    Regarding the previous server, yes it was running iSCSI. Not my favorite, but not budget for a good fibre channel.

    I'll dig into the WAIT stats a bit more and post what I find. Thank you all for your assitance.


    Greg Roberts

  • 456789psw (6/21/2011)


    Greg, is it just the file growth that is giving you issue?

    That was the big smashed thumb. We have also noticed a degradation in our backup times, even before TDE. We saw thruput drop from an average of 115MB/sec to 100MB/sec. But that only equated to 15 extra minutes to the backup job. I'm pretty convinced there is a SAN config issue, but our storage admin instists it is working correctly. And when he runs LAN Sped Test from the DB server to the SAN volumes, performance is good.

    Regarding the growth settings, the files are set to grow at 5GB. This reduces the number of growth events to approximately 1 / week.


    Greg Roberts

  • I would agree pointing more and more to the SAN...he may be right it is working correctly...but it may not be working optimally.( it could be set to run faster or maybe there is less Drives and more RPM resulting in less through put)

    shoot us out the Wait stats...maybe that will help

Viewing 15 posts - 1 through 15 (of 35 total)

You must be logged in to reply to this topic. Login to reply