Lack of memory causing disk issues?

  • I have a sneaking suspicion that our disk stats are poor because of excessive paging due to a lack of memory. Below are some stats. Is there a way to prove that paging activity is the cause of the excessive disk activity, and if so, how do I determine how much more RAM to buy? We're running SQL 2008 R2 x64 Enterprise on Windows Server 2008 R2 x64. We have 64 GB of RAM and are a high-I/O OLTP database environment. The previous admin gave 32 GB of the RAM to SQL and left 32 GB to the OS and everything else, presumably to help with the handling of all the customer files they send us (we back up and restore customer files all day long).

    The only reason I'm not convinced already is that the disk (a RAID1 set) holding the pagefile and OS has decent stats. It's the RAID sets with the DB on them that perform poorly. Maybe there's another stat I should be looking at to help determine this? (There's also a DMV query sketched after the stats below that looks at the same question from inside SQL Server.)

    Avg. Disk sec/Transfer

    P drive (RAID10 - 14 drives) 0.014

    G drive (RAID10 - 14 drives) 0.014

    K drive (where DB is, RAID10 - 14 drives) 0.041

    F drive (where tempdb, tlog are, RAID10 - 4 drives) 0.007

    C drive (OS, pagefile, RAID1) 0.005

    Average Disk Queue Length

    P drive (RAID10 - 14 drives) 0.68

    G drive (RAID10 - 14 drives) 6.02

    K drive (where DB is, RAID10 - 14 drives) 71.22

    F drive (where tempdb, tlog are, RAID10 - 4 drives) 2.57

    C drive (OS, pagefile, RAID1) 0.04

    % Disk Time

    P drive (RAID10 - 14 drives) 67.6

    G drive (RAID10 - 14 drives) 596.0

    K drive (where DB is, RAID10 - 14 drives) 7358.7

    F drive (where tempdb, tlog are, RAID10 - 4 drives) 261.0

    C drive (OS, pagefile, RAID1) 5.6

    Other stats

    Checkpoint pages/sec 1246.618

    Page File Usage 0.739

    Page life expectancy 429.733

    Page reads/sec (SQL:BM) 1672.008

    Page Splits/sec 65.997

    Page writes/sec (SQL:BM) 1371.892

    Pages/sec 534.441

    Lazy writes/sec 27.735

    Target Server Memory (KB) 33554432

    Total Server Memory (KB) 33554432

    Total Available Memory (MBytes) 13166.208

    Virtual Memory 58881281741

    Full scans 77.165

    Batch Requests/sec 643.659

    Latch waits 1238.704
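
    One way to separate OS paging from SQL Server's own I/O is to ask SQL Server which files are accumulating stalls: Windows paging never shows up in SQL Server's file statistics, so if the K drive's latency is visible there, the load is coming from the database itself rather than the pagefile. A minimal sketch using standard DMVs (it assumes nothing beyond VIEW SERVER STATE permission):

    -- Per-file I/O latency as SQL Server sees it.
    -- If the files on K: show high average stalls here, the disk load
    -- is SQL Server's own, not Windows paging (the pagefile never
    -- appears in this DMV).
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.num_of_writes,
           1.0 * vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
           1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id = vfs.file_id
    ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;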

  • I might check on a few things here.

    1. Check the buffer cache hit ratio, and see if it is high. Below 90%, I'd definitely be thinking memory.

    2. Check page life expectancy. If it is down to a few minutes or less, pages are being tossed out of the buffer to make room for new pages. (A quick way to pull both counters from a DMV is sketched below.)
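
    For reference, both counters can be read from inside SQL Server rather than Perfmon; a minimal sketch against sys.dm_os_performance_counters (the counter names are the standard Buffer Manager ones):

    -- Buffer cache hit ratio must be computed against its base counter;
    -- page life expectancy is reported directly, in seconds.
    SELECT 100.0 * a.cntr_value / NULLIF(b.cntr_value, 0) AS buffer_cache_hit_ratio,
           c.cntr_value AS page_life_expectancy_sec
    FROM sys.dm_os_performance_counters AS a
    JOIN sys.dm_os_performance_counters AS b
      ON b.counter_name = 'Buffer cache hit ratio base'
     AND b.object_name LIKE '%Buffer Manager%'
    JOIN sys.dm_os_performance_counters AS c
      ON c.counter_name = 'Page life expectancy'
     AND c.object_name LIKE '%Buffer Manager%'
    WHERE a.counter_name = 'Buffer cache hit ratio'
      AND a.object_name LIKE '%Buffer Manager%';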

    A few links

    - http://msmvps.com/blogs/bradley/archive/2009/01/15/how-to-troubleshoot-sql-server-memory-related-issues-part-1.aspx

    - http://sqlblog.com/blogs/lara_rubbelke/archive/2008/04/18/memory-pressure-on-64-bit-sql-server-2005.aspx

    - http://www.sqlservercentral.com/articles/Performance+Tuning/analyzingmemoryrequirementsforsqlserver/2498/

  • Steve Jones - SSC Editor (1/7/2011)


    I might check on a few things here.

    1. Check the buffer cache hit ratio, and see if it is high. Below 90%, I'd definitely be thinking memory.

    2. Check page life expectancy. If it is down to a few minutes or less, pages are being tossed out of the buffer to make room for new pages.

    A few links

    - http://msmvps.com/blogs/bradley/archive/2009/01/15/how-to-troubleshoot-sql-server-memory-related-issues-part-1.aspx

    - http://sqlblog.com/blogs/lara_rubbelke/archive/2008/04/18/memory-pressure-on-64-bit-sql-server-2005.aspx

    - http://www.sqlservercentral.com/articles/Performance+Tuning/analyzingmemoryrequirementsforsqlserver/2498/

    I'm not sure I understand. Most of our servers like this average between 1,000 and 1,500 page life expectancy; some are as low as 150 and some are in the 3,000-4,000 range. We are a very high-I/O environment with lots of small bits of data, and I have to explain in technical detail to my boss why it's worth spending thousands on more RAM.

    The BCHR is pretty good (98% on average).

    Below are all the stats we're collecting besides the disk stats:

    Batch Requests/sec 598.562

    Buffer cache hit ratio 98.967

    Checkpoint pages/sec 1627.684

    Create-Open File Failure Events 32.614

    Current Connections Counter 5.425

    DeletedVersions Row Count 19871.55

    FileInfo Row Count 76621872.08

    FOREIGN KEY Events 0.258

    Full scans 108.559

    Latch waits 1343.628

    Lazy writes/sec 14.55

    LOCK Errors 1.381

    Page File Usage 0.72

    Page life expectancy 404.392

    Page reads/sec (SQL:BM) 2074.087

    Page Splits/sec 84.822

    Page writes/sec (SQL:BM) 1572.879

    Pages/sec 512.189

    Target Server Memory (KB) 33554432

    Total Available Memory (MBytes) 13456.167

    Total Server Memory (KB) 33554432

    User Connections 638.817

    VersionFileInfo Row Count 2770363000

    VersionFileLocalName Row Count 319386266.6

    Virtual Memory 58524989850

    The previous admin took the 64 GB of RAM and gave 32 to SQL; the rest is for apps, data, etc. I need to determine whether the rest really needs 32 or whether we can give it less. Any idea how to determine that? Is Available MBytes showing what's left of the OS's 32, since SQL is taking all of its 32?
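
    For what it's worth, SQL Server exposes that same "what's left for the OS" number through a DMV; a minimal sketch (sys.dm_os_sys_memory exists on 2008 and later):

    -- Physical memory as the OS reports it to SQL Server.
    -- system_memory_state_desc says whether Windows is signalling
    -- low-memory pressure.
    SELECT total_physical_memory_kb / 1024 AS total_mb,
           available_physical_memory_kb / 1024 AS available_mb,
           system_memory_state_desc
    FROM sys.dm_os_sys_memory;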

  • Well, the GB left is the GB left. As you know, SQL will grab all it can.

    So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.

    That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.

  • Your Target server memory is what SQL wants, and it matches your total. So I don't think you have memory pressure here.

    The high I/Os might just be the sign of a busy system. Perhaps you need better/more disks?

    Is the Windows pagefile seeing a lot of activity? It can be a sign that the other apps need more memory.

  • Ninja's_RGR'us (1/7/2011)


    Well, the GB left is the GB left. As you know, SQL will grab all it can.

    So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.

    That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.

    SQL can't grab that extra RAM if we give it the 32 GB limit, though, right?

    As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-I/O system with tons of small changes, is that really the case?

  • Steve Jones - SSC Editor (1/7/2011)


    Your Target server memory is what SQL wants, and it matches your total. So I don't think you have memory pressure here.

    The high I/Os might just be the sign of a busy system. Perhaps you need better/more disks?

    Is the Windows pagefile seeing a lot of activity? It can be a sign that the other apps need more memory.

    What do I look at besides those stats to see if the pagefile is active?

    Here are some more stats:

    Memory\Page Faults/sec 61372.297

    Processor(_Total)\% Processor Time 30.203

    Processor(_Total)\% Privileged Time 10.582

    Process(sqlservr)\% Processor Time 316.959

    Process(sqlservr)\Page Faults/sec 8743.258

    Process(BSC#1)\Page Faults/sec 30.032

    Process(BSC)\Page Faults/sec 29.511

    Process(BBS)\Page Faults/sec 28.466

    Process(DS)\Page Faults/sec 28.165

    Process(EDS)\Page Faults/sec 197.336

    Process(RBKVC#1)\Page Faults/sec 2878.647

    Process(RBKVC)\Page Faults/sec 2785.940

    Process(RBKDC)\Page Faults/sec 2429.607

    Process(AH)\Page Faults/sec #DIV/0!

    Process(ecs#9)\Page Faults/sec 668.772

    Process(ecs#8)\Page Faults/sec 2314.329

    Process(ecs#7)\Page Faults/sec 2232.120

    Process(ecs#6)\Page Faults/sec 300.365

    Process(ecs#5)\Page Faults/sec 948.523

    Process(ecs#4)\Page Faults/sec 84.651

    Process(ecs#3)\Page Faults/sec 2478.006

    Process(ecs#2)\Page Faults/sec 2246.302

    Process(ecs#11)\Page Faults/sec 2908.669

    Process(ecs#10)\Page Faults/sec 2559.737

    Process(ecs#1)\Page Faults/sec 2820.349

    The DB is split into filegroups, each on a different RAID set, as follows:

    high-I/O non-clustered indexes - RAID1 set

    high-I/O clustered index - RAID10 set (4 disks)

    rest of the DB - RAID1 set

    transaction logs - RAID1 set

    tempdb - RAID1 set

    OS/pagefile/SQL system DBs - RAID1 set

    All 15K SAS with 256 MB cache, on Dell PERC 6 or 7 cards.
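
    A note on those counters: Process\Page Faults/sec includes soft faults and memory-mapped file access as well as hard faults, so the file-handling processes above can inflate it even with little or no pagefile traffic; Memory\Pages Input/sec is the Perfmon counter that isolates hard faults actually going to disk. From inside SQL Server, a minimal sketch of whether the engine itself is being squeezed (sys.dm_os_process_memory exists on 2008 and later):

    -- Is SQL Server's own memory under pressure or being paged out?
    -- The two *_low flags flip to 1 when Windows squeezes the process.
    SELECT physical_memory_in_use_kb / 1024 AS sql_working_set_mb,
           memory_utilization_percentage,    -- working set as % of committed memory
           page_fault_count,                 -- cumulative since process start
           process_physical_memory_low,
           process_virtual_memory_low
    FROM sys.dm_os_process_memory;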

  • shifty1981 (1/7/2011)


    Ninja's_RGR'us (1/7/2011)


    Well, the GB left is the GB left. As you know, SQL will grab all it can.

    So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.

    That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.

    SQL can't grab that extra RAM if we give it the 32 GB limit, though, right?

    As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-I/O system with tons of small changes, is that really the case?

    Yes, you're correct: SQL can't grab more than you allow it to. But if you have 32 GB for Windows and 25 of it is left unused, then you have room to play with. However, Steve already told you that this is not your problem.

    I'm not too familiar with those stats and best practices.

    All I can tell you is that you need to compare those values with a previously taken "normal baseline".

    That way you can compare with yourself and see what's normal for you.

  • Ninja's_RGR'us (1/7/2011)


    shifty1981 (1/7/2011)


    Ninja's_RGR'us (1/7/2011)


    Well, the GB left is the GB left. As you know, SQL will grab all it can.

    So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.

    That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.

    SQL can't grab that extra RAM if we give it the 32 GB limit, though, right?

    As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-I/O system with tons of small changes, is that really the case?

    Yes, you're correct: SQL can't grab more than you allow it to. But if you have 32 GB for Windows and 25 of it is left unused, then you have room to play with. However, Steve already told you that this is not your problem.

    I'm not too familiar with those stats and best practices.

    All I can tell you is that you need to compare those values with a previously taken "normal baseline".

    That way you can compare with yourself and see what's normal for you.

    Thanks. Unfortunately for us, "normal" means "poorly performing"; that's part of the reason I was hired, so our baseline will not be what we want to shoot for. The baseline they've given me to meet is "best practice or recommended values by MSFT".

  • Page Life Expectancy is the number of seconds data is left in RAM before SQL Server needs to move it out to make room for more data.

    The Buffer Hit Ratio is how often data is found in RAM instead of having to go to disk.

    1,000 seconds is a LONG time for data to sit in RAM on a highly transactional system, and the high hit ratio means that almost all the data that's needed is in there.

    Now, that could be misleading if every query is 90% the same, and the remaining 10% is what really matters.

    I wouldn't worry, right now, about more RAM. Even if you need it, you can up the allocation to SQL Server from 32 to anything up to about 48 without impacting the OS and other applications, per your data on free memory. Allocating more of what you already have will have the same effect as buying more, and doesn't cost anything.
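
    For reference, raising the cap is a single setting change and takes effect without a restart; a minimal sketch, assuming the 48 GB (49,152 MB) figure above:

    -- Raise SQL Server's memory cap from 32 GB to 48 GB.
    -- 'show advanced options' must be on once to expose the setting.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 49152;
    RECONFIGURE;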

    I'd look more into wait stats for the server. Are they I/O related?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • shifty1981 (1/7/2011)


    Ninja's_RGR'us (1/7/2011)


    shifty1981 (1/7/2011)


    Ninja's_RGR'us (1/7/2011)


    Well, the GB left is the GB left. As you know, SQL will grab all it can.

    So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.

    That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.

    SQL can't grab that extra RAM if we give it the 32 GB limit, though, right?

    As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-I/O system with tons of small changes, is that really the case?

    Yes, you're correct: SQL can't grab more than you allow it to. But if you have 32 GB for Windows and 25 of it is left unused, then you have room to play with. However, Steve already told you that this is not your problem.

    I'm not too familiar with those stats and best practices.

    All I can tell you is that you need to compare those values with a previously taken "normal baseline".

    That way you can compare with yourself and see what's normal for you.

    Thanks. Unfortunately for us, "normal" means "poorly performing"; that's part of the reason I was hired, so our baseline will not be what we want to shoot for. The baseline they've given me to meet is "best practice or recommended values by MSFT".

    OK, I'd then go with GSquared's train of thought and check the waits. What does this return?

    WITH Waits AS
    (SELECT
         wait_type,
         wait_time_ms / 1000.0 AS WaitS,
         (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
         signal_wait_time_ms / 1000.0 AS SignalS,
         waiting_tasks_count AS WaitCount,
         100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
         ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
     FROM sys.dm_os_wait_stats
     WHERE wait_type NOT IN (
         'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
         'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
         'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
         'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
         'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
         'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP')
    )
    SELECT
        W1.wait_type AS WaitType,
        CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
        CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
        CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
        W1.WaitCount AS WaitCount,
        CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
    FROM Waits AS W1
    INNER JOIN Waits AS W2
        ON W2.RowNum <= W1.RowNum
    GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
    HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
    GO
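
    In the output, the wait types that point at disk rather than memory or CPU are the PAGEIOLATCH_* family (data-file reads into the buffer pool), WRITELOG (log flushes), and the IO_COMPLETION waits. A narrower sketch along the same lines, to see how much of the total they account for:

    -- Just the I/O-related waits.
    SELECT wait_type,
           wait_time_ms / 1000.0 AS wait_s,
           waiting_tasks_count,
           wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type LIKE 'PAGEIOLATCH%'
       OR wait_type IN ('WRITELOG', 'IO_COMPLETION', 'ASYNC_IO_COMPLETION')
    ORDER BY wait_time_ms DESC;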

  • GSquared (1/7/2011)


    Page Life Expectancy is the number of seconds data is left in RAM before SQL Server needs to move it out to make room for more data.

    The Buffer Hit Ratio is how often data is found in RAM instead of having to go to disk.

    1,000 seconds is a LONG time for data to sit in RAM on a highly transactional system, and the high hit ratio means that almost all the data that's needed is in there.

    Now, that could be misleading if every query is 90% the same, and the remaining 10% is what really matters.

    I wouldn't worry, right now, about more RAM. Even if you need it, you can up the allocation to SQL Server from 32 to anything up to about 48 without impacting the OS and other applications, per your data on free memory. Allocating more of what you already have will have the same effect as buying more, and doesn't cost anything.

    I'd look more into wait stats for the server. Are they I/O related?

    Thanks, GSquared. I'll see if we can allocate more. I'll probably need to track Available MBytes and gather from that how much we can afford to allocate without risking a bottleneck.

    As for the wait stats, can you explain specifically what stat(s) you're suggesting I track and how to determine if they are I/O related? Thanks.

  • Run the query Ninja posted. It might help to read the articles Steve linked. Start from there.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • And don't bother allocating more RAM till you've checked those things.

    If the waits are I/O related, or CPU related, then there's no point in allocating more RAM. I'm betting they're I/O related. Ninja's query will help you determine if that's the case.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
