January 7, 2011 at 7:42 am
I have a sneaking suspicion that our disk stats are poor because of excessive paging due to a lack of memory. Below are some stats. Is there a way to prove that paging activity is the cause of the excessive disk activity, and if so, how do I determine how much more RAM to buy? We're running SQL 2008 R2 x64 Enterprise on Windows Server 2008 R2 x64. We have 64 GB of RAM and are a high-I/O OLTP database environment. The previous admin gave 32 GB of the RAM to SQL and left 32 GB to the OS and everything else, presumably to help with handling all the customer files they send us (we back up and restore customer files all day long).
The only reason I'm not convinced already is that the disk (a RAID 1 set) that has the pagefile and OS on it has decent stats. It's the RAID sets with the DB on them that perform poorly. Maybe there's another stat I should be looking at to help determine this?
Average disk sec/Trans
P drive (RAID10 - 14 drives) 0.014
G drive (RAID10 - 14 drives) 0.014
K drive (where DB is, RAID10 - 14 drives) 0.041
F drive (where tempdb, tlog are, RAID10 - 4 drives) 0.007
C drive (OS, pagefile, RAID1) 0.005
Average Disk Queue Length
P drive (RAID10 - 14 drives) 0.68
G drive (RAID10 - 14 drives) 6.02
K drive (where DB is, RAID10 - 14 drives) 71.22
F drive (where tempdb, tlog are, RAID10 - 4 drives) 2.57
C drive (OS, pagefile, RAID1) 0.04
% Disk Time
P drive 67.6
G drive (RAID10 - 14 drives) 596.0
K drive (where DB is, RAID10 - 14 drives) 7358.7
F drive (where tempdb, tlog are, RAID10 - 4 drives) 261.0
C drive (OS, pagefile, RAID1) 5.6
Other stats
Checkpoint pages/sec 1246.618
Page File Usage 0.739
Page life expectancy 429.733
Page reads/sec (SQL:BM) 1672.008
Page Splits/sec 65.997
Page writes/sec (SQL:BM) 1371.892
Pages/sec 534.441
Lazy writes/sec 27.735
Target Server Memory (KB) 33554432
Total Server Memory (KB) 33554432
Total Available Memory (MBytes) 13166.208
Virtual Memory 58881281741
Full scans 77.165
Batch Requests/sec 643.659
Latch waits 1238.704
January 7, 2011 at 9:03 am
I might check on a few things here.
1. Check the buffer cache hit ratio, and see if it is high. Below 90%, I'd definitely be thinking memory.
2. Check page life expectancy. If it is only a matter of minutes, pages are being tossed out of the buffer to make room for new pages.
A few links
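If it helps, both of those counters can be pulled straight from the DMVs instead of PerfMon. A minimal sketch; the loose object-name filter is deliberate so it works for default and named instances alike:

SELECT [object_name],
       counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%'
  AND counter_name IN ('Page life expectancy',
                       'Buffer cache hit ratio',
                       'Buffer cache hit ratio base');
-- Page life expectancy is reported directly in seconds.
-- Hit ratio = 'Buffer cache hit ratio' / 'Buffer cache hit ratio base' * 100.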
January 7, 2011 at 9:37 am
Steve Jones - SSC Editor (1/7/2011)
I might check on a few things here.
1. Check the buffer cache hit ratio, and see if it is high. Below 90%, I'd definitely be thinking memory.
2. Check page life expectancy. If it is only a matter of minutes, pages are being tossed out of the buffer to make room for new pages.
A few links
I'm not sure I understand. Most of our servers like this average between 1000 and 1500 page life expectancy, some as low as 150 and some in the 3000-4000 range. We are a very high-IO environment with lots of small bits of data. I have to explain in technical detail to my boss why it's worth spending thousands on more RAM.
The buffer cache hit ratio is pretty good (98% on average).
Below are all the stats we're collecting besides the disk stats:
Batch Requests/sec 598.562
Buffer cache hit ratio 98.967
Checkpoint pages/sec 1627.684
Create-Open File Failure Events 32.614
Current Connections Counter 5.425
DeletedVersions Row Count 19871.55
FileInfo Row Count 76621872.08
FOREIGN KEY Events 0.258
Full scans 108.559
Latch waits 1343.628
Lazy writes/sec 14.55
LOCK Errors 1.381
Page File Usage 0.72
Page life expectancy 404.392
Page reads/sec (SQL:BM) 2074.087
Page Splits/sec 84.822
Page writes/sec (SQL:BM) 1572.879
Pages/sec 512.189
Target Server Memory (KB) 33554432
Total Available Memory (MBytes) 13456.167
Total Server Memory (KB) 33554432
User Connections 638.817
VersionFileInfo Row Count 2770363000
VersionFileLocalName Row Count 319386266.6
Virtual Memory 58524989850
The previous admin took the 64 GB of RAM and gave 32 to SQL; the rest is for apps and data, etc. I need to determine if the rest really needs 32 or if we can give it less. Any idea how to determine that? Is Available MBytes showing what's left of the OS's 32, since SQL is taking all of its 32?
January 7, 2011 at 9:44 am
Well the GB left is the GB left. As you know, SQL will grab all it can.
So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
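For what it's worth, here's one way to see the "GB left" from inside SQL Server rather than watching Available MBytes in PerfMon. A minimal sketch against sys.dm_os_sys_memory (2008 and later); it's a point-in-time view, not a paging rate:

SELECT total_physical_memory_kb / 1024     AS total_physical_mb,
       available_physical_memory_kb / 1024 AS available_physical_mb,
       total_page_file_kb / 1024           AS commit_limit_mb,     -- OS commit limit, not just the pagefile
       available_page_file_kb / 1024       AS available_commit_mb,
       system_memory_state_desc                                    -- e.g. 'Available physical memory is high'
FROM sys.dm_os_sys_memory;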
January 7, 2011 at 9:59 am
Your Target server memory is what SQL wants, and it matches your total. So I don't think you have memory pressure here.
The high I/Os might just be the sign of a busy system. Perhaps you need better/more disks?
Is the Windows pagefile seeing a lot of activity? It can be a sign that the other apps need more memory.
January 7, 2011 at 12:41 pm
Ninja's_RGR'us (1/7/2011)
Well the GB left is the GB left. As you know, SQL will grab all it can. So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
SQL can't grab that extra RAM if we give it the 32 GB limit though, right?
As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-IO system with tons of small changes, is that really the case?
January 7, 2011 at 12:43 pm
Steve Jones - SSC Editor (1/7/2011)
Your Target server memory is what SQL wants, and it matches your total. So I don't think you have memory pressure here.
The high I/Os might just be the sign of a busy system. Perhaps you need better/more disks?
Is the Windows pagefile seeing a lot of activity? It can be a sign that the other apps need more memory.
What do I look at besides those stats to see if the pagefile is active?
Here are some more stats:
Memory\Page Faults/sec 61372.297
Processor(_Total)\% Processor Time 30.203
Processor(_Total)\% Privileged Time 10.582
Process(sqlservr)\% Processor Time 316.959
Process(sqlservr)\Page Faults/sec 8743.258
Process(BSC#1)\Page Faults/sec 30.032
Process(BSC)\Page Faults/sec 29.511
Process(BBS)\Page Faults/sec 28.466
Process(DS)\Page Faults/sec 28.165
Process(EDS)\Page Faults/sec 197.336
Process(RBKVC#1)\Page Faults/sec 2878.647
Process(RBKVC)\Page Faults/sec 2785.940
Process(RBKDC)\Page Faults/sec 2429.607
Process(AH)\Page Faults/sec #DIV/0!
Process(ecs#9)\Page Faults/sec 668.772
Process(ecs#8)\Page Faults/sec 2314.329
Process(ecs#7)\Page Faults/sec 2232.120
Process(ecs#6)\Page Faults/sec 300.365
Process(ecs#5)\Page Faults/sec 948.523
Process(ecs#4)\Page Faults/sec 84.651
Process(ecs#3)\Page Faults/sec 2478.006
Process(ecs#2)\Page Faults/sec 2246.302
Process(ecs#11)\Page Faults/sec 2908.669
Process(ecs#10)\Page Faults/sec 2559.737
Process(ecs#1)\Page Faults/sec 2820.349
The DB is split into filegroups, each on a different RAID set, as follows:
High-IO non-clustered indexes - RAID 1 set
High-IO clustered index - RAID 10 set (4 disks)
Rest of DB - RAID 1 set
Transaction logs - RAID 1 set
tempdb - RAID 1 set
OS/pagefile/SQL system DBs - RAID 1 set
All 15K SAS with 256 MB cache, on Dell PERC 6 or 7 cards.
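Given that layout, a per-file view of read/write stalls might help show whether the latency on the DB set is coming from the data files themselves (SQL I/O) rather than OS paging, and which filegroup is the worst offender. A sketch; the figures are cumulative since the last instance restart:

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name                  AS logical_file,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads  = 0 THEN 0
            ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_stall_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_read_stall_ms DESC;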
January 7, 2011 at 12:45 pm
shifty1981 (1/7/2011)
Ninja's_RGR'us (1/7/2011)
Well the GB left is the GB left. As you know, SQL will grab all it can. So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
SQL can't grab that extra RAM if we give it the 32 GB limit though, right?
As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-IO system with tons of small changes, is that really the case?
Yes, you're correct, SQL can't grab more than you allow it to. But if you have 32 GB for Windows and 25 of it is left unused, then you have room to play with. However, Steve already told you that this is not your problem.
I'm not too familiar with those stats and best practices.
All I can tell you is that you need to compare those values with a previously taken "normal baseline".
That way you can compare with yourself and see what's normal for you.
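If no baseline exists yet, one cheap way to start one is to snapshot the counters you care about on a schedule and compare week over week. A rough sketch; the table name and counter list below are just placeholders, not anything built in:

-- Hypothetical baseline table; run the INSERT from an Agent job every few minutes.
IF OBJECT_ID('dbo.PerfCounterBaseline') IS NULL
    CREATE TABLE dbo.PerfCounterBaseline
    (
        capture_time  datetime      NOT NULL DEFAULT GETDATE(),
        [object_name] nvarchar(128) NOT NULL,
        counter_name  nvarchar(128) NOT NULL,
        cntr_value    bigint        NOT NULL
    );

INSERT INTO dbo.PerfCounterBaseline ([object_name], counter_name, cntr_value)
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Page life expectancy', 'Batch Requests/sec',
                       'Page reads/sec', 'Page writes/sec', 'Lazy writes/sec');
-- Note: the /sec counters are cumulative in this DMV, so compare deltas between captures.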
January 7, 2011 at 12:52 pm
Ninja's_RGR'us (1/7/2011)
shifty1981 (1/7/2011)
Ninja's_RGR'us (1/7/2011)
Well the GB left is the GB left. As you know, SQL will grab all it can. So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
SQL can't grab that extra RAM if we give it the 32 GB limit though, right?
As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-IO system with tons of small changes, is that really the case?
Yes, you're correct, SQL can't grab more than you allow it to. But if you have 32 GB for Windows and 25 of it is left unused, then you have room to play with. However, Steve already told you that this is not your problem.
I'm not too familiar with those stats and best practices.
All I can tell you is that you need to compare those values with a previously taken "normal baseline".
That way you can compare with yourself and see what's normal for you.
Thanks. Unfortunately for us, normal means "poorly performing". That's part of the reason I was hired, so our baseline will not be what we want to shoot for. The baseline they've given me to meet is "best practice or recommended values by MSFT".
January 7, 2011 at 12:58 pm
Page Life Expectancy is the number of seconds data is left in RAM before SQL Server needs to move it out to make room for more data.
The Buffer Hit Ratio is how often data is found in RAM instead of having to go to disk.
1,000 seconds is a LONG time for data to sit in RAM on a highly transactional system, and the high hit ratio means that almost all the data that's needed is in there.
Now, that could be misleading if every query is 90% the same, and the remaining 10% is what really matters.
I wouldn't worry, right now, about more RAM. Even if you need it, you can up the allocation to SQL Server from 32 to anything up to about 48 without impacting the OS and other applications, per your data on free memory. Allocating more of the RAM you already have will have the same effect as buying more, and doesn't cost anything.
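For what it's worth, the change itself is just the max server memory setting, and it takes effect without a restart. A sketch; 49152 MB is simply the 48 GB figure above, adjust to whatever the free-memory numbers support:

EXEC sp_configure 'show advanced options', 1;  -- make max server memory visible
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 49152;  -- 48 GB; applied dynamically
RECONFIGURE;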
I'd look more into wait stats for the server. Are they I/O related?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 7, 2011 at 1:07 pm
shifty1981 (1/7/2011)
Ninja's_RGR'us (1/7/2011)
shifty1981 (1/7/2011)
Ninja's_RGR'us (1/7/2011)
Well the GB left is the GB left. As you know, SQL will grab all it can. So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
SQL can't grab that extra RAM if we give it the 32 GB limit though, right?
As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-IO system with tons of small changes, is that really the case?
Yes, you're correct, SQL can't grab more than you allow it to. But if you have 32 GB for Windows and 25 of it is left unused, then you have room to play with. However, Steve already told you that this is not your problem.
I'm not too familiar with those stats and best practices.
All I can tell you is that you need to compare those values with a previously taken "normal baseline".
That way you can compare with yourself and see what's normal for you.
Thanks. Unfortunately for us, normal means "poorly performing". That's part of the reason I was hired, so our baseline will not be what we want to shoot for. The baseline they've given me to meet is "best practice or recommended values by MSFT".
OK, I'd then go with GSquared's train of thought and check the waits. What does this return?
WITH Waits AS
    (SELECT
        wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP')
    )
SELECT
    W1.wait_type AS WaitType,
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
    CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
FROM Waits AS W1
INNER JOIN Waits AS W2
    ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO
January 7, 2011 at 1:13 pm
GSquared (1/7/2011)
Page Life Expectancy is the number of seconds data is left in RAM before SQL Server needs to move it out to make room for more data.
The Buffer Hit Ratio is how often data is found in RAM instead of having to go to disk.
1,000 seconds is a LONG time for data to sit in RAM on a highly transactional system, and the high hit ratio means that almost all the data that's needed is in there.
Now, that could be misleading if every query is 90% the same, and the remaining 10% is what really matters.
I wouldn't worry, right now, about more RAM. Even if you need it, you can up the allocation to SQL Server from 32 to anything up to about 48 without impacting the OS and other applications, per your data on free memory. Allocating more of the RAM you already have will have the same effect as buying more, and doesn't cost anything.
I'd look more into wait stats for the server. Are they I/O related?
Thanks, GSquared. I'll see if we can allocate more. I'll probably need to track Available MBytes and gauge from that how much we can afford to allocate without risking a bottleneck.
As for the wait stats, can you explain specifically what stat(s) you're suggesting I track and how to determine if they are I/O related? Thanks.
January 7, 2011 at 1:17 pm
Run the query Ninja posted. It might help to read the articles Steve linked. Start from there.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 7, 2011 at 1:18 pm
And don't bother allocating more RAM till you've checked those things.
If the waits are I/O related, or CPU related, then there's no point in allocating more RAM. I'm betting they're I/O related. Ninja's query will help you determine if that's the case.
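As a rough yardstick (a sketch, not a definitive list), the wait types usually treated as I/O-related are the PAGEIOLATCH_* latches, WRITELOG, and the IO_COMPLETION family. Something like this shows how much of the total wait time they account for:

SELECT wait_type,
       wait_time_ms,
       waiting_tasks_count,
       100.0 * wait_time_ms
           / (SELECT SUM(wait_time_ms) FROM sys.dm_os_wait_stats) AS pct_of_all_waits
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%'
   OR wait_type IN ('WRITELOG', 'IO_COMPLETION', 'ASYNC_IO_COMPLETION')
ORDER BY wait_time_ms DESC;
-- Note: the denominator includes benign waits (sleeps, queues), so this understates the
-- share these have of the "real" waits that Ninja's query isolates.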
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon