January 7, 2011 at 7:42 am
I have a sneaking suspicion that our disk stats are poor because of excessive paging due to a lack of memory. Below are some stats. Is there a way to prove that paging activity is the cause of the excessive disk activity, and if so, how do I determine how much more RAM to buy? We're running SQL 2008 R2 x64 Enterprise on Windows Server 2008 R2 x64. We have 64 GB of RAM and are a high-I/O OLTP database environment. The previous admin gave 32 GB of the RAM to SQL and left 32 GB to the OS and everything else, presumably to help with handling all the customer files they send us (we back up and restore customer files all day long).
The only reason I'm not convinced already is that the disk (a RAID 1 set) that has the pagefile and OS on it has decent stats. It's the RAID sets with the DB on them that perform poorly. Maybe there's another stat I should be looking at to help determine this?
Average disk sec/Trans
P drive (RAID10 - 14 drives) 0.014
G drive (RAID10 - 14 drives) 0.014
K drive (where DB is, RAID10 - 14 drives) 0.041
F drive (where tempdb, tlog are, RAID10 - 4 drives) 0.007
C drive (OS, pagefile, RAID1) 0.005
Average Disk Queue Length
P drive (RAID10 - 14 drives) 0.68
G drive (RAID10 - 14 drives) 6.02
K drive (where DB is, RAID10 - 14 drives) 71.22
F drive (where tempdb, tlog are, RAID10 - 4 drives) 2.57
C drive (OS, pagefile, RAID1) 0.04
% Disk Time
P drive 67.6
G drive (RAID10 - 14 drives) 596.0
K drive (where DB is, RAID10 - 14 drives) 7358.7
F drive (where tempdb, tlog are, RAID10 - 4 drives) 261.0
C drive (OS, pagefile, RAID1) 5.6
Other stats
Checkpoint pages/sec 1246.618
Page File Usage 0.739
Page life expectancy 429.733
Page reads/sec (SQL:BM) 1672.008
Page Splits/sec 65.997
Page writes/sec (SQL:BM) 1371.892
Pages/sec 534.441
Lazy writes/sec 27.735
Target Server Memory (KB) 33554432
Total Server Memory (KB) 33554432
Total Available Memory (MBytes) 13166.208
Virtual Memory 58881281741
Full scans 77.165
Batch Requests/sec 643.659
Latch waits 1238.704
January 7, 2011 at 9:03 am
I might check on a few things here.
1. Check the buffer cache hit ratio, and see if it is high. Below 90%, I'd definitely be thinking memory.
2. Check page life expectancy. If it is only a matter of minutes, pages are being tossed out of the buffer to make room for new pages.
A few links
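If it helps, both of those counters can be pulled straight from the DMVs instead of PerfMon. A minimal sketch; the loose object-name filter is deliberate so it works for default and named instances alike:

SELECT [object_name],
       counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%'
  AND counter_name IN ('Page life expectancy',
                       'Buffer cache hit ratio',
                       'Buffer cache hit ratio base');
-- Page life expectancy is reported directly in seconds.
-- Hit ratio = 'Buffer cache hit ratio' / 'Buffer cache hit ratio base' * 100.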
January 7, 2011 at 9:37 am
Steve Jones - SSC Editor (1/7/2011)
I might check on a few things here.
1. Check the buffer cache hit ratio, and see if it is high. Below 90%, I'd definitely be thinking memory.
2. Check page life expectancy. If it is only a matter of minutes, pages are being tossed out of the buffer to make room for new pages.
A few links
I'm not sure I understand. Most of our servers like this average between 1000 and 1500 page life expectancy, some as low as 150 and some in the 3000-4000 range. We are a very high-IO environment with lots of small bits of data. I have to explain in technical detail to my boss why it's worth spending thousands on more RAM.
The buffer cache hit ratio is pretty good (98% on average).
Below are all the stats we're collecting besides the disk stats:
Batch Requests/sec 598.562
Buffer cache hit ratio 98.967
Checkpoint pages/sec 1627.684
Create-Open File Failure Events 32.614
Current Connections Counter 5.425
DeletedVersions Row Count 19871.55
FileInfo Row Count 76621872.08
FOREIGN KEY Events 0.258
Full scans 108.559
Latch waits 1343.628
Lazy writes/sec 14.55
LOCK Errors 1.381
Page File Usage 0.72
Page life expectancy 404.392
Page reads/sec (SQL:BM) 2074.087
Page Splits/sec 84.822
Page writes/sec (SQL:BM) 1572.879
Pages/sec 512.189
Target Server Memory (KB) 33554432
Total Available Memory (MBytes) 13456.167
Total Server Memory (KB) 33554432
User Connections 638.817
VersionFileInfo Row Count 2770363000
VersionFileLocalName Row Count 319386266.6
Virtual Memory 58524989850
The previous admin took the 64 GB of RAM and gave 32 to SQL; the rest is for apps and data, etc. I need to determine if the rest really needs 32 or if we can give it less. Any idea how to determine that? Is Available MBytes showing what's left of the OS's 32, since SQL is taking all of its 32?
January 7, 2011 at 9:44 am
Well the GB left is the GB left. As you know, SQL will grab all it can.
So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
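For what it's worth, here's one way to see the "GB left" from inside SQL Server rather than watching Available MBytes in PerfMon. A minimal sketch against sys.dm_os_sys_memory (2008 and later); it's a point-in-time view, not a paging rate:

SELECT total_physical_memory_kb / 1024     AS total_physical_mb,
       available_physical_memory_kb / 1024 AS available_physical_mb,
       total_page_file_kb / 1024           AS commit_limit_mb,     -- OS commit limit, not just the pagefile
       available_page_file_kb / 1024       AS available_commit_mb,
       system_memory_state_desc                                    -- e.g. 'Available physical memory is high'
FROM sys.dm_os_sys_memory;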
January 7, 2011 at 9:59 am
Your Target server memory is what SQL wants, and it matches your total. So I don't think you have memory pressure here.
The high I/Os might just be the sign of a busy system. Perhaps you need better/more disks?
Is the Windows pagefile seeing a lot of activity? It can be a sign that the other apps need more memory.
January 7, 2011 at 12:41 pm
Ninja's_RGR'us (1/7/2011)
Well the GB left is the GB left. As you know, SQL will grab all it can. So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
SQL can't grab that extra RAM if we give it the 32 GB limit though, right?
As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-IO system with tons of small changes, is that really the case?
January 7, 2011 at 12:43 pm
Steve Jones - SSC Editor (1/7/2011)
Your Target server memory is what SQL wants, and it matches your total. So I don't think you have memory pressure here.
The high I/Os might just be the sign of a busy system. Perhaps you need better/more disks?
Is the Windows pagefile seeing a lot of activity? It can be a sign that the other apps need more memory.
What do I look at besides those stats to see if the pagefile is active?
Here are some more stats:
Memory\Page Faults/sec 61372.297
Processor(_Total)\% Processor Time 30.203
Processor(_Total)\% Privileged Time 10.582
Process(sqlservr)\% Processor Time 316.959
Process(sqlservr)\Page Faults/sec 8743.258
Process(BSC#1)\Page Faults/sec 30.032
Process(BSC)\Page Faults/sec 29.511
Process(BBS)\Page Faults/sec 28.466
Process(DS)\Page Faults/sec 28.165
Process(EDS)\Page Faults/sec 197.336
Process(RBKVC#1)\Page Faults/sec 2878.647
Process(RBKVC)\Page Faults/sec 2785.940
Process(RBKDC)\Page Faults/sec 2429.607
Process(AH)\Page Faults/sec #DIV/0!
Process(ecs#9)\Page Faults/sec 668.772
Process(ecs#8)\Page Faults/sec 2314.329
Process(ecs#7)\Page Faults/sec 2232.120
Process(ecs#6)\Page Faults/sec 300.365
Process(ecs#5)\Page Faults/sec 948.523
Process(ecs#4)\Page Faults/sec 84.651
Process(ecs#3)\Page Faults/sec 2478.006
Process(ecs#2)\Page Faults/sec 2246.302
Process(ecs#11)\Page Faults/sec 2908.669
Process(ecs#10)\Page Faults/sec 2559.737
Process(ecs#1)\Page Faults/sec 2820.349
The DB is split into filegroups, each on a different RAID set, as follows:
High-IO non-clustered indexes - RAID 1 set
High-IO clustered index - RAID 10 set (4 disks)
Rest of DB - RAID 1 set
Transaction logs - RAID 1 set
tempdb - RAID 1 set
OS/pagefile/SQL system DBs - RAID 1 set
All 15K SAS with 256 MB cache, on Dell PERC 6 or 7 cards.
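Given that layout, a per-file view of read/write stalls might help show whether the latency on the DB set is coming from the data files themselves (SQL I/O) rather than OS paging, and which filegroup is the worst offender. A sketch; the figures are cumulative since the last instance restart:

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name                  AS logical_file,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads  = 0 THEN 0
            ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_stall_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_read_stall_ms DESC;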
January 7, 2011 at 12:45 pm
shifty1981 (1/7/2011)
Ninja's_RGR'us (1/7/2011)
Well the GB left is the GB left. As you know, SQL will grab all it can. So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
SQL can't grab that extra RAM if we give it the 32 GB limit though, right?
As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-IO system with tons of small changes, is that really the case?
Yes, you're correct, SQL can't grab more than you allow it to. But if you have 32 GB for Windows and 25 of it is left unused, then you have room to play with. However, Steve already told you that this is not your problem.
I'm not too familiar with those stats and best practices.
All I can tell you is that you need to compare those values with a previously taken "normal baseline".
That way you can compare with yourself and see what's normal for you.
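If no baseline exists yet, one cheap way to start one is to snapshot the counters you care about on a schedule and compare week over week. A rough sketch; the table name and counter list below are just placeholders, not anything built in:

-- Hypothetical baseline table; run the INSERT from an Agent job every few minutes.
IF OBJECT_ID('dbo.PerfCounterBaseline') IS NULL
    CREATE TABLE dbo.PerfCounterBaseline
    (
        capture_time  datetime      NOT NULL DEFAULT GETDATE(),
        [object_name] nvarchar(128) NOT NULL,
        counter_name  nvarchar(128) NOT NULL,
        cntr_value    bigint        NOT NULL
    );

INSERT INTO dbo.PerfCounterBaseline ([object_name], counter_name, cntr_value)
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Page life expectancy', 'Batch Requests/sec',
                       'Page reads/sec', 'Page writes/sec', 'Lazy writes/sec');
-- Note: the /sec counters are cumulative in this DMV, so compare deltas between captures.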
January 7, 2011 at 12:52 pm
Ninja's_RGR'us (1/7/2011)
shifty1981 (1/7/2011)
Ninja's_RGR'us (1/7/2011)
Well the GB left is the GB left. As you know, SQL will grab all it can. So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
SQL can't grab that extra RAM if we give it the 32 GB limit though, right?
As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-IO system with tons of small changes, is that really the case?
Yes, you're correct, SQL can't grab more than you allow it to. But if you have 32 GB for Windows and 25 of it is left unused, then you have room to play with. However, Steve already told you that this is not your problem.
I'm not too familiar with those stats and best practices.
All I can tell you is that you need to compare those values with a previously taken "normal baseline".
That way you can compare with yourself and see what's normal for you.
Thanks. Unfortunately for us, normal means "poorly performing". That's part of the reason I was hired, so our baseline will not be what we want to shoot for. The baseline they've given me to meet is "best practice or recommended values by MSFT".
January 7, 2011 at 12:58 pm
Page Life Expectancy is the number of seconds data is left in RAM before SQL Server needs to move it out to make room for more data.
The Buffer Hit Ratio is how often data is found in RAM instead of having to go to disk.
1,000 seconds is a LONG time for data to sit in RAM on a highly transactional system, and the high hit ratio means that almost all the data that's needed is in there.
Now, that could be misleading if every query is 90% the same, and the remaining 10% is what really matters.
I wouldn't worry, right now, about more RAM. Even if you need it, you can up the allocation to SQL Server from 32 to anything up to about 48 without impacting the OS and other applications, per your data on free memory. Allocating more of the RAM you already have will have the same effect as buying more, and doesn't cost anything.
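For what it's worth, the change itself is just the max server memory setting, and it takes effect without a restart. A sketch; 49152 MB is simply the 48 GB figure above, adjust to whatever the free-memory numbers support:

EXEC sp_configure 'show advanced options', 1;  -- make max server memory visible
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 49152;  -- 48 GB; applied dynamically
RECONFIGURE;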
I'd look more into wait stats for the server. Are they I/O related?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 7, 2011 at 1:07 pm
shifty1981 (1/7/2011)
Ninja's_RGR'us (1/7/2011)
shifty1981 (1/7/2011)
Ninja's_RGR'us (1/7/2011)
Well the GB left is the GB left. As you know, SQL will grab all it can. So if you have plenty of RAM left after the server's been up for a few days or weeks, then it seems pretty safe to give more RAM to SQL Server.
That being said, a 98% cache hit ratio is not all that bad with over an hour of page life expectancy.
SQL can't grab that extra RAM if we give it the 32 GB limit though, right?
As for BCHR and PLE, how do I compare them and make a sound judgement? I've heard people tell me those stats are bad, that PLE should be closer to tens of thousands. But for a high-IO system with tons of small changes, is that really the case?
Yes, you're correct, SQL can't grab more than you allow it to. But if you have 32 GB for Windows and 25 of it is left unused, then you have room to play with. However, Steve already told you that this is not your problem.
I'm not too familiar with those stats and best practices.
All I can tell you is that you need to compare those values with a previously taken "normal baseline".
That way you can compare with yourself and see what's normal for you.
Thanks. Unfortunately for us, normal means "poorly performing". That's part of the reason I was hired, so our baseline will not be what we want to shoot for. The baseline they've given me to meet is "best practice or recommended values by MSFT".
OK, I'd then go with GSquared's train of thought and check the waits. What does this return?
WITH Waits AS
    (SELECT
        wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP')
    )
SELECT
    W1.wait_type AS WaitType,
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
    CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
FROM Waits AS W1
INNER JOIN Waits AS W2
    ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO
January 7, 2011 at 1:13 pm
GSquared (1/7/2011)
Page Life Expectancy is the number of seconds data is left in RAM before SQL Server needs to move it out to make room for more data.
The Buffer Hit Ratio is how often data is found in RAM instead of having to go to disk.
1,000 seconds is a LONG time for data to sit in RAM on a highly transactional system, and the high hit ratio means that almost all the data that's needed is in there.
Now, that could be misleading if every query is 90% the same, and the remaining 10% is what really matters.
I wouldn't worry, right now, about more RAM. Even if you need it, you can up the allocation to SQL Server from 32 to anything up to about 48 without impacting the OS and other applications, per your data on free memory. Allocating more of the RAM you already have will have the same effect as buying more, and doesn't cost anything.
I'd look more into wait stats for the server. Are they I/O related?
Thanks, GSquared. I'll see if we can allocate more. I'll probably need to track Available MBytes and gauge from that how much we can afford to allocate without risking a bottleneck.
As for the wait stats, can you explain specifically what stat(s) you're suggesting I track and how to determine if they are I/O related? Thanks.
January 7, 2011 at 1:17 pm
Run the query Ninja posted. It might help to read the articles Steve linked. Start from there.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 7, 2011 at 1:18 pm
And don't bother allocating more RAM till you've checked those things.
If the waits are I/O related, or CPU related, then there's no point in allocating more RAM. I'm betting they're I/O related. Ninja's query will help you determine if that's the case.
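As a rough yardstick (a sketch, not a definitive list), the wait types usually treated as I/O-related are the PAGEIOLATCH_* latches, WRITELOG, and the IO_COMPLETION family. Something like this shows how much of the total wait time they account for:

SELECT wait_type,
       wait_time_ms,
       waiting_tasks_count,
       100.0 * wait_time_ms
           / (SELECT SUM(wait_time_ms) FROM sys.dm_os_wait_stats) AS pct_of_all_waits
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%'
   OR wait_type IN ('WRITELOG', 'IO_COMPLETION', 'ASYNC_IO_COMPLETION')
ORDER BY wait_time_ms DESC;
-- Note: the denominator includes benign waits (sleeps, queues), so this understates the
-- share these have of the "real" waits that Ninja's query isolates.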
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon