January 22, 2013 at 4:07 am
On a production system we have a PLE which is low at some moments.
So since yesterday afternoon we have run a monitoring batch registering the PLE every two minutes and the cache size for each database.
Most cache is used by only one database.
Backup does not influence the PLE.
CheckDB clears the complete cache.
CheckDB is started at 04:00.
The lowest size of the cache we see for the specific databas is
just over 7000 Mb.
The lowest PLE we see is 7.
My conclusion is that 7 seconds is the average PLE. So the complete cache been replaces in the last 14 seconds. So that would be 0.5 Gb per second. (7 Gb in 14 seconds gives an average PLE of 7 seconds)?
This seems a lot to me, is my reasoning in error?
CheckDB effectively destroys the cache. (Why?).
Another thing I saw.
(delta PLE) / (delta time) is often 0 (zero), 0.5 or 1
Why is this ?
The number 1 is logical, during a delta time, the ple can gain the same time, not more.
I cannot explain the 0.5 and the 0. I see periods of 25 minutes where the PLE is gaining 1 minute every two minutes.
Outside the checkDB timeframe the larges drops I see in the PLE is from 6952 to 3830 in two minutes, from 4103 to 1547 in two minutes and from 3650 to 2058 in two minutes. (Cache size around 8000 Mb).
Can I calculate how much MB is read in the two minutes from this ?
(My estimate is that abouth half of the 8 Gb must have been read in those 2 minutes to between the drop of PLE. Large question to me).
Before and after the drop the PLE often did not change at all for some timeperiods. I do not understand this behavior.
(PLE long time no change, significant drop, then again long time no change).
Any suggestions how to catch the processes which cause the drop in PLE ?
Today I will be monitoring, now a PLE is registered every 30 seconds.
Yesterday during the day the PLE was as low as about 200. (no timing, no registering of that). Today we hope to catch the situation and the cullprit.
Any advise is welcome.
Thanks for your time and attention.
Ben Brugman
January 22, 2013 at 4:42 am
ben.brugman (1/22/2013)
Any suggestions how to catch the processes which cause the drop in PLE ?
it means you have buffer cache issue and IO contentions
see these links
http://sqlblog.com/blogs/buck_woody/archive/2010/06/29/the-windows-page-file-and-sql-server.aspx
http://support.microsoft.com/kb/889654/en-us
and take help from these
/**********************************************************
* top procedures memory consumption per execution
* (this will show mostly reports & jobs)
***********************************************************/
SELECT TOP 100 *
FROM
(
SELECT
DatabaseName = DB_NAME(qt.dbid)
,ObjectName = OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.objectid, qt.dbid)
,DiskReads = SUM(qs.total_physical_reads) -- The worst reads, disk reads
,MemoryReads = SUM(qs.total_logical_reads) --Logical Reads are memory reads
,Executions = SUM(qs.execution_count)
,IO_Per_Execution = SUM((qs.total_physical_reads + qs.total_logical_reads) / qs.execution_count)
,CPUTime = SUM(qs.total_worker_time)
,DiskWaitAndCPUTime = SUM(qs.total_elapsed_time)
,MemoryWrites = SUM(qs.max_logical_writes)
,DateLastExecuted = MAX(qs.last_execution_time)
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
GROUP BY DB_NAME(qt.dbid), OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.objectid, qt.dbid)
) T
ORDER BY IO_Per_Execution DESC
/**********************************************************
* top procedures memory consumption total
* (this will show more operational procedures)
***********************************************************/
SELECT TOP 100 *
FROM
(
SELECT
DatabaseName = DB_NAME(qt.dbid)
,ObjectName = OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.objectid, qt.dbid)
,DiskReads = SUM(qs.total_physical_reads) -- The worst reads, disk reads
,MemoryReads = SUM(qs.total_logical_reads) --Logical Reads are memory reads
,Total_IO_Reads = SUM(qs.total_physical_reads + qs.total_logical_reads)
,Executions = SUM(qs.execution_count)
,IO_Per_Execution = SUM((qs.total_physical_reads + qs.total_logical_reads) / qs.execution_count)
,CPUTime = SUM(qs.total_worker_time)
,DiskWaitAndCPUTime = SUM(qs.total_elapsed_time)
,MemoryWrites = SUM(qs.max_logical_writes)
,DateLastExecuted = MAX(qs.last_execution_time)
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
GROUP BY DB_NAME(qt.dbid), OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.objectid, qt.dbid)
) T
ORDER BY Total_IO_Reads DESC
/**********************************************************
* top adhoc queries memory consumption total
***********************************************************/
SELECT TOP 100 *
FROM
(
SELECT
DatabaseName = DB_NAME(qt.dbid)
,QueryText = qt.text
,DiskReads = SUM(qs.total_physical_reads) -- The worst reads, disk reads
,MemoryReads = SUM(qs.total_logical_reads) --Logical Reads are memory reads
,Total_IO_Reads = SUM(qs.total_physical_reads + qs.total_logical_reads)
,Executions = SUM(qs.execution_count)
,IO_Per_Execution = SUM((qs.total_physical_reads + qs.total_logical_reads) / qs.execution_count)
,CPUTime = SUM(qs.total_worker_time)
,DiskWaitAndCPUTime = SUM(qs.total_elapsed_time)
,MemoryWrites = SUM(qs.max_logical_writes)
,DateLastExecuted = MAX(qs.last_execution_time)
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.objectid, qt.dbid) IS NULL
GROUP BY DB_NAME(qt.dbid), qt.text, OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.objectid, qt.dbid)
) T
ORDER BY Total_IO_Reads DESC
/**********************************************************
* top adhoc queries memory consumption per execution
***********************************************************/
SELECT TOP 100 *
FROM
(
SELECT
DatabaseName = DB_NAME(qt.dbid)
,QueryText = qt.text
,DiskReads = SUM(qs.total_physical_reads) -- The worst reads, disk reads
,MemoryReads = SUM(qs.total_logical_reads) --Logical Reads are memory reads
,Total_IO_Reads = SUM(qs.total_physical_reads + qs.total_logical_reads)
,Executions = SUM(qs.execution_count)
,IO_Per_Execution = SUM((qs.total_physical_reads + qs.total_logical_reads) / qs.execution_count)
,CPUTime = SUM(qs.total_worker_time)
,DiskWaitAndCPUTime = SUM(qs.total_elapsed_time)
,MemoryWrites = SUM(qs.max_logical_writes)
,DateLastExecuted = MAX(qs.last_execution_time)
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.objectid, qt.dbid) IS NULL
GROUP BY DB_NAME(qt.dbid), qt.text, OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.objectid, qt.dbid)
) T
ORDER BY IO_Per_Execution DESC
AND
--Bpool stats
select (bpool_committed * 8192)/ (1024*1024) as bpool_committed_mb
, (cast(bpool_commit_target as bigint) * 8192) / (1024*1024) as bpool_target_mb,
(bpool_visible * 8192) / (1024*1024) as bpool_visible_mb
from sys.dm_os_sys_info
go
-- Get me physical RAM installed
-- and size of user VAS
select physical_memory_in_bytes/(1024*1024) as phys_mem_mb,
virtual_memory_in_bytes/(1024*1024) as user_virtual_address_space_size
from sys.dm_os_sys_info
go
--
-- Get me other information about system memory
--
select total_physical_memory_kb/(1024) as phys_mem_mb,
available_physical_memory_kb/(1024) as avail_phys_mem_mb,
system_cache_kb/(1024) as sys_cache_mb,
(kernel_paged_pool_kb+kernel_nonpaged_pool_kb)/(1024) as kernel_pool_mb,
total_page_file_kb/(1024) as total_virtual_memory_mb,
available_page_file_kb/(1024) as available_virtual_memory_mb,
system_memory_state_desc
from sys.dm_os_sys_memory
go
-- Get me memory information about SQLSERVR.EXE process
-- GetMemoryProcessInfo() API used for this
-- physical_memory_in_use_kb
select physical_memory_in_use_kb/(1024) as sql_physmem_inuse_mb,
locked_page_allocations_kb/(1024) as awe_memory_mb,
total_virtual_address_space_kb/(1024) as max_vas_mb,
virtual_address_space_committed_kb/(1024) as sql_committed_mb,
memory_utilization_percentage as working_set_percentage,
virtual_address_space_available_kb/(1024) as vas_available_mb,
process_physical_memory_low as is_there_external_pressure,
process_virtual_memory_low as is_there_vas_pressure
from sys.dm_os_process_memory
go
select * from sys.dm_os_ring_buffers
where ring_buffer_type like 'RING_BUFFER_RESOURCE%'
go
select memory_node_id as node, virtual_address_space_reserved_kb/(1024) as VAS_reserved_mb,
virtual_address_space_committed_kb/(1024) as virtual_committed_mb,
locked_page_allocations_kb/(1024) as locked_pages_mb,
single_pages_kb/(1024) as single_pages_mb,
multi_pages_kb/(1024) as multi_pages_mb,
shared_memory_committed_kb/(1024) as shared_memory_mb
from sys.dm_os_memory_nodes
where memory_node_id != 64
go
with vasummary(Size,reserved,free) as ( select size = vadump.size,
reserved = SUM(case(convert(int, vadump.base) ^ 0) when 0 then 0 else 1 end),
free = SUM(case(convert(int, vadump.base) ^ 0x0) when 0 then 1 else 0 end)
from
(select CONVERT(varbinary, sum(region_size_in_bytes)) as size,
region_allocation_base_address as base
from sys.dm_os_virtual_address_dump
where region_allocation_base_address <> 0x0
group by region_allocation_base_address
UNION(
select CONVERT(varbinary, region_size_in_bytes),
region_allocation_base_address
from sys.dm_os_virtual_address_dump
where region_allocation_base_address = 0x0)
)
as vadump
group by size)
select * from vasummary
go
-- Get me all clerks that take some memory
--
select * from sys.dm_os_memory_clerks
where (single_pages_kb > 0) or (multi_pages_kb > 0)
or (virtual_memory_committed_kb > 0)
go
-- Get me stolen pages
--
select (SUM(single_pages_kb)*1024)/8192 as total_stolen_pages
from sys.dm_os_memory_clerks
go
-- Breakdown clerks with stolen pages
select type, name, sum((single_pages_kb*1024)/8192) as stolen_pages
from sys.dm_os_memory_clerks
where single_pages_kb > 0
group by type, name
order by stolen_pages desc
go
-- Get me the total amount of memory consumed by multi_page consumers
--
select SUM(multi_pages_kb)/1024 as total_multi_pages_mb
from sys.dm_os_memory_clerks
go
-- What about multi_page consumers
--
select type, name, sum(multi_pages_kb)/1024 as multi_pages_mb
from sys.dm_os_memory_clerks
where multi_pages_kb > 0
group by type, name
order by multi_pages_mb desc
go
-- Let's now get the total consumption of virtual allocator
--
select SUM(virtual_memory_committed_kb)/1024 as total_virtual_mem_mb
from sys.dm_os_memory_clerks
go
-- Breakdown the clerks who use virtual allocator
--
select type, name, sum(virtual_memory_committed_kb)/1024 as virtual_mem_mb
from sys.dm_os_memory_clerks
where virtual_memory_committed_kb > 0
group by type, name
order by virtual_mem_mb desc
go
-- Is anyone using AWE allocator?
--
select SUM(awe_allocated_kb)/1024 as total_awe_allocated_mb
from sys.dm_os_memory_clerks
go
-- Who is the AWE user?
--
select type, name, sum(awe_allocated_kb)/1024 as awe_allocated_mb
from sys.dm_os_memory_clerks
where awe_allocated_kb > 0
group by type, name
order by awe_allocated_mb desc
go
-- What is the total memory used by the clerks?
--
select (sum(multi_pages_kb)+
SUM(virtual_memory_committed_kb)+
SUM(awe_allocated_kb))/1024
from sys.dm_os_memory_clerks
go
--
-- Does this sync up with what the node thinks?
--
select SUM(virtual_address_space_committed_kb)/1024 as total_node_virtual_memory_mb,
SUM(locked_page_allocations_kb)/1024 as total_awe_memory_mb,
SUM(single_pages_kb)/1024 as total_single_pages_mb,
SUM(multi_pages_kb)/1024 as total_multi_pages_mb
from sys.dm_os_memory_nodes
where memory_node_id != 64
go
--
-- Total memory used by SQL Server through SQLOS memory nodes
-- including DAC node
-- What takes up the rest of the space?
select (SUM(virtual_address_space_committed_kb)+
SUM(locked_page_allocations_kb)+
SUM(multi_pages_kb))/1024 as total_sql_memusage_mb
from sys.dm_os_memory_nodes
go
--
-- Who are the biggest cache stores?
select name, type, (SUM(single_pages_kb)+SUM(multi_pages_kb))/1024
as cache_size_mb
from sys.dm_os_memory_cache_counters
where type like 'CACHESTORE%'
group by name, type
order by cache_size_mb desc
go
--
-- Who are the biggest user stores?
select name, type, (SUM(single_pages_kb)+SUM(multi_pages_kb))/1024
as cache_size_mb
from sys.dm_os_memory_cache_counters
where type like 'USERSTORE%'
group by name, type
order by cache_size_mb desc
go
--
-- Who are the biggest object stores?
select name, type, (SUM(single_pages_kb)+SUM(multi_pages_kb))/1024
as cache_size_mb
from sys.dm_os_memory_clerks
where type like 'OBJECTSTORE%'
group by name, type
order by cache_size_mb desc
go
select mc.type, mo.type from sys.dm_os_memory_clerks mc
join sys.dm_os_memory_objects mo
on mc.page_allocator_address = mo.page_allocator_address
group by mc.type, mo.type
order by mc.type, mo.type
go
-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)
January 22, 2013 at 5:28 am
ben.brugman (1/22/2013)
CheckDB effectively destroys the cache. (Why?).
Because it reads every single allocated page in the database and, like all other operations, reads them through the buffer pool. Unless you happen to have a buffer pool far larger than the database, running CheckDB will throw just about all other pages out of the buffer pool as it runs.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 22, 2013 at 10:07 am
Additional, we now saw the PLE drop from a few thousand to 7 within the 30 second time frame.
Size of the cache at that moment was 8 Gb.
This means that SQL-server expects that within 14 seconds the total memory will be refressed.
the process that was running at that time was running for 13 seconds.
(This was probably the process wich did cause the 'effect').
But how is it possible that SQL-server Expects that the memory can be refressed (8Gb of it) in 14 seconds ?
I do not think normal disk IO (SAN) can handle that amount of IO.
How is the PLE determined?
Thanks for your time and attention.
ben brugman
sorry for shouting.:-D
January 22, 2013 at 4:09 pm
ben.brugman (1/22/2013)
But how is it possible that SQL-server Expects that the memory can be refressed (8Gb of it) in 14 seconds ?I do not think normal disk IO (SAN) can handle that amount of IO.
How is the PLE determined?
Why don't you stop inventing your own definition of PLE and read the documentation?
http://msdn.microsoft.com/en-us/library/ms189628.aspx
Page life expectancy - Indicates the number of seconds a page will stay in the buffer pool without references.
And based on the above please try to explain what PLE has to do with disk IO and SAN?
January 24, 2013 at 4:58 am
Alexander Suprun (1/22/2013)
ben.brugman (1/22/2013)
But how is it possible that SQL-server Expects that the memory can be refressed (8Gb of it) in 14 seconds ?I do not think normal disk IO (SAN) can handle that amount of IO.
How is the PLE determined?
Why don't you stop inventing your own definition of PLE and read the documentation?
http://msdn.microsoft.com/en-us/library/ms189628.aspx
Page life expectancy - Indicates the number of seconds a page will stay in the buffer pool without references.
The definition status that PLE is a prediction.
Predictions often are based on something from the past.
So I would like to know how PLE is determined.
And based on the above please try to explain what PLE has to do with disk IO and SAN?
I have seen very low PLE's starting this thread 7 was the lowest. (Now I have seen even lower). A Page Life Expectancy of 7 seconds for 8 Gb memory, suggest that the next 14 seconds all pages in Memory will be replaced by reading in new pages. (That is 8 Gb in 14 seconds this is over 500 Mb per second. In real life a disk (even a SAN) can not deliver this amount of data.
Other scenario's are possible, but for the other scenario's the number of bytes read per second is even higher. To get to an PLE of 7 an enourmous amount of data has to be replaces in a short time.
(I asume the PLE is some sort of average).
So to understand what PLE means, I would like to know how this prediction is calculated. (My expectation about the lowest a PL (average Page Life) can become is higher than 7)
Thanks for your reaction,
Ben Brugman
Additional, (It is a bit more complex)
The PLE of 7 actually does not expect a page to be replaced within an average of 7 seconds it expects a page to be replaced or touched (referenced) within those 7 (or is it 14) seconds. The text above ignores the fact that a large number of pages can be referenced within the timeperiod.
January 24, 2013 at 8:40 am
GilaMonster (1/22/2013)
ben.brugman (1/22/2013)
CheckDB effectively destroys the cache. (Why?).Because it reads every single allocated page in the database and, like all other operations, reads them through the buffer pool. Unless you happen to have a buffer pool far larger than the database, running CheckDB will throw just about all other pages out of the buffer pool as it runs.
So for checkDB you need a maintenance window, first because it uses resources, secondly because it potentially can 'destroy' performance for other queries.
Because the backup does not do that, I thought wrongly that checkDB would not do that either.
Thanks for setting me strait,
Ben Brugman
January 24, 2013 at 8:51 am
Bhuvnesh (1/22/2013)
ben.brugman (1/22/2013)
Any suggestions how to catch the processes which cause the drop in PLE ?it means you have buffer cache issue and IO contentions
see these links
http://sqlblog.com/blogs/buck_woody/archive/2010/06/29/the-windows-page-file-and-sql-server.aspx
http://support.microsoft.com/kb/889654/en-us
and take help from these
code removed
Thanks for the links and the script.
It did help to get some insight.
At the moment I am missing the link with the Windows Page file. (second and third link). Because I assume (correct me if I am wrong) that the SQL-server pages are not 'swapped' out and therefore not a part of the pagefile which can be swapped out.
(Server is a dedicated SQL-server server so almost all of the memory is for SQL-server. Although some usage of SSMS is possible).
Thanks,
Ben
January 24, 2013 at 9:18 am
ben.brugman (1/24/2013)
At the moment I am missing the link with the Windows Page file. (second and third link). Because I assume (correct me if I am wrong) that the SQL-server pages are not 'swapped' out and therefore not a part of the pagefile which can be swapped out.
They can be, but that's if Windows decides to move a portion of SQL's memory into the page file (memory pressure and SQL not releasing memory when requested). It isn't something that will affect PLE, which is a SQL measurement and solely related to SQL's internal memory management.
I'd personally ignore PLE during CheckDB execution. Because it reads the entire database and because it marks pages brought in as 'disfavoured', meaning they'll be the first to be thrown out of cache when space is needed, PLE will be low, it's kinda expected.
Now, if PLE is dropping that badly during normal operation when CheckDB isn't running, you may well have a problem of too much data being read. It doesn't mean you have an IO contention problem, it means you're reading more data into memory than memory can handle. The fix is more memory, optimise queries or both.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 24, 2013 at 9:52 am
GilaMonster (1/24/2013)
ben.brugman (1/24/2013)
I'd personally ignore PLE during CheckDB execution. Because it reads the entire database and because it marks pages brought in as 'disfavoured', meaning they'll be the first to be thrown out of cache when space is needed, PLE will be very, it's kinda expected.
So even if the PLE goes extremely low during CheckDB it doesn't mean that the cache gets completely 'refreshed', so effectively pages which are touched a lot remain in the cache ?
I am stil wondering how PLE is calculated.
Suppose four pages.
Where the second number is the number of seconds the page has not been referred.
1. 200
2. 300
3. 400
4 8000
The average page life (since referral) is 2225.
After reading a new page.
1. 200
2. 300
3. 400
5 0
The average page life (since referral) is 225.
So this is not used for calculating the PLE. If the PLE is calculated of pages leaving the memory (overwriting them). Then with a PLE of 7 seconds and 8Gb of memory. There must be an awfull lot of overwriting.
At the moment we are looking into the PLE as a preventive measure, not because we have problems, but we want to be able to signal problems before they occure.
As for improving one of my suggestions was Clustering on the most important key, for some tables. Within those tables the number of pages read would be 1/5 th of the pages now with the heap.
Yes more memory would help, and rewriting the queries would help as wel. Rewriting the queries would be very expensive. Clustering and optimising for that would be less expensive. (But I am not making the decisions :crying:)
Thanks for your explenation about DBcheck,
Ben
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply