December 22, 2016 at 1:39 pm
We have a monitoring tool, which is Solarwind DPA, that shows sudden drop of P.L.E. at different times. For example it can drop from 3,000 sec to about 100 sec. During that times, some applications and processes experience some delay and timeouts.
My goal is to find a process or procedure that causes this sudden drop. I tried to trace the server with setting filter to 5,000,000 mks and greater for duration, but could not find uniquely what causes it.
Any ideas how to find the reason for it ?
Thanks
December 22, 2016 at 2:57 pm
In DPA you can click on a day, then an hour, and then using Timeslice you can drill down to the processes that were running in a particular minute. I use that fairly often to identify problem queries & processes.
December 22, 2016 at 3:12 pm
I also tried this, but within each timeslice there are dozens of processes, and how can I identify the culprit?
December 22, 2016 at 3:15 pm
On the SQL tab they are usually sorted by the ones that took the longest/had the most Waits associated. I start there.
December 22, 2016 at 3:32 pm
I'll suggest that it's almost a waste of time. It's certainly an indication that something wasn't in memory when it was needed but that's not necessarily a problem especially if it's a temporary sudden drop for a high norm.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 22, 2016 at 3:41 pm
I would agree if that was a small drop, for example from 3000 to 2000 sec. But a drop from 3000 to 100 sec I understand that some enormous process requested huge amount of memory and thereby forced SQL Server to flush big chunk out of cache.
December 22, 2016 at 6:08 pm
SQL Guy 1 (12/22/2016)
I would agree if that was a small drop, for example from 3000 to 2000 sec. But a drop from 3000 to 100 sec I understand that some enormous process requested huge amount of memory and thereby forced SQL Server to flush big chunk out of cache.
I'm sure there will be those that disagree but the only time I truly concern myself with a vertical drop with PLE is if it drops and stays there for an uncomfortable period of time. Oddly enough, if you run a cache line on Perfmon, you usually won't see a problem. I'd be more concerned about a large drop in the cache line.
If you are concerned about occasional (and they should only be occasional during a give 24 hour period) then take a look at the following article to figure out what you can do about it. First line of defense is to figure out which code is doing it and fix it. I regularly monitor the top 10 worst performers on my systems and fix them if they're actually a problem.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 23, 2016 at 3:01 am
It would probably be something doing a scan of a large table. Look at execution plans for table scan/index scan.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 23, 2016 at 3:47 am
If it is a recent development I would suggest having a look at the changes made around the time that the "Problem" started occuring.
Chatting with the developers responsible will help narrow it down to a more precise cause.
Gails idea of a full table scan maybe being the culprit is also a very good place to start.
If nothing further can be done and other applications are being affected during this period of drop, do what my last employer did and throw more memory at it.
December 23, 2016 at 4:30 am
We can start looking into the database taking most of the memory and then looking into the indexes taking most of the memory using sys.dm_os_buffer_descriptors . This will help drill down the queries taking up more memory from the buffer.
December 23, 2016 at 7:58 am
Jeff Moden (12/22/2016)
but the only time I truly concern myself with a vertical drop with PLE is if it drops and stays there for an uncomfortable period of time.
After a sudden drop it steadily growths over time, until the next drop. But the problem is that we have a ETL process and if it happens to run during those low PLE (usually 50 - 200), it experience PAGEIOLATCH_EX wait types and is timing out as a result. I tried to remove as many indexes as possible from the table where it is loading into (it has 200+ columns), but I can't remove all because they are used in subsequent selects.
... then take a look at the following article to figure out what you can do about it.
Could you please give a link for the article? Thanks
GilaMonster (12/23/2016)
It would probably be something doing a scan of a large table. Look at execution plans for table scan/index scan.
I like this idea. How can I catch the processes that are doing large table scans? Can I trace it? And should I include C.I. scans as well?
kevaburg (12/23/2016)
If it is a recent development I would suggest having a look at the changes made around the time that the "Problem" started occuring.
It is a recent migration from 2008 to 2014. We also had drops in PLE in our old system of course, but not such steep. What I suspect is probably in 2014 SQL Server overestimates the number of pages (probably for certain circumstances) that results in flushing of large areas in memory.
Tushar Kanti (12/23/2016)
We can start looking into the database taking most of the memory and then looking into the indexes taking most of the memory using sys.dm_os_buffer_descriptors . This will help drill down the queries taking up more memory from the buffer.
This what I did in the first place. But this query is also resource consuming and does not return results immediately, we have about 120 GB of memory. And another problem is that my tool (DPA) displays reports with some delay, so if I apply this query after the fact, it is almost useless.
December 23, 2016 at 8:22 am
That is odd - I had to look to make sure that I didn't post this as I started to post similar ,earlier in the week.
I actually posted something related to this, but not specifically PLE, on Thwack.
So I am watching this with interest.............................
We too are running SQL 2014, using DPA for monitoring, and have almost exactly the same amount of Memory specified, (so throwing memory at this issue as somebody suggested is not an option as we are running Standard Edition).
I have been tracking low PLE values for protracted periods during the day for some months now, but have been at a loss to account for it.
I expected to see additional IO and an impact on Cache Hit Ratio, but this is not the case as far as I can tell.
Curious to know how long has the server (Hardware or SQL) been up, what size is the Buffer Cache reported by DPA now (as opposed to Target), and assuming it has been up for some time, does it show a decline over the Month view?
I run some DMV code (may even have originated from somebody on here) that shows what is in the Buffer Pool ,and have specifically run it on interval and immediately when I see this drop in PLE to see if the contents change (i.e. which Indexes or portions of are in ther ) and oddly it doesn't seem to change.
I too had expected the contents of the Bufferpool to be flushed to accommodate something else, but that seems not to be the case - unless it switches back before I can look at it?
What I can say is that I have been unable to attribute the low PLE values to any timeouts we see - in our case they seem to be caused bu Locking, Blocking and Deadlocks but not a low PLE value.
Steve O.
December 23, 2016 at 10:45 am
Another thing to keep in mind is that some of what you are describing can be totally normal and nothing to worry about. The low PLEs can show when the lazy writer goes through it's cycle and sweeps the cache - you may just be capturing the values at the beginning of this.
If it is caused by a query, you should have what you need in sys.dm_exec_query_stats and queries based off that. I'd look first at the physical reads. You'd also want to include the query plan and look at missing indexes with the high physical reads. There are several of those queries available - Glenn Barry's site has some. I think I got this off a post by Jonathan Kehayias:
SELECT TOP 50
qs.execution_count,
AvgPhysicalReads = isnull( qs.total_physical_reads/ qs.execution_count, 0 ),
MinPhysicalReads = qs.min_physical_reads,
MaxPhysicalReads = qs.max_physical_reads,
AvgPhysicalReads_kbsize = isnull( qs.total_physical_reads/ qs.execution_count, 0 ) *8,
MinPhysicalReads_kbsize = qs.min_physical_reads*8,
MaxPhysicalReads_kbsize = qs.max_physical_reads*8,
CreationDateTime = qs.creation_time,
SUBSTRING(qt.[text], qs.statement_start_offset/2, (
CASE
WHEN qs.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
ELSE qs.statement_end_offset
END - qs.statement_start_offset)/2
) AS query_text,
qt.[dbid],
qt.objectid,
tp.query_plan,
tp.query_plan.exist('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/MissingIndexes') missing_index_info
FROM
sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text (qs.[sql_handle]) AS qt
OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
ORDER BY AvgPhysicalReads DESC
Sue
December 23, 2016 at 1:10 pm
SQL Guy 1 (12/23/2016)
Jeff Moden (12/22/2016)
but the only time I truly concern myself with a vertical drop with PLE is if it drops and stays there for an uncomfortable period of time.
After a sudden drop it steadily growths over time, until the next drop. But the problem is that we have a ETL process and if it happens to run during those low PLE (usually 50 - 200), it experience PAGEIOLATCH_EX wait types and is timing out as a result. I tried to remove as many indexes as possible from the table where it is loading into (it has 200+ columns), but I can't remove all because they are used in subsequent selects.
... then take a look at the following article to figure out what you can do about it.
Could you please give a link for the article? Thanks
Apologies for the missing link. Had several interruptions.
The first link is by Paul Randall where he explains some of the things going on and some of the reasons to either worry about low PLE or not.
https://sqlperformance.com/2014/10/sql-performance/knee-jerk-page-life-expectancy
A great article on just about everything one might think of that can affect PLE is found in the following article by Steve Hood, who has spoken on the subject many times (if you the chance to see Steve Hood's presentation on PLE, you won't be disappointed). For example, did you know that the size of TempDB can affect PLE?
https://simplesqlserver.com/2013/08/19/fixing-page-life-expectancy-ple/
Steve's article is a bit long but well structured and well worth the read.
Let me say that everything that SQL Server does has to go through memory in one form or another. One of the largest wastes of memory is performance challenged code which can waste huge amounts of memory for things like accidental many-to-many joins (Do you have DISTINCT in your code? Might be a bigger problem than you think), Triangular Joins, non-SARGable code, code that simply returns too many unused columns, code that does a lot of index scans, etc, etc. And don't forget that TempDB also start off in memory and only spills to disk if something get's too big.
The bottom line is, performance and a decent PLE is in the code... or not. Find and fix the performance challenged code and don't forget there are two types... batch code and normal everyday GUI code. Either can require too much memory to be used and, surprisingly, it's normally not the batch code. It frequently looks like it might be the batch code because it runs relatively infrequently but the normal GUI code might have a whole lot of wasted memory tied up because it wasn't written well and uses too many resources.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply