November 9, 2012 at 4:38 am
Can anyone explain why DbMail causes PLE to drop?
I have a server with Database Mail using the default of 600 seconds for the executable minimum lifetime (DatabaseMailExeMinimumLifeTime). I have changed this to 24 hours and will monitor the PLE this afternoon.
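For anyone wanting to make the same change, the lifetime is set through msdb's Database Mail configuration procedure. A sketch (86400 seconds = 24 hours):

```sql
-- Keep the Database Mail external program alive for 24 hours
-- (default is 600 seconds) instead of restarting it for each batch of mail.
EXEC msdb.dbo.sysmail_configure_sp
    @parameter_name  = 'DatabaseMailExeMinimumLifeTime',
    @parameter_value = '86400';

-- Verify the current value
EXEC msdb.dbo.sysmail_help_configure_sp
    @parameter_name = 'DatabaseMailExeMinimumLifeTime';
```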
I have a job which runs on the hour, every hour, between 7AM and 7PM.
Starting at 7AM, and repeating on the hour until 7PM, I have the following:
2012/11/09 07:00:53 AM
Database Mail (Database Mail Log)
DatabaseMail process is started
2012/11/09 07:00:57 AM
SQL Server (Current - 2012/11/09 01:01:00 PM)
An error occurred in Service Broker internal activation while trying to scan the user queue 'msdb.dbo.InternalMailQueue' for its status. Error: 1222, State: 51. Lock request time out period exceeded. This is an informational message only. No user action is required.
2012/11/09 07:01:22 AM
SQL Server (Current - 2012/11/09 01:01:00 PM)
An error occurred in Service Broker internal activation while trying to scan the user queue 'msdb.dbo.InternalMailQueue' for its status. Error: 1222, State: 51. Lock request time out period exceeded. This is an informational message only. No user action is required.
2012/11/09 07:11:19 AM
Database Mail (Database Mail Log)
DatabaseMail process is shutting down
This corresponds with my Page Life Expectancy problems on the hour, as seen in the attached graph.
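For reference, PLE can also be sampled directly from a DMV rather than only through PerfMon, which makes it easy to log alongside the job schedule. A sketch:

```sql
-- Sample Page Life Expectancy directly from the performance-counter DMV.
-- On NUMA hardware, 'Buffer Node' also exposes a per-node PLE value.
SELECT [object_name], counter_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND [object_name] LIKE '%Buffer Manager%';
```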
November 13, 2012 at 9:14 pm
The Database Mail issue could be circumstantial; it may not be to blame, and the lock timeout may just be a symptom of the root cause. PLE drops typically occur when a very large request for data not already in the buffer pool forces a large percentage of the pages in the pool to be replaced, but they can also be caused by external memory pressure forcing SQL Server to release memory to Windows.
Need more info...
What does the job do? What does it do with email specifically? Quantity of emails? Any attachments? How large? Based on a query?
How much memory in the machine? 32-bit OS? Bitness of SQL? How much memory is allocated to SQL (max memory)?
Anything else near these entries in the SQL error log, e.g. regarding a "significant" amount of memory being paged out?
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
November 14, 2012 at 8:28 am
I have spent most of the day compiling a reply, using a few minutes here and there, and while getting the information you requested I think I have found the problem. I will post that reply so you can see what I found and tomorrow will monitor PLE to see if I have had the desired results.
Thanks for pushing me in the right direction.
November 14, 2012 at 8:33 am
Thanks for looking into this. The server has improved with a few basic configurations that were not correct, but I am still seeing my PLE drop significantly on the hour, every hour, between 7AM and 7PM.
opc.three (11/13/2012)
Database Mail issue could be circumstantial, it may not be to blame, the lock timeout may just be a symptom of the root cause. PLE drops typically occur after a very large request requiring data not already available in the buffer pool that forces a large percentage of the pages in the pool to need to be replaced, but it can also be caused by external memory pressure forcing SQL Server to release memory to Windows.
I completely agree with your sentiments, but I find it suspicious that the PLE behaviour and the mail job have the same schedule, so to speak.
Need more info...
What does the job do? What does it do with email specifically? Quantity of emails? Any attachments? How large? Based on a query?
How much memory in the machine? 32-bit OS? Bitness of SQL? How much memory is allocated to SQL (max memory)?
Anything else near these entries in the SQL error log, e.g. regarding a "significant" amount of memory paged out?
Virtual server on VMware
Windows 2008 R2 Enterprise
64-bit - machine, OS and SQL Server
16 GB total physical memory
4 × Intel Xeon X7550 @ 2.00 GHz
Microsoft SQL Server 2012 - 11.0.2100.60 (X64)
Enterprise Edition (64-bit) RTM (we are currently planning a roll-out of SP1, starting with our dev box)
Minimum Server Memory - 0 GB
Maximum Server Memory - 12 GB
Optimize for Ad hoc Workloads - True
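For reference, the min/max memory settings above can be inspected or changed with sp_configure. A sketch (12288 MB matches the 12 GB cap reported above):

```sql
-- Memory settings are advanced options, so expose them first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cap the SQL Server memory; the remainder of the 16 GB is left
-- to the OS and the VMware tools
EXEC sp_configure 'max server memory (MB)', 12288;
RECONFIGURE;
```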
What does the job do?
This job sends email to our customers. The body and subject of the email varies depending on various conditions. We open a number of cursors and conditionally send the mail depending on the values obtained. I have attached pseudo-code of this process.
What does it do with email specifically?
Dynamically compose the subject and body according to the current data row and data values, send the mail and update the status in the communication table.
Quantity of emails?
On average I send 50 mails each hour although this could be anything between 10 and 100.
Any attachments?
No attachments.
How large?
They are all text, so a few KB each.
Based on a query?
Based on several queries.
I ran the two steps excluding the send mail and the update and I see my communications table is responsible for most of the IO.
Step 1
======
59 seconds
Scan count 1, logical reads 179534, physical reads 0, read-ahead reads 178986
Scan count 1, logical reads 179534, physical reads 0, read-ahead reads 272
Scan count 1, logical reads 179533, physical reads 0, read-ahead reads 1
Scan count 1, logical reads 179533, physical reads 0, read-ahead reads 177
Scan count 1, logical reads 179533, physical reads 0, read-ahead reads 0
Scan count 1, logical reads 179533, physical reads 0, read-ahead reads 7
Scan count 1, logical reads 179533, physical reads 0, read-ahead reads 29
Scan count 1, logical reads 179534, physical reads 0, read-ahead reads 0
Scan count 1, logical reads 179534, physical reads 0, read-ahead reads 0
Scan count 1, logical reads 179534, physical reads 0, read-ahead reads 0
Scan count 2, logical reads 109527, physical reads 2, read-ahead reads 42597
Step 2
======
48 seconds
Scan count 1170, logical reads 11336, physical reads 7, read-ahead reads 3
Scan count 601, logical reads 6820, physical reads 5, read-ahead reads 11
Scan count 1102, logical reads 10581, physical reads 0, read-ahead reads 0
Scan count 180, logical reads 1721, physical reads 2, read-ahead reads 0
Scan count 382, logical reads 3641, physical reads 0, read-ahead reads 0
Scan count 3301, logical reads 36610, physical reads 43, read-ahead reads 1444
I am not happy with the number of read-ahead reads occurring. This table has 6,946,455 rows; data is 1.36 GB and indexes are 867.65 MB. The execution plan for these queries shows a table scan on the offending table. I have now created a covering index for this query and, lo and behold, a massive improvement.
Before - Table 'Communication'. Scan count 1, logical reads 182701, physical reads 1, read-ahead reads 167764
After - Table 'Communication'. Scan count 8, logical reads 65, physical reads 0, read-ahead reads 0
SQL Server Execution Times:
Before - CPU time = 4275 ms, elapsed time = 10566 ms.
After - CPU time = 15 ms, elapsed time = 16 ms.
COST
Before - 168.343
After - 0.317
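The exact index definition wasn't posted. As an illustration only, a covering index for an hourly status-driven query like this typically keys on the columns being filtered and INCLUDEs the columns merely selected; every column name below is hypothetical:

```sql
-- Hypothetical covering index: key on the columns the hourly query filters
-- and sorts on, INCLUDE the columns it only selects, so the scan of the
-- 6.9M-row table becomes a narrow index seek.
CREATE NONCLUSTERED INDEX IX_Communication_Status_Scheduled
ON dbo.Communication (CommunicationStatus, ScheduledDate)
INCLUDE (CustomerID, EmailAddress, TemplateID);
```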
November 14, 2012 at 4:40 pm
Thanks for the detailed write-up! Are you confirming that the reduction in reads due to adding a new covering index has curbed the PLE drops?
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
November 15, 2012 at 1:47 am
opc.three (11/14/2012)
Thanks for the detailed write-up! Are you confirming that the reduction in reads due to adding a new covering index has curbed the PLE drops?
I wasn't able to confirm, as the index was only created at the end of the day and I was waiting to see what happened today. Unfortunately, I am getting the same behaviour.
November 15, 2012 at 6:37 am
OK, have to keep going 🙂
It would be convenient to know whether these drops were caused by internal or external memory pressure, but it could be a combination.
External - do you have graphs for the "Memory: Available MBytes" and "SQL Server: Memory Manager: Total Server Memory (KB)" counters?
Internal - I would consider setting up a server-side trace to capture all batch-completed and procedure-completed events to a file (including start and end time, CPU, reads and writes) across a wide enough period when you expect a PLE drop, so you can analyze the activity and try to correlate the drop to a single event or a few events. From there maybe you can isolate it down to one or a few statements. Obligatory disclaimer: you know your instance, be careful running a trace in production.
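A minimal server-side trace along those lines might look like the following sketch. The file path is a placeholder, and I've used RPC:Completed (event 10) alongside SQL:BatchCompleted (event 12) since those events reliably carry the CPU/reads/writes columns; event and column IDs are from the sp_trace documentation (1 = TextData, 13 = Duration, 14 = StartTime, 15 = EndTime, 16 = Reads, 17 = Writes, 18 = CPU):

```sql
DECLARE @traceid int, @on bit = 1;

-- Create a file trace with rollover (@options = 2), 256 MB per file;
-- the path is a placeholder and must omit the .trc extension.
EXEC sp_trace_create @traceid OUTPUT, 2, N'D:\Traces\ple_hunt', 256;

-- SQL:BatchCompleted (12): TextData, Duration, Start/End, Reads, Writes, CPU
EXEC sp_trace_setevent @traceid, 12, 1,  @on;
EXEC sp_trace_setevent @traceid, 12, 13, @on;
EXEC sp_trace_setevent @traceid, 12, 14, @on;
EXEC sp_trace_setevent @traceid, 12, 15, @on;
EXEC sp_trace_setevent @traceid, 12, 16, @on;
EXEC sp_trace_setevent @traceid, 12, 17, @on;
EXEC sp_trace_setevent @traceid, 12, 18, @on;

-- RPC:Completed (10): same columns
EXEC sp_trace_setevent @traceid, 10, 1,  @on;
EXEC sp_trace_setevent @traceid, 10, 13, @on;
EXEC sp_trace_setevent @traceid, 10, 14, @on;
EXEC sp_trace_setevent @traceid, 10, 15, @on;
EXEC sp_trace_setevent @traceid, 10, 16, @on;
EXEC sp_trace_setevent @traceid, 10, 17, @on;
EXEC sp_trace_setevent @traceid, 10, 18, @on;

EXEC sp_trace_setstatus @traceid, 1;   -- start the trace
-- Later: sp_trace_setstatus @traceid, 0 (stop) then 2 (close/delete), and read
-- the file with: SELECT * FROM fn_trace_gettable(N'D:\Traces\ple_hunt.trc', default);
```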
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
November 16, 2012 at 4:19 am
This is getting very complicated; I am finding more and more problems. I will report back once I better understand what is going on. In the meantime, the memory usage in PerfMon is consistent.
I have a script that shows me how much memory each database is using in the buffer pool. I set up a job to record this information every 5 minutes and then looked for databases that were taking over. I found an audit database doing just that.
The complication comes from two queries that select from this database, a dodgy update and a trigger.
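The per-database buffer usage script mentioned above is commonly written against sys.dm_os_buffer_descriptors; a sketch of that approach:

```sql
-- How much of the buffer pool each database currently occupies.
-- Each row in the DMV is one 8 KB page, hence COUNT(*) * 8 / 1024 for MB.
SELECT
    CASE database_id WHEN 32767 THEN 'ResourceDb'
         ELSE DB_NAME(database_id) END AS database_name,
    COUNT(*) * 8 / 1024 AS buffer_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY buffer_mb DESC;
```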