Page life expectancy diving to sub-10 on a 128 GB server

  • Ok, does the same thing happen on each node of the cluster? Just trying to get a wider picture.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • sys.dm_os_wait_stats

    CXPACKET is at the top (15%), followed by...

    SP_SERVER_DIAGNOSTICS_SLEEP

    XE_TIMER_EVENT

    HADR_FILESTREAM_IOMGR_IOCOMPLETION

    DIRTY_PAGE_POLL

    BROKER_EVENTHANDLER

    XE_DISPATCHER_WAIT

    LOGMGR_QUEUE

    Most of those are meaningless waits, background processes that are supposed to be waiting. That list says nothing of any value.

    CheckDB is not run against this system, I run it on the DR site.

    What is your DR technology? How does the data get to DR?

    There are very few DR methods that allow you to run CheckDB against your DR server and effectively check your production DB.

    Virtual machine? If so, make sure that the memory's not overcommitted.

    Any entries in the error log about cache flushes?
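
    A quick way to check is to search the current error log for the FlushCache messages, for example:

    EXEC xp_readerrorlog 0, 1, N'FlushCache';   -- 0 = current log, 1 = SQL Server error log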

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (3/8/2013)


    sys.dm_os_wait_stats

    Most of those are meaningless waits, background processes that are supposed to be waiting. That list says nothing of any value.

    Yep, that's kinda the point I'm making by noting that iolatch is down at 12th under all of this. 🙂

    The cxpacket waits are also pretty standard for the environment, as in they also existed on the old server. It's a product of less than ideal code in the ERP database.

    What is your DR technology? How does the data get to DR?

    Any entries in the error log about cache flushes?

    The DR site is maintained by restores from production SQL Server backups. No VM or SAN magic at play. I set CheckDB to run there last week after seeing this pattern of activity; normally it would run locally on Sunday night after reindexing and before the weekly full backup (diffs daily at 9pm, hourly log backups). We're using a wrapped-up version of Ola Hallengren's stored-procedure-based maintenance solution.

    Ok, does the same thing happen on each node of the cluster? Just trying to get a wider picture.

    There's a second clustered instance that fails over in the other direction and runs on the secondary node (it's the "live training" system, Enterprise licensed). It gets much less activity than the production ERP. That instance is holding a steady PLE of over 60,000.

  • allmhuran (3/8/2013)


    GilaMonster (3/8/2013)


    sys.dm_os_wait_stats

    Most of those are meaningless waits, background processes that are supposed to be waiting. That list says nothing of any value.

    Yep, that's kinda the point I'm making by noting that iolatch is down at 12th under all of this. 🙂

    You're missing my point. Those waits are ones you ignore completely. Saying that IOLatch waits are 12th under that list means they're the second top useful wait in the system. Those background waits are normal and expected to be high, they're from processes that spend most of their time doing nothing, so you just filter them out of any wait analysis.
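
    For reference, that filtering usually just means an exclusion list, something along these lines (the exact set of benign waits varies by version and by who you ask, so treat the list as illustrative):

    SELECT TOP (10) wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        -- background/idle waits: expected to be high, no diagnostic value
        N'LAZYWRITER_SLEEP', N'SLEEP_TASK', N'SLEEP_SYSTEMTASK', N'CHECKPOINT_QUEUE',
        N'REQUEST_FOR_DEADLOCK_SEARCH', N'XE_TIMER_EVENT', N'XE_DISPATCHER_WAIT',
        N'BROKER_EVENTHANDLER', N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH',
        N'LOGMGR_QUEUE', N'DIRTY_PAGE_POLL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
        N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        N'WAITFOR', N'ONDEMAND_TASK_QUEUE', N'FT_IFTS_SCHEDULER_IDLE_WAIT',
        N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'DISPATCHER_QUEUE_SEMAPHORE')
    ORDER BY wait_time_ms DESC;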

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (3/8/2013)


    allmhuran (3/8/2013)


    GilaMonster (3/8/2013)


    sys.dm_os_wait_stats

    Most of those are meaningless waits, background processes that are supposed to be waiting. That list says nothing of any value.

    Yep, that's kinda the point I'm making by noting that iolatch is down at 12th under all of this. 🙂

    You're missing my point. Those waits are ones you ignore completely. Saying that IOLatch waits are 12th under that list means they're the second top useful wait in the system. Those background waits are normal and expected to be high, they're from processes that spend most of their time doing nothing, so you just filter them out of any wait analysis.

    No, I get your point, I'm just making it explicit that there's nothing particularly unusual at the top end of the list.

    Regarding log entries: yes, there are lots of cache flush messages; I understand the checkpoint process now logs these by default, without needing a trace flag.

    - "FlushCache: cleaned up 2037 bufs with 1673 writes in 86339 ms (avoided 1002 new dirty bufs) for db 7:0"

    - "average throughput: 0.18 MB/sec, I/O saturation: 3196, context switches 7021"

    But I'm curious where you might be going... are you thinking a recovery interval issue, or perhaps this?:

    http://blogs.msdn.com/b/joaol/archive/2008/11/20/sql-server-checkpoint-problems.aspx
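
    (For reference, the recovery interval setting itself can be checked with something like this; a sketch of the check only, not a claim about what it's set to here:)

    -- 0, the default, means the automatic target of roughly one checkpoint a minute
    SELECT name, value_in_use
    FROM sys.configurations
    WHERE name = N'recovery interval (min)';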

    Edit:

    I might as well note here that I'm also getting appdomain unloads and messages indicating long I/O requests, but I figured that was already a given based on the numbers already posted.

  • allmhuran (3/8/2013)


    GilaMonster (3/8/2013)


    allmhuran (3/8/2013)


    GilaMonster (3/8/2013)


    sys.dm_os_wait_stats

    Most of those are meaningless waits, background processes that are supposed to be waiting. That list says nothing of any value.

    Yep, that's kinda the point I'm making by noting that iolatch is down at 12th under all of this. 🙂

    You're missing my point. Those waits are ones you ignore completely. Saying that IOLatch waits are 12th under that list means they're the second top useful wait in the system. Those background waits are normal and expected to be high, they're from processes that spend most of their time doing nothing, so you just filter them out of any wait analysis.

    No, I get your point, I'm just making it explicit that there's nothing particularly unusual at the top end of the list.

    You haven't made that clear because all you've posted are expected, normal waits that will have high wait times on any normal system and are absolutely irrelevant to anything. To draw any useful conclusions we need the list of top *relevant* waits.

    Those waits you've posted are to be completely ignored when doing wait analysis, they will be high, that's expected, that's normal, that's of no interest to anyone. If you want to make a point about the top waits on your system, you have to ignore the useless waits.

    Regarding log entries: yes, there are lots of cache flush messages; I understand the checkpoint process now logs these by default, without needing a trace flag.

    - "FlushCache: cleaned up 2037 bufs with 1673 writes in 86339 ms (avoided 1002 new dirty bufs) for db 7:0"

    - "average throughput: 0.18 MB/sec, I/O saturation: 3196, context switches 7021"

    Not what I was talking about. I'm wondering whether there are any explicit clears of the buffer pool, though thinking about it, that might not be logged; I might have to check the default trace. Can't recall if SQL 2012 logs buffer pool clearing.

    Edit: SQL 2008 doesn't log when the buffer pool is manually cleared, just plan cache.
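
    Something like this against the default trace would show any DBCC commands it captured (EventClass 116, Audit DBCC Event) - assuming DROPCLEANBUFFERS is captured at all, which is exactly the uncertainty here:

    DECLARE @path nvarchar(260);
    SELECT @path = [path] FROM sys.traces WHERE is_default = 1;

    SELECT StartTime, TextData, HostName, ApplicationName, LoginName
    FROM fn_trace_gettable(@path, DEFAULT)
    WHERE EventClass = 116   -- Audit DBCC Event
    ORDER BY StartTime DESC;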

    But I'm curious where you might be going... are you thinking a recovery interval issue, or perhaps this?:

    Err... no.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail,

    I appreciate your help, and I get the impression you are finding this frustrating. Obviously I don't pretend to think that anyone here is obligated to provide assistance. I'm sensing a bit of peevedness in your "tone" (such as it can exist in text on the internet), which is not usual for you (I've been here a while). If I've offended you somehow then I apologise. To be completely safe, here is the whole shebang: everything with a wait_time_ms greater than 100 ms:

    wait_type waiting_tasks_count wait_time_ms max_wait_time_ms signal_wait_time_ms

    ------------------------------------------------------------ -------------------- -------------------- -------------------- --------------------

    LCK_M_SCH_S 3190 4137218 2571390 1083

    LCK_M_SCH_M 15335 1718013 138263 4585

    LCK_M_S 30453 259641306 621303 10052

    LCK_M_U 2955993 236862895 630221 1039438

    LCK_M_X 205484 39958282 3600012 30380

    LCK_M_IS 188 431166 30006 57

    LCK_M_IU 583 2517242 419993 181

    LCK_M_IX 106398 37221947 612063 13362

    LCK_M_RS_S 391 33500 4668 122

    LCK_M_RS_U 4 620148 598963 0

    LCK_M_RIn_NL 14 660299 314940 4

    LATCH_SH 2233431 10564487 32926 406152

    LATCH_EX 273750189 452244634 78913 43812716

    PAGELATCH_SH 339354133 28952028 41848 7287053

    PAGELATCH_UP 3143816 2908633 16597 526907

    PAGELATCH_EX 206255627 137615205 16888 4318131

    PAGEIOLATCH_SH 44593639 1813205109 58939 2430094

    PAGEIOLATCH_UP 130937 10172459 21594 9510

    PAGEIOLATCH_EX 8005632 289476694 59089 203365

    LAZYWRITER_SLEEP 18171572 3689896893 190424560 2672397

    IO_COMPLETION 5601119 80780956 22041 84509

    ASYNC_IO_COMPLETION 12517 105080747 3904073 2968

    ASYNC_NETWORK_IO 23453939 99058645 5571 2238681

    SLEEP_BPOOL_FLUSH 10496805 51177436 547 1007987

    CHKPT 1 1014 1014 0

    SLEEP_DBSTARTUP 12 1114 109 0

    SLEEP_MASTERDBREADY 1 1003 1003 0

    SLEEP_DCOMSTARTUP 1 216 216 0

    SLEEP_TASK 815847806 1012508052 1745 24064999

    SLEEP_SYSTEMTASK 1 1014 1014 0

    RESOURCE_SEMAPHORE 25 101790 8740 15

    DTC 21064 3963903 20812 6210

    OLEDB 443352172 76989842 600011 0

    THREADPOOL 3916 172557 250 1

    BROKER_RECEIVE_WAITFOR 112399 572925930 599885 20087

    SOS_SCHEDULER_YIELD 720091477 209117059 967 208381427

    SOS_PHYS_PAGE_CACHE 533193 1337458 48 118660

    ONDEMAND_TASK_QUEUE 202033 1673634559 714368038 98900

    LOGMGR_QUEUE 37138076 1847308731 749 1303212

    REQUEST_FOR_DEADLOCK_SEARCH 369714 1841173929 5322 1841173929

    CHECKPOINT_QUEUE 341672 1772953842 190423444 62335

    BACKUPBUFFER 4757411 27393437 2191 498126

    BACKUPIO 4981486 57645578 21720 199425

    BACKUPTHREAD 16988 6443112 4132442 2078

    IMPPROV_IOWAIT 77 4538 1467 1

    SQLTRACE_FILE_BUFFER 1121 180323 1015 323

    SQLTRACE_INCREMENTAL_FLUSH_SLEEP 254766 1021363011 4251 1311

    DTC_STATE 48 2399 1400 1

    BROKER_EVENTHANDLER 739608 1837650376 237319032 99197

    FCB_REPLICA_READ 20 226 34 1

    WRITELOG 37128853 356839690 14673 9631184

    EXCHANGE 2114 673 16 460

    CMEMTHREAD 1940400 342067 222 252202

    CXPACKET 604006616 5835421112 23347624 157697104

    WAITFOR 202590 1014390597 5649 25583

    EXECSYNC 670513 4853081 15810 87111

    MSQL_XP 613884 2205632 136873 0

    MSQL_DQ 6741 10106339 600012 0

    LOGBUFFER 47737 9160552 17304 7957

    TRANSACTION_MUTEX 17358 12971 1101 1948

    CLR_CRST 227 797 63 42

    CLR_MANUAL_EVENT 46 535 87 6

    CLR_AUTO_EVENT 18685 3685702326 488557665 2629

    SQLCLR_APPDOMAIN 35 4857 615 6

    PRINT_ROLLBACK_PROGRESS 3 9022 3010 0

    LOGMGR_FLUSH 516 4255 16 1

    DTC_ABORT_REQUEST 153352 461190530 3357 39964

    DEADLOCK_ENUM_MUTEX 35 762 39 2

    SNI_CRITICAL_SECTION 13774930 6586373 151 322695

    EE_PMOLOCK 2163 344 11 293

    QUERY_EXECUTION_INDEX_SORT_EVENT_OPEN 1520 207574 22951 421

    BROKER_TASK_STOP 905536 1500945320 10283 254172

    DISPATCHER_QUEUE_SEMAPHORE 3130441 1462934078 86388187 669793

    XE_SERVICES_MUTEX 56 233 9 8

    XE_BUFFERMGR_ALLPROCESSED_EVENT 15 449 346 2

    XE_TIMER_EVENT 418994 1848760530 5591 1848754374

    XE_DISPATCHER_WAIT 148793 1847377373 60851 0

    BROKER_TO_FLUSH 892368 916224723 1491 71338

    CXROWSET_SYNC 1514734 673908 73 145582

    PREEMPTIVE_OS_GENERICOPS 94 168 99 0

    PREEMPTIVE_OS_AUTHENTICATIONOPS 10326220 3381803 500 0

    PREEMPTIVE_OS_DECRYPTMESSAGE 1920815 44629 4 0

    PREEMPTIVE_OS_DELETESECURITYCONTEXT 2432526 1316725 145 0

    PREEMPTIVE_OS_QUERYCONTEXTATTRIBUTES 1495266 134121 81 0

    PREEMPTIVE_OS_AUTHORIZATIONOPS 2016831 302903 3087 0

    PREEMPTIVE_OS_AUTHZINITIALIZECONTEXTFROMSID 13 257 66 0

    PREEMPTIVE_OS_LOOKUPACCOUNTSID 813745 3317965 743594 0

    PREEMPTIVE_OS_REVERTTOSELF 1659505 159876 47 0

    PREEMPTIVE_OS_CLUSTEROPS 1 119 119 0

    PREEMPTIVE_CLUSAPI_CLUSTERRESOURCECONTROL 387 1095 16 0

    PREEMPTIVE_COM_QUERYINTERFACE 5953 50003 10716 0

    PREEMPTIVE_OS_CRYPTOPS 451057 130557 263 0

    PREEMPTIVE_OS_CRYPTACQUIRECONTEXT 512028 198168 278 0

    PREEMPTIVE_OS_CRYPTIMPORTKEY 435881 61518 28 0

    PREEMPTIVE_OS_NETVALIDATEPASSWORDPOLICY 389357 227993 236 0

    PREEMPTIVE_OS_NETVALIDATEPASSWORDPOLICYFREE 389357 38520 32 0

    PREEMPTIVE_OS_DTCOPS 2 4784 2482 0

    PREEMPTIVE_DTC_ABORTREQUESTDONE 406074 298873 77 0

    PREEMPTIVE_DTC_BEGINTRANSACTION 427137 111134 2402 0

    PREEMPTIVE_DTC_COMMITREQUESTDONE 21166 7768 30 0

    PREEMPTIVE_DTC_ENLIST 427240 682695 1017 0

    PREEMPTIVE_DTC_PREPAREREQUESTDONE 21166 7875 282 0

    PREEMPTIVE_OS_FILEOPS 9469 103928 2194 0

    PREEMPTIVE_OS_CREATEFILE 7137 14878 1155 0

    PREEMPTIVE_OS_DELETEFILE 38 240 228 0

    PREEMPTIVE_OS_FLUSHFILEBUFFERS 1227 9010 1197 0

    PREEMPTIVE_OS_GETDISKFREESPACE 1193 2193 1403 0

    PREEMPTIVE_OS_GETFILEATTRIBUTES 10582 1731 65 0

    PREEMPTIVE_OS_GETVOLUMEPATHNAME 4036 18045 436 0

    PREEMPTIVE_OS_GETVOLUMENAMEFORVOLUMEMOUNTPOINT 2645 2276 5 0

    PREEMPTIVE_OS_SETFILEVALIDDATA 971 3746 403 0

    PREEMPTIVE_OS_WRITEFILE 22433 1408413 17183 0

    PREEMPTIVE_OS_WRITEFILEGATHER 244 2827239 1216411 0

    PREEMPTIVE_OS_LIBRARYOPS 53734 24115 224 0

    PREEMPTIVE_OS_GETPROCADDRESS 613884 7283 9 0

    PREEMPTIVE_OS_LOADLIBRARY 37 320 37 0

    PREEMPTIVE_OLEDBOPS 1175 671 63 0

    PREEMPTIVE_OS_PIPEOPS 261 12159026 5533491 0

    PREEMPTIVE_OS_SECURITYOPS 87746 15455 30 0

    PREEMPTIVE_OS_REPORTEVENT 256376 156492 263 0

    PREEMPTIVE_OS_WAITFORSINGLEOBJECT 267256 137870 120 0

    PREEMPTIVE_OS_QUERYREGISTRY 34910 48667 90 0

    PREEMPTIVE_TRANSIMPORT 103 110 4 0

    PREEMPTIVE_CREATEPARAM 2 1634 1502 0

    PREEMPTIVE_XE_CALLBACKEXECUTE 325526 2206 3 0

    PREEMPTIVE_XE_DISPATCHER 6 975973276 975835711 0

    PREEMPTIVE_XE_GETTARGETSTATE 98604 717109 296 0

    WRITE_COMPLETION 7247 99182 3241 40837

    FT_IFTSHC_MUTEX 2 129 129 0

    FT_IFTS_SCHEDULER_IDLE_WAIT 30551 1833359526 60043 3828

    SQLTRACE_FILE_WRITE_IO_COMPLETION 14 161 59 0

    BROKER_TRANSMISSION_WORK 1074 252 22 89

    BROKER_TRANSMISSION_TABLE 6551 14673 4810 927

    PWAIT_ALL_COMPONENTS_INITIALIZED 6 1478 249 2

    XE_LIVE_TARGET_TVF 12 47887 8017 1

    SP_SERVER_DIAGNOSTICS_SLEEP 1859695 3695457383 299996 3695457383

    LOGPOOL_FREEPOOLS 10109 144723 795 2185

    LOGPOOL_REPLACEMENTSET 2517 7773 41 609

    HADR_FILESTREAM_IOMGR_IOCOMPLETION 3595095 1848726935 1040 362937

    DIRTY_PAGE_POLL 16928727 1848675675 744 157076

    SECURITY_CRYPTO_CONTEXT_MUTEX 77 165 18 5

    FFT_RECOVERY 14 666 99 0

    WAITFOR_PER_QUEUE 2034 310 16 220

    My intentions were and are only as stated. Given the counters, the important stat seems to be the IO latch waits, which is why I noted the position and wait percentage of that particular row. But I also wanted to cover the possibility of someone replying with "is there anything unusual right at the top of the results?". I'm trying to provide information without flooding the thread with huge amounts of numbers and then expecting everyone else to sift through them for me.

    Regarding other cache flushes: Only one set (bound trees, sql plans, object plans) from back when I was setting sp_configure options on the server.

  • Which Patch level of 2012 are you running?

    RC0

    RTM

    or SP1?

    I ran into an issue like this on a Win7 box with RC0 that was excessively frustrating. In the end, I wound up lowering min and max memory as well as moving to RTM (this was almost a year ago).

    Also, there was a question from Gail concerning VM or physical. I had been assuming physical, but the point is valid. If this is a VM, it could be a balloon driver issue.

    About three years ago, I also ran into a similar issue with new machines. Eventually we reseated the memory (all banks) and the issue was resolved. Some of the memory had worked itself loose and it was causing erratic behavior like you describe. Worth a shot.

    Can you tweak your memory settings first and see if that eases the pain?
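
    The tweak itself is the standard sp_configure change, along these lines (the values below are placeholders, not a recommendation for this server):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'min server memory (MB)', 4096;    -- placeholder value
    EXEC sp_configure 'max server memory (MB)', 98304;   -- placeholder value
    RECONFIGURE;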

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • SQLRNNR (3/8/2013)


    Which Patch level of 2012 are you running?

    RC0

    RTM

    or SP1?

    No VM involved (mentioned earlier but easy to miss).

    Version is 2012 SP1. I will be installing CU2 this Saturday night.

    It's funny you mention the physical problem, I mentioned to our infrastructure guys that it almost looked like someone was sitting at the rack popping out RAM, giggling, and then sticking it back in.

    I've opened a premier support case on it, but I didn't want to 😛 I don't claim to be playing on the same field as Gail and a few other regulars here, but I have been working with SQL Server for a bit over 10 years and it's not often I get totally stumped like this.

  • Just an update for any curious observers:

    I've installed a hotfix because we were experiencing the msiexec issue ( http://support.microsoft.com/kb/2793634 ), so right now (after failover, install, reboot, failback) I have the ERP running on node 2 and training running on node 1 (i.e., the reverse of before). I'm going to leave it like this for at least a few days, probably the whole week, and see whether I instead see the PLE collapsing on training. If so, I'll feel more confident diagnosing it as a potential hardware fault on node 1. If the PLE collapses follow the instance, then that obviously narrows it down to something wrong in the SQL instance itself. If the issue goes away completely... <shrug>... it's amazing what reboots can do...

  • I only wish I would have found your thread earlier 😉

    http://www.sqlservercentral.com/Forums/Topic1415833-2799-1.aspx

    We are experiencing the exact same behavior and have had a Sev-B case opened with MS on this for over a month now (currently escalated to the product team). Feel free to PM me offline if you would like the SR number.

    We have run a dozen or more PSSDiag collections and have been dumping the ring buffer outputs for RING_BUFFER_RESOURCE_MONITOR, and they do not show any new entries around the time when PLE dropped. Moreover, during this timeframe the Page Reads/sec value does not fluctuate much, indicating that we did not have to read a bunch of new pages into memory. The query with the most reads around this time frame was at 32K.
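
    The raw dump is essentially just the DMV, something like this (the PSSDiag collection wraps more around it):

    SELECT [timestamp], CAST(record AS XML) AS record_xml
    FROM sys.dm_os_ring_buffers
    WHERE ring_buffer_type = N'RING_BUFFER_RESOURCE_MONITOR'
    ORDER BY [timestamp] DESC;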

  • Tommy Bollhofer (3/13/2013)


    I only wish I would have found your thread earlier 😉

    http://www.sqlservercentral.com/Forums/Topic1415833-2799-1.aspx

    We are experiencing the exact same behavior and have had a Sev-B case opened with MS on this for over a month now (currently escalated to the product team). Feel free to PM me offline if you would like the SR number.

    We have run a dozen or more PSSDiag collections and have been dumping the ring buffer outputs for RING_BUFFER_RESOURCE_MONITOR, and they do not show any new entries around the time when PLE dropped. Moreover, during this timeframe the Page Reads/sec value does not fluctuate much, indicating that we did not have to read a bunch of new pages into memory. The query with the most reads around this time frame was at 32K.

    FYI, output of ring buffer entries during a PLE drop on node 001 from 30K to 33

    http://sdrv.ms/15MmVAF

  • Yes, it's an odd one eh. I'll let you know if we get anything out of premier support.

    My last action was to fail over the instances due to the hotfix install. PLE seemed to be more stable at first, but I'm seeing a similar pattern to what I was seeing with the instance on the original node. Since the PLE crashes have followed the instance, it does not appear to be a hardware problem.

    I've set up an extended events session to see if I can pin down PLE crashes to specific queries. Unfortunately the EE GUI is awful. Nice in terms of UI functionality, but it routinely locks up SSMS for... well... I don't know how long, I terminate the program after 3 or 4 minutes. Awful awful awful.

    I also see batch completion events with an indicated duration in the hundreds of thousands, which I know for a fact is wrong, because the application sending the calls has a 30-second timeout.
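
    A session along these lines would do what's described above; the actual definition isn't posted, so the session name and read threshold here are illustrative only:

    CREATE EVENT SESSION [ple_suspects] ON SERVER
    ADD EVENT sqlserver.sql_batch_completed (
        ACTION (sqlserver.sql_text, sqlserver.session_id, sqlserver.database_id)
        WHERE logical_reads > 100000   -- arbitrary threshold for "expensive"
    )
    ADD TARGET package0.event_file (SET filename = N'ple_suspects')
    WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
    GO
    ALTER EVENT SESSION [ple_suspects] ON SERVER STATE = START;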

  • Hm, some more weirdness.

    SELECT database_id, COUNT_BIG(*) / 128 AS bufferMB
    FROM sys.dm_os_buffer_descriptors
    GROUP BY database_id
    ORDER BY bufferMB DESC;

    On my 128 gig server with 100 gig allocated to the SQL instance, the above query gives me the following results:

    database_id bufferMB

    ----------- --------------------

    7 20145

    8 1349

    6 92

    4 68

    2 63

    32767 19

    5 5

    1 3

    10 0

    3 0

    12 0

    15 0

    9 0

    11 0

    A few moments ago, database 7 had a value of just 10,000. Database 7 is the 500 gig ERP database. This is on an instance with a min server memory of 50,000 and a max of 100,000 (both in MB).

    So... why would SQL not consume all the memory allocated for its buffer? It's clear from the PLE values (and the timeouts reported by users) that it needs to be using the memory that we spent all that money on. But it... isn't...

    So yeah, it seems as though a DBCC DROPCLEANBUFFERS is happening randomly.
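
    A related cross-check (a sketch, not something already posted in this thread) is to compare what SQL Server wants to hold (Target Server Memory) with what it is actually holding (Total Server Memory); if Total sits well below Target, something is actively trimming the buffer pool rather than it simply not ramping up:

    SELECT counter_name, cntr_value / 1024 AS memory_mb
    FROM sys.dm_os_performance_counters
    WHERE [object_name] LIKE '%Memory Manager%'
      AND counter_name IN (N'Target Server Memory (KB)', N'Total Server Memory (KB)');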

  • You have been going round and round with this for over 2 weeks now. May I suggest that you get a performance tuning professional on board to really help you diagnose and correct the problem(s) you are encountering? Seems somewhat silly to hunt-and-peck on a forum for production problems that a good tuner could pinpoint and resolve in a matter of hours (or at worst a day or two)!!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
