Checkpoints are essential in SQL Server to help with the durability and reliability of data persisted in the database. When done right, you barely even notice the checkpoint and performance is groovy. Sadly, checkpoints can be a symptom of a problem and poor performance.
Checkpointing in SQL Server can have an impact on performance. Most of the time, DBAs don’t bother to think about checkpoints or their behavior. This isn’t necessarily a bad thing. It’s just a fact of life.
Much like the DBA, most of the time, checkpoints have no reason to be in the foreground of thought. This just simply means that things are running smoothly and we can occupy ourselves with other tasks.
When odd behaviors start to pop up such as the dreaded checkpoint io storm or the Flushcache (here or here) error, you need to have some tools to help you try and troubleshoot the issue. This article will help you add at least one more tool to your SuperStar DBA Toolbox.
Let’s start with a bit of background on what a checkpoint is.
Checkpoint Primer
A checkpoint is basically the last known good mark (or point) from which SQL Server can start applying changes registered in the transaction log during recovery after an unexpected shutdown or crash.
What is it that the checkpoint does? The checkpoint process is the means by which the dirty pages (modified data pages in memory) are written and persisted to disk. This process also takes the transaction log information and persists that to disk, while also adding a note to the transaction log of what was done. Use of the checkpoint and writing to disk in this fashion is done in favor of performance since writes to disk would be far more costly if every transaction went straight to disk.
Types of Checkpoints
To help with the various needs for checkpointing, there are four types of checkpoints in SQL Server. These types are:
- Automatic – issued automatically in the background to meet the upper bound of the “recovery interval” server configuration. By default this is roughly every minute for active databases. This can be throttled if there are io issues are too many pending writes.
- Indirect – provide an user configurable alternative for each database over the automatic checkpoint option. This typically provides faster more predictable recovery times.
- Manual – issued by user for the current database connection. (Does require elevated permissions such as dbo, sysadmin, or backupoperator role membership.)
- Internal – performed during such operations as backups or snapshots to ensure consistency and recoverability.
Of these four types, we will be primarily focused on the first two types of checkpoints. Both of these are technically automatic background checkpoints, but with some potential performance impacts we would want to occasionally monitor (e.g. io storm and FlushCache issues).
Tracking and Monitoring
In order to capture checkpoint events, and begin to understand their impact on the system, we need to use Extended Events. To create such a session to capture the necessary events is rather easy.
USE master; GO -- Create the Event Session IF EXISTS ( SELECT * FROM sys.server_event_sessions WHERE name = 'MonitorCheckPointBehavior' ) DROP EVENT SESSION MonitorCheckPointBehavior ON SERVER; GO EXECUTE xp_create_subdir 'C:DatabaseXE'; GO CREATE EVENT SESSION [MonitorCheckPointBehavior] ON SERVER ADD EVENT sqlserver.checkpoint_begin (ACTION ( package0.event_sequence , sqlos.task_address , sqlserver.client_app_name , sqlserver.client_pid , sqlserver.nt_username , sqlserver.server_principal_name , sqlserver.database_id , sqlserver.session_id , sqlserver.database_name , sqlserver.client_hostname , sqlserver.sql_text , sqlserver.username , sqlserver.session_nt_username , sqlserver.context_info , sqlserver.client_connection_id --, sqlos.task_time --, sqlos.task_elapsed_quantum ) ) , ADD EVENT sqlserver.checkpoint_end (ACTION ( package0.event_sequence , sqlos.task_address , sqlserver.client_app_name , sqlserver.client_pid , sqlserver.nt_username , sqlserver.server_principal_name , sqlserver.database_id , sqlserver.session_id , sqlserver.database_name , sqlserver.client_hostname , sqlserver.sql_text , sqlserver.username , sqlserver.session_nt_username , sqlserver.context_info , sqlserver.client_connection_id --, sqlos.task_time --, sqlos.task_elapsed_quantum ) ) , ADD EVENT sqlserver.recovery_indirect_checkpoint (ACTION ( package0.event_sequence , sqlos.task_address , sqlserver.client_app_name , sqlserver.client_pid , sqlserver.nt_username , sqlserver.server_principal_name , sqlserver.database_id , sqlserver.session_id , sqlserver.database_name , sqlserver.client_hostname , sqlserver.sql_text , sqlserver.username , sqlserver.session_nt_username , sqlserver.context_info , sqlserver.client_connection_id --, sqlos.task_time --, sqlos.task_elapsed_quantum ) ), ADD EVENT sqlserver.error_reported (ACTION ( package0.event_sequence, sqlserver.tsql_stack, sqlserver.sql_text, package0.last_error, sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.database_id, sqlserver.database_name, sqlserver.is_system, sqlserver.session_id ) WHERE ( [message] LIKE '%FlushCache%' OR [message] LIKE '%average writes%' OR [message] LIKE '%last target%' ) ), ADD EVENT sqlserver.errorlog_written (ACTION ( package0.event_sequence, sqlserver.tsql_stack, sqlserver.sql_text, package0.last_error, sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.database_id, sqlserver.database_name, sqlserver.is_system, sqlserver.session_id ) WHERE ( [message] LIKE '%FlushCache%' OR [message] LIKE '%average writes%' OR [message] LIKE '%last target%' ) ) ADD TARGET package0.event_file (SET filename = N'C:DatabaseXEMonitorCheckPointBehavior.xel') WITH ( MAX_MEMORY = 4096KB , EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS , MAX_DISPATCH_LATENCY = 30 SECONDS , MAX_EVENT_SIZE = 0KB , MEMORY_PARTITION_MODE = NONE , TRACK_CAUSALITY = ON , STARTUP_STATE = ON ); ALTER EVENT SESSION MonitorCheckPointBehavior ON SERVER STATE = START; GO
Checkpoint Tracking Results
When running this session, I can expect to be able to track checkpoint run times as well as extra data related to indirect checkpoints. Depending on the version of SQL Server, your results may vary slightly due to what is delivered in the event payloads. In the following images, I will show some of the interesting payload and differences between SQL 2017 and SQL 2019.
And now for SQL 2019.
The is_trivial and last_oldest_page_lsn attributes in the event payload are highlighted in green. These are new in SQL 2019! The last_oldest_oage_lsn is an interesting tidbit of information as it relates to the oldest_page status in the log_reuse_wait_desc column that can indicate a growing log file. There will be more on that in a future article.
Checkpoint Monitor Session Parsing
If you don’t have much desire to try to figure out the total time for each checkpoint by manual calculations from the “View Target Data” output grid gui, here is a script to do that work for you!
/* parse event data */IF OBJECT_ID('tempdb..#xmlprocess') IS NOT NULL BEGIN DROP TABLE #xmlprocess; END IF OBJECT_ID('tempdb..#checkpointanal') IS NOT NULL BEGIN DROP TABLE #checkpointanal; END IF OBJECT_ID('tempdb..#checkpointagg') IS NOT NULL BEGIN DROP TABLE #checkpointagg; END SELECT CAST ([t2].[event_data] AS XML) AS event_data, t2.file_offset,t2.file_name, cte1.event_session_id INTO #xmlprocess FROM ( SELECT REPLACE(CONVERT(NVARCHAR(128),sesf.value),'.xel','*.xel') AS targetvalue, ses.event_session_id FROM sys.server_event_sessions ses INNER JOIN sys.server_event_session_fields sesf ON ses.event_session_id = sesf.event_session_id WHERE sesf.name = 'filename' AND ses.name = 'MonitorCheckPointBehavior' ) cte1 OUTER APPLY sys.fn_xe_file_target_read_file(cte1.targetvalue,NULL, NULL, NULL) t2 ; SELECT event_data.value('(event/@name)[1]', 'varchar(50)') AS event_name , event_data.value('(event/@timestamp)[1]', 'varchar(50)') AS [TIMESTAMP] ,event_data.value('(event/data[@name="database_id"]/value)[1]','bigint') AS database_id ,event_data.value('(event/action[@name="client_app_name"]/value)[1]', 'varchar(max)') AS client_app_name ,event_data.value('(event/action[@name="client_connection_id"]/value)[1]', 'uniqueidentifier') AS client_connection_id ,event_data.value('(event/action[@name="client_hostname"]/value)[1]', 'varchar(max)') AS client_hostname ,event_data.value('(event/action[@name="client_pid"]/value)[1]', 'int') AS client_pid ,event_data.value('(event/action[@name="context_info"]/value)[1]', 'varbinary(max)') AS context_info ,event_data.value('(event/action[@name="database_name"]/value)[1]', 'varchar(max)') AS database_name ,event_data.value('(event/action[@name="event_sequence"]/value)[1]', 'int') AS event_sequence ,event_data.value('(event/action[@name="nt_username"]/value)[1]', 'varchar(max)') AS nt_username ,event_data.value('(event/action[@name="server_principal_name"]/value)[1]', 'varchar(max)') AS server_principal_name ,event_data.value('(event/action[@name="session_id"]/value)[1]', 'int') AS session_id ,event_data.value('(event/action[@name="session_nt_username"]/value)[1]', 'varchar(max)') AS session_nt_username ,event_data.value('(event/action[@name="sql_text"]/value)[1]', 'varchar(max)') AS sql_text ,event_data.value('(event/action[@name="task_address"]/value)[1]', 'varchar(max)') AS task_address ,event_data.value('(event/action[@name="username"]/value)[1]', 'varchar(max)') AS username ,event_data.value('(event/action[@name="attach_activity_id"]/value)[1]', 'varchar(512)') AS ActivityGUID , -1 as ActivityID --placeholder to find the activity_id_sequence value ,x.file_offset INTO #checkpointanal FROM #xmlprocess x ORDER BY TIMESTAMP , event_data.value('(event/action[@name="event_sequence"]/value)[1]', 'varchar(max)'); UPDATE cpa SET ActivityID = REVERSE(LEFT(REVERSE(cpa.ActivityGUID), charindex('-',REVERSE(cpa.ActivityGUID))-1)) FROM #checkpointanal cpa; SELECT cpa.database_name as DBName , cpa.TIMESTAMP AS ChkptBegin , cpaend.TIMESTAMP AS ChkptEnd , DATEDIFF(ms, cpa.TIMESTAMP, cpaend.TIMESTAMP) AS ChkptDuration_ms , cpa.event_sequence as Begin_event_sequence , cpaend.event_sequence as End_event_sequence , cpa.database_id , cpa.client_app_name , cpa.client_connection_id , cpa.client_hostname , cpa.client_pid , cpa.context_info , cpa.event_sequence , cpa.nt_username , cpa.server_principal_name , cpa.session_id , cpa.session_nt_username , cpa.sql_text , cpa.task_address , cpa.username INTO #checkptagg FROM #checkpointanal cpa INNER JOIN #checkpointanal cpaend ON cpa.event_sequence = cpaend.event_sequence - 1 AND cpa.ActivityID = cpaend.ActivityID - 1 AND cpa.database_id = cpaend.database_id AND cpa.file_offset = cpaend.file_offset AND cpaend.TIMESTAMP > cpa.TIMESTAMP AND cpa.event_name = 'checkpoint_begin' AND cpaend.event_name = 'checkpoint_end' WHERE cpa.event_name = 'checkpoint_begin' ORDER BY ChkptBegin, cpa.event_sequence; SELECT cpa.DBName , AVG(cpa.ChkptDuration_ms) as AvgDuration_ms , MAX(cpa.ChkptDuration_ms) as MaxDuration_ms , MIN(cpa.ChkptDuration_ms) as MinDuration_ms FROM #checkptagg cpa Where cpa.ChkptBegin <= '2021-01-12T21:21:49.750Z' --UTC timestamp when indirect checkpoints enabled Group BY cpa.DBName ORDER BY AvgDuration_ms DESC, cpa.DBName; SELECT cpa.DBName , AVG(cpa.ChkptDuration_ms) as AvgDuration_ms , MAX(cpa.ChkptDuration_ms) as MaxDuration_ms , MIN(cpa.ChkptDuration_ms) as MinDuration_ms FROM #checkptagg cpa Where cpa.ChkptBegin > '2021-01-12T21:21:49.750Z' --UTC timestamp when indirect checkpoints enabled Group BY cpa.DBName ORDER BY AvgDuration_ms DESC, cpa.DBName; SELECT cpa.* FROM #checkptagg cpa ORDER BY cpa.ChkptBegin, cpa.Begin_event_sequence; DROP TABLE #xmlprocess; GO DROP TABLE #checkpointanal; GO DROP TABLE #checkptagg; GO
Using this script, I have been enabled to easily find checkpoint hotspots and problem databases. One recent client was experiencing 13 second plus checkpoints. Through the use of this session and parse script, I could identify the problems and then use that to leverage the need to change to indirect checkpoints.
While still using this session, I could then show the client that the indirect checkpoint improved checkpoint performance down to 20ms or faster. That is a huge win, especially when considering the checkpoints were triggering every 60-70 seconds. OUCH!
Put a Bow on It
In this article, once again have demonstrated the power of Extended Events. And no, you cannot do this with that tool called Profiler. You need the power of XEvents in order to reveal the issues related to the database checkpoint process.
Not only will you be able to see the performance issues, but you will also be able to justify changes that will resolve those problems. Furthermore, you will then be able to demonstrate that the change was the correct thing to do!
For more uses of Extended Events, I recommend my series of articles designed to help you learn XE little by little.
Interested in seeing the power of XE over Profiler? Check this one out!
This is twelfth post in the 2021 “12 Days of Christmas” series. All articles in the series (for all years) is available via this page.
The post Checkpoint Analysis: Your Exclusive How-To with XEvents first appeared on SQL RNNR.