Recently, we received a call from a customer who was experiencing timeouts on multiple SQL Server instances. We followed the standard troubleshooting steps that any DBA would take, but couldn't identify an obvious cause. However, there were some clues:
- Page Latches had increased significantly and were now the primary wait stat replacing CXPACKET. No evidence of tempdb contention and the Azure-based disks showed no evidence of I/O bottlenecks or throttling.
- Query store confirmed the SQL Plan hadn't changed but the average duration had increased significantly.
- Average CPU utilisation on the servers had increased despite no increase in traffic. Overall CPU utilisation was less than 50% and there was no evidence of CPU contention.
- Capturing a profile with parameters and rerunning the profile showed inconsistent performance, even when the query plan hadn't changed, ruling out parameter sniffing.
These servers all run a single stored procedure so the performance of this stored procedure is very well understood, and the code hadn't been changed, so it was clear to us that something outside of SQL was impacting the performance.
Root Cause
To find the root cause, we conducted a configuration comparison of the servers against an Aireforge configuration snapshot to check for any changes. We immediately noticed that an extended event session called SQLAdvancedThreatProtectionTraffic
had recently been deployed. This session is deployed by default with Azure Defender for SQL Server. More info: https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-sql-introduction#advanced-threat-protection
CREATE EVENT SESSION [SQLAdvancedThreatProtectionTraffic] ON SERVER ADD EVENT sqlserver.database_stopped, ADD EVENT sqlserver.distinct_query_completed( ACTION(sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.client_pid,sqlserver.database_name,sqlserver.num_response_rows,sqlserver.session_id,sqlserver.tsql_stack,sqlserver.username) WHERE ([database_id]>=(0) AND [database_id]<=(2147483647) AND [peer_address]<>'<internal>')), ADD EVENT sqlserver.private_login_finish( ACTION(sqlserver.database_id,sqlserver.server_instance_name) WHERE ([peer_address]<>'<internal>' AND ([error]<>(17830) OR [state]<>(11)))) WITH (MAX_MEMORY=1024 KB,EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=10 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF) GO
The issue with this extended event session is that it captures every single SQL query running on the server. This usually wouldn't be a problem because extended events are by design, lightweight and asynchronous. However, for a high transaction system, the process of capturing every SQL query and sending this traffic to Azure overwhelmed the servers.
What's worse, is that extended events don't show up in query plans or process explorer as consuming any resources, which led us initially to the wrong line of investigation reviewing query plans and the parameters.
Microsoft's Estimations
According to Microsoft, Azure Defender for SQL averaged 3% of CPU usage for peak slices. In our specific case, this figure was closer to 12%. More info: https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-sql-usage#is-there-a-performance-effect-from-deploying-microsoft-defender-for-azure-sql-on-machines
This increase was not insignificant, but even more concerning was the increase in waits...
After disabling Defender and stopping to extended event session, the total waits in SQL server instantly dropped from 1.12 s/s to 0.22 s/s. This significant decrease brought down the average execution time of the stored procedure from several seconds to just 5 ms.
Summary
The takeaway from this blog shouldn't be to turn off antivirus entirely, instead, it should emphasise the importance of testing the impact of changes on your system before implementing them.
If you choose to go with third-party tools, Microsoft has a detailed and comprehensive list of exclusions to configure to prevent antivirus from affecting SQL Server: https://support.microsoft.com/en-us/topic/how-to-choose-antivirus-software-to-run-on-computers-that-are-running-sql-server-feda079b-3e24-186b-945a-3051f6f3a95b
In summary, a poorly configured antivirus can really impact the performance of SQL server, even if it's built by Microsoft and running on Microsoft hardware. It is crucial to carefully review your antivirus configuration and remain vigilant for any unexplained performance degradation.