My goal for last week was to get familiar with the HammerDB load generator and test-drive the TPC-C workload in my powerful Home Lab. This sounds like an easy task, because HammerDB automates all the necessary steps for you: you create the schema of the TPC-C database, HammerDB populates it with test data, and then you just run the actual workload against the generated database. Unfortunately, I hit a huge performance problem, which I want to describe in this blog posting, along with how I resolved it.
VM Configuration
In my case, I generated a TPC-C database with 1000 warehouses, which resulted in a database size of around 100 GB. The database was running on a standalone SQL Server 2017 instance in a VMware-based virtual machine with 16 vCPUs and 128 GB RAM. The underlying ESXi host is a 2-socket NUMA system, and therefore I also configured vNUMA within the virtual machine accordingly.
I distributed the 16 vCPUs across 2 sockets (Cores per Socket = 8) to match the underlying NUMA topology. ESXi therefore created a Wide-VM, where 8 vCPUs were taken from each physical NUMA node, and the 128 GB RAM was also evenly distributed across both physical NUMA nodes (64 GB from each NUMA node). The following picture shows the configuration of this VM.
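To double-check that SQL Server actually sees the two vNUMA nodes, you can query sys.dm_os_nodes. The following is only a minimal sketch (the filter on the DAC node is my own addition, not something taken from the original screenshots):

-- NUMA topology as seen by SQLOS (the DAC node is excluded)
SELECT
    node_id,
    node_state_desc,
    memory_node_id,
    online_scheduler_count,
    cpu_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> 'ONLINE DAC';

With the VM configuration described above, this should return two nodes with 8 online schedulers each.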
The Problem
Everything seemed fine so far. But as soon as I ran the TPC-C workload against SQL Server, I saw some strange behavior: the overall CPU utilization of SQL Server was somewhere between 50% and 60%, but one NUMA node within the virtual machine was always utilized at 100%! The following picture shows that behavior:
It seems that this workload triggered a NUMA Node Imbalance problem, where only one NUMA node handles the whole SQL Server workload while the other NUMA node just sits idle. It was a default configuration of SQL Server 2017, so I hadn't made any specific settings (like Processor Affinity, taking SQLOS schedulers offline, MAXDOP, or Resource Governor). When I checked the load_factor column of sys.dm_os_schedulers, I was also able to verify that SQL Server used only one NUMA node for this workload.
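The query I used was along the following lines – a sketch, not necessarily the exact statement behind the screenshot:

-- load_factor of all visible online schedulers, grouped by NUMA node
SELECT
    parent_node_id,         -- NUMA node the scheduler belongs to
    scheduler_id,
    cpu_id,
    current_tasks_count,
    runnable_tasks_count,
    load_factor
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'
ORDER BY parent_node_id, scheduler_id;

In my case, the schedulers of one NUMA node showed high load_factor values, while the schedulers of the other node stayed at or near zero.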
As you can see in the picture, the second NUMA node was utilized, and the first one was just idle. SQL Server implements a Round-Robin NUMA scheduler, where incoming connections are distributed in a Round-Robin fashion across the available NUMA nodes. You can also check the NUMA node affinity of a connection in sys.dm_exec_connections.
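A query along the following lines shows the NUMA node affinity of each connection (joining to sys.dm_exec_sessions for the program name is just my own convenience and not required):

-- NUMA node affinity of the current connections
SELECT
    c.session_id,
    c.connect_time,
    c.node_affinity,    -- NUMA node the connection is bound to
    s.program_name
FROM sys.dm_exec_connections c
INNER JOIN sys.dm_exec_sessions s
    ON c.session_id = s.session_id
ORDER BY c.node_affinity, c.connect_time;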
All the connections are affinitized to only one NUMA node. So why the heck is SQL Server doing it that way with this specific workload? To check whether the NUMA scheduler of SQL Server was working as expected, I opened 50 different connections in SQL Server Management Studio, and guess what: SQL Server distributed the connections as expected across the 2 NUMA nodes.
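If you just want to see the distribution at a glance, you can also aggregate the connections per NUMA node – again only a small sketch:

-- Number of connections per NUMA node
SELECT
    node_affinity,
    COUNT(*) AS connection_count
FROM sys.dm_exec_connections
GROUP BY node_affinity
ORDER BY node_affinity;

With the 50 test connections from Management Studio, both NUMA nodes showed roughly the same connection count; with the HammerDB workload, almost everything ended up on a single node.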
This also means that SQL Server itself is not the problem, and that the problem must lie in the workload – as usual.
The Solution
I spent a lot of time researching whether someone else had already hit this specific problem with the TPC-C workload, because the test case I was running was not that special. Even HammerDB states in its documentation that NUMA should work as expected with its software.
To make some progress, I finally asked Mr. SQL Server NUMA (you also have to check out his blog!) on Twitter whether he could help me troubleshoot that problem. We checked a lot of things together (the SQL Server configuration again, various Performance Counters), but the result was the same: SQL Server itself is not the problem. But then Lonny made an interesting statement:
i suspect (but haven;t proven yet) that every persistent user connection is followed by a short-lived connection. something like a user count, etc. That would result in all persistent connections on one NUMA node, and all short-lived on the other…
— L_ N___ (@sqL_handLe) September 20, 2019
As I said previously, SQL Server distributes the incoming connections in a Round-Robin fashion across the available NUMA nodes. Maybe HammerDB opens a connection for the workload, then opens another short-lived connection (maybe to check whether the previously opened connection was established), and then opens another connection for the next virtual user.
This would mean that all workload connections would land on one NUMA node, and all the short-lived connections would land on the other NUMA node. But because they are short-lived, they are closed immediately, and that NUMA node would not be utilized in any way. The following picture tries to visualize that idea:
We then talked with HammerDB on Twitter about that observation, and their initial response was as expected: HammerDB doesn't cause this problem, fix your SQL Server!
Nothing in the client to distribute the workload to a particular NUMA node – this is entirely on the SQL Server side. Similar seen with multiple server processor groups. One mitigation here was reduce the User delay in the VU options so the delay between logins is lower eg 5ms.
— HammerDB (@hammerdbresult) September 20, 2019
Then we talked with them in a little more detail about our observations, and they finally revoked their initial statement:
Ahhhh! – there is something already fixed in the new version being worked on – line 57 shouldn't be there – so try commenting
else {#database connect odbc $connection
if {!$azure} {odbc "use $database"}
Sorry! i think this crept when adding azure options
— HammerDB (@hammerdbresult) September 20, 2019
Now we had the verification: it's the workload, NOT SQL Server! In line 212 of the generated HammerDB script, an additional – and unneeded – connection to SQL Server is opened, which is immediately closed afterwards:
As soon as I removed that line and ran the workload again, the TPC-C workload was evenly distributed across my 2 NUMA nodes:
There are no short-lived connections anymore, and therefore the workload connections are now evenly distributed across the 2 NUMA nodes. Houston, we have solved the problem!
Summary
I had already heard a few times about NUMA Node Imbalance problems, but I had never seen one with my own eyes. Therefore, I was quite happy when I observed this problem, but it drove me crazy not to know the underlying root cause.
So please keep this information in mind if you run your SQL Server workload on a NUMA system. If your workload includes short-lived connections, you can end up with a NUMA Node Imbalance problem quite quickly. Identifying the problem is quite easy, but troubleshooting it is another thing. You have also seen that SQL Server doesn't really care about the load factor of a given NUMA node; it just distributes the incoming connections in a round-robin fashion. We can also discuss whether this is a good or bad approach…
I also want to thank Lonny for helping me resolve (and understand) that problem, and HammerDB for their fast response and help on Twitter.
Thanks for your time,
-Klaus