Tuesday 24 May 2011

NUMA - Non-Uniform Memory Access

Over the last two days I have spent quite a lot of time reading about topics that are completely new to me - NUMA, Transparent Page Sharing, TLB, Large Pages, and Page Tables - and I hope to make several posts this week on each of these technologies and how they work together. I have also done some tests playing around with Large Pages in ESXi and would like to share that information as well.
I want to start with an explanation of what NUMA is and to what extent ESX is NUMA-aware. There is a huge amount of information about NUMA on the Internet and you are probably already familiar with it, so please let me know if I have made any mistakes in this post.

NUMA stands for Non-Uniform Memory Access and has nothing to do with the Romanian band and its song "Numa, Numa, yeah". It is currently present in Intel Nehalem and AMD Opteron processors. I always assumed that all CPUs share memory equally; however, this is not the case with NUMA.
Old way of sharing memory

Each CPU together with an equal share of the RAM is defined as a NUMA node. If you have a server with 4 CPUs and 64 GB of RAM, you will end up with 4 NUMA nodes, each with 16 GB of RAM.

NUMA Architecture
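For reference, on a plain Linux system (not ESXi itself) this node layout can be queried programmatically with libnuma. This is just a minimal sketch, assuming libnuma is installed and the program is linked with -lnuma; on the 4-CPU / 64 GB example above it should report 4 nodes of roughly 16 GB each.

/* Minimal sketch (not ESXi code): list NUMA nodes and per-node memory with libnuma.
 * Assumes libnuma is installed; compile with: gcc numa_nodes.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("this system is not NUMA-aware\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    printf("NUMA nodes: %d\n", nodes);

    /* on the 4-CPU / 64 GB example this should show 4 nodes of ~16 GB each */
    for (int node = 0; node < nodes; node++) {
        long long free_bytes = 0;
        long long size_bytes = numa_node_size64(node, &free_bytes);
        printf("node %d: %lld MB total, %lld MB free\n",
               node, size_bytes >> 20, free_bytes >> 20);
    }
    return 0;
}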


As I understand it, this approach was introduced because modern servers have so many CPUs, cores, and so much RAM that it is neither technologically nor economically efficient to provide one shared bus for all of them. A shared bus leads to bus contention, because the CPUs compete for access to memory, and it limits the server's scalability. With the new architecture, each CPU has local and remote memory. While local memory is accessed directly, remote memory is reached over the interconnect between nodes, which adds latency to memory access.
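To make the local vs. remote difference concrete, here is a small Linux/libnuma sketch (again, an illustration on a Linux box, not ESXi code): numa_alloc_local() places memory on the node the thread is running on, while numa_alloc_onnode() can pin it to another node, which is exactly the access path that pays the extra interconnect latency.

/* Illustration only: contrast local and remote allocation with libnuma.
 * Compile with: gcc local_vs_remote.c -lnuma */
#include <string.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;

    size_t len = 64 * 1024 * 1024;       /* 64 MB test buffers */
    int far_node = numa_max_node();      /* highest node id; 0 on a single-node box */

    char *local  = numa_alloc_local(len);            /* placed on the node we run on */
    char *remote = numa_alloc_onnode(len, far_node); /* pinned to a chosen node */

    if (local && remote) {
        memset(local, 0, len);   /* local access: direct, lower latency */
        memset(remote, 0, len);  /* remote access: crosses the interconnect
                                    (unless far_node happens to be our home node) */
    }

    numa_free(local, len);
    numa_free(remote, len);
    return 0;
}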
This means that an OS running on such hardware should be NUMA-aware and manage its processes and applications wisely; otherwise it risks ending up in a situation where it schedules an application on a CPU in one NUMA node while the application's memory resides in another NUMA node. I just checked: NUMA has been supported since Windows 2003 and by ESX at least since ESX Server 2.
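What a NUMA-aware OS does automatically can be expressed manually with libnuma as a sketch: restrict the process to one node and prefer allocations from that same node, so CPU and memory stay together. This is only an illustration of the idea, not how ESXi or Windows implement it.

/* Sketch of the idea a NUMA-aware OS implements automatically:
 * keep a process and its memory on the same node (libnuma, Linux).
 * Compile with: gcc pin_to_node.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;

    int node = 0;               /* home node chosen for this process */
    numa_run_on_node(node);     /* only run on CPUs belonging to node 0 */
    numa_set_preferred(node);   /* prefer node 0 for future allocations */

    /* From here on, memory we allocate and touch tends to land on node 0,
     * so the CPU and the memory it accesses stay in the same NUMA node. */
    printf("running and allocating on NUMA node %d\n", node);
    return 0;
}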


I don't know whether it is possible to check NUMA stats from the GUI, but you can definitely check them in esxtop.


Since we have a 2-socket server, we can see 2 NUMA nodes here and the memory size per node. In one of the articles I read that the value in brackets is supposed to be free memory; however, on our ESXi 4.1 server I saw that this value was higher than the actual RAM size of the NUMA node, so I still need to check that.

As soon as the ESXi server detects that it is running on NUMA hardware, it starts the NUMA scheduler, which takes care of VM placement and makes sure that each VM's vCPUs are scheduled only within the same NUMA node. If you use a pre-4.1 version of ESX, pay attention to the number of vCPUs your VM runs - it shouldn't be higher than the number of cores in one physical CPU. Otherwise, the NUMA scheduler will ignore this VM and its vCPUs will be scheduled in the old-fashioned round-robin way across all NUMA nodes, thus increasing latency and decreasing the overall performance of your VM. If you run ESX 4.1 there is a new feature called wide VM, which lets you create a VM with more vCPUs than there are cores in a physical CPU. The NUMA scheduler breaks a wide VM into several NUMA clients, and each NUMA client is then treated as a normal VM.

The NUMA scheduler not only makes the decision about the initial placement of a VM on a NUMA node, it also constantly monitors the amount of local memory used by the VM, and if it drops below a threshold (there is unconfirmed information that it is set to 80%), it will move the VM to another NUMA node to increase the amount of local memory. It will also migrate a VM to another NUMA node if there is a lack of free memory on the VM's home NUMA node. Moreover, during a migration from one NUMA node to another, ESXi controls the memory migration rate so as not to congest the system. In some cases, when ESXi knows the VM was moved to a new NUMA node only for a short period of time, it will even skip migrating the VM's memory.
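To illustrate the locality check described above, here is a purely hypothetical sketch; the real ESXi scheduler logic is not public and, as noted, the 80% threshold is unconfirmed.

/* Purely hypothetical sketch of the locality check; the real ESXi logic is not
 * public and the 80% threshold mentioned above is unconfirmed. */
#include <stdio.h>
#include <stdbool.h>

/* would this VM be a candidate for rebalancing to another NUMA node? */
static bool below_locality_threshold(double local_mb, double total_mb,
                                     double threshold_pct)
{
    return (100.0 * local_mb / total_mb) < threshold_pct;
}

int main(void)
{
    /* example VM with 70% of its memory local, like the one in the esxtop
     * screenshot further down */
    if (below_locality_threshold(716.8, 1024.0, 80.0))
        printf("below threshold: the scheduler may consider a migration\n");
    else
        printf("locality OK: keep the current home node\n");
    return 0;
}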

It is also worth mentioning that you have to be careful when installing memory modules into the motherboard. The DIMMs have to be evenly distributed between the CPUs, because it is not the ESXi NUMA scheduler that distributes memory between NUMA nodes but the physical architecture of the server and the DIMM layout. So you'd better check your server documentation.

You might also want to check more NUMA stats in esxtop that are very useful for performance troubleshooting.



Here is a description of the values in these columns:
NHN - NUMA home node number
NMIG - Number of migrations between NUMA nodes
NRMEM - Amount of remote memory used by the VM (MB)
NLMEM - Amount of local memory used by the VM (MB)
N%L - Percentage of the VM's memory that is local
GST_ND(X) - Guest memory allocated for the VM on node X
OVD_ND(X) - Overhead memory allocated for the VM on node X

Here is the interesting thing in this screenshot - as you can see, the first line shows a VM with only 70% of its memory local. I kept monitoring it and a migration of this VM to another NUMA node never started. That's why I said that the actual threshold for migrating a VM to a new NUMA node is not confirmed.

Writing this article, I just tried to clear up the mess of new information in my head, so I tried not to copy-paste everything available about NUMA, but only the essential information that helps me structure my knowledge.


