How NUMA configuration affects VM performance

 

In my last blog, we discussed Virtual Sockets, Virtual Cores, guest OS socket limitations, and how VMware addresses these limitations in a vSphere environment.

Let me ask the same question I posed in my last blog:

Question:
Below are the setup details:

ESX server configuration: 2 sockets * 2 cores per socket
VM1 configuration: 1 socket * 4 cores per socket
VM2 configuration: 4 sockets * 1 core per socket
Assumption: neither VM has any guest OS socket limitation.
Both VMs are running CPU- and memory-intensive workloads.

The question is: which VM will perform better, and why?

Answer: Assuming you all guessed correctly based on our earlier discussion, both VMs will perform equally. In other words, the number of sockets and the number of cores per socket allocated do not impact VM performance at all. There is no performance impact from using virtual sockets versus virtual cores.

Why VM performance is not affected by virtual socket or core allocation:

The VM remains unaffected because of the abstraction layer. Virtual sockets and virtual cores are logical entities defined by the VMkernel for the vCPU configuration at the VM level. When an operating system runs, the guest OS detects the hardware layout presented within the virtual machine, that is, the number of sockets and cores available at the guest OS level, and it schedules instructions accordingly. For example, in the case of a guest OS socket limitation, it will exercise more cores rather than more sockets.
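To see what the guest actually detects, here is a small illustrative sketch, assuming a Linux guest that exposes the usual "physical id" and "core id" fields in /proc/cpuinfo; it simply counts the sockets and cores the VM presents:

```python
# Illustrative sketch: count the sockets and cores a Linux guest OS sees
# by parsing /proc/cpuinfo (assumes the usual "physical id"/"core id" fields).

from collections import defaultdict

def guest_topology(path: str = "/proc/cpuinfo"):
    """Return (sockets, cores_per_socket) as presented to the guest OS."""
    cores = defaultdict(set)          # physical id -> set of core ids
    physical_id = None
    with open(path) as f:
        for line in f:
            if line.startswith("physical id"):
                physical_id = line.split(":")[1].strip()
            elif line.startswith("core id") and physical_id is not None:
                cores[physical_id].add(line.split(":")[1].strip())
    sockets = len(cores)
    cores_per_socket = len(next(iter(cores.values()))) if cores else 0
    return sockets, cores_per_socket

if __name__ == "__main__":
    print(guest_topology())           # e.g. (1, 4) for a 1-socket x 4-core VM
```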

As I said, the scope of virtual sockets and virtual cores is limited to the guest OS level. The VMkernel schedules a VMM process for every vCPU assigned to the virtual machine.
From the VMkernel's perspective, the vCPU count is simply cores per socket * number of sockets. In the scenario above, VM1 requires 1 * 4 = 4 vCPUs,
and VM2 requires 4 * 1 = 4 vCPUs.
In conclusion, from the VMkernel's perspective both VMs require the same number of vCPUs, regardless of the number of sockets or cores per socket allocated to the virtual machine.
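Purely as an illustration (this is not a VMware API, just the arithmetic spelled out), here is a minimal sketch showing that both layouts resolve to the same vCPU count from the VMkernel's point of view:

```python
# Minimal illustration: the VMkernel only sees sockets * cores_per_socket,
# so both VM configurations below schedule the same number of VMM worlds.

def total_vcpus(sockets: int, cores_per_socket: int) -> int:
    """Number of vCPUs (VMM worlds) the VMkernel schedules for a VM."""
    return sockets * cores_per_socket

vm1 = total_vcpus(sockets=1, cores_per_socket=4)   # 1 x 4 = 4 vCPUs
vm2 = total_vcpus(sockets=4, cores_per_socket=1)   # 4 x 1 = 4 vCPUs
print(vm1, vm2, vm1 == vm2)                        # 4 4 True
```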

The scope of virtual sockets and virtual cores is limited to the guest OS level. At the VMkernel level, the total number of sockets and cores is translated into a number of vCPUs, which the CPU scheduler maps onto physical CPUs.

Let’s explore the example of a 2-virtual-socket, 2-virtual-core configuration:

[Diagram: virtual socket and virtual core configuration as presented to the guest OS]

The light blue box shows the configuration the virtual machine presents to the guest OS. For each vCPU, the VMkernel schedules a VMM world. When a CPU instruction leaves the virtual machine, it is picked up by the corresponding vCPU VMM world. The socket configuration is transparent to the VMkernel.

There is another twist in the story: if your VM is configured with more than 8 vCPUs, the number of virtual sockets does impact virtual machine performance, because vNUMA gets activated.

Again: in vSphere 5.0, vNUMA is enabled by default on VMs with more than 8 vCPUs. When vNUMA is active, the VMkernel presents the physical NUMA topology (NUMA clients and NUMA nodes) directly to the guest OS so it can make better scheduling decisions.

In such vNUMA scenarios, virtual machine performance depends directly on the number of sockets presented to the guest OS, because virtual NUMA nodes are created on the basis of the number of sockets populated to the operating system. The socket layout therefore shapes the NUMA topology the guest OS can schedule against, and a layout that reflects the physical NUMA nodes gives the guest better placement decisions.
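As a rough sketch of this rule, assuming the vSphere 5.0 defaults described above (the helper functions below are illustrative, not actual VMkernel code):

```python
# Illustrative sketch of the vSphere 5.0 default described above:
# vNUMA is exposed only when the VM has more than 8 vCPUs, and the
# virtual NUMA layout presented to the guest follows the socket count.

def vnuma_enabled(vcpus: int, threshold: int = 8) -> bool:
    """vNUMA is activated by default for VMs with more than `threshold` vCPUs."""
    return vcpus > threshold

def virtual_numa_nodes(sockets: int, cores_per_socket: int) -> int:
    """Virtual NUMA nodes the guest sees (one per virtual socket when vNUMA is on)."""
    vcpus = sockets * cores_per_socket
    return sockets if vnuma_enabled(vcpus) else 1

print(virtual_numa_nodes(sockets=2, cores_per_socket=8))   # 16 vCPUs -> 2 vNUMA nodes
print(virtual_numa_nodes(sockets=1, cores_per_socket=16))  # 16 vCPUs -> 1 vNUMA node
print(virtual_numa_nodes(sockets=4, cores_per_socket=1))   # 4 vCPUs  -> vNUMA off, 1 node
```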

Let’s deep-dive into NUMA architecture concepts.

 

WHAT IS NUMA?

Definition from Wikipedia:
“Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.”

NUMA architecture is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system.

“Ignorance of NUMA can result in application performance issues.”

Background Of NUMA Architecture:

 

UMA (Uniform Memory Access)

Perhaps the best way to understand NUMA is to compare it with its cousin UMA, or Uniform Memory Access. In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect) as seen in the following diagram:

[Diagram: UMA architecture with all processors sharing one bus to memory]

UMA gets its name from the fact that each processor must use the same shared bus to access memory, resulting in a memory access time that is uniform across all processors. Note that access time is also independent of data location within memory. That is, access time remains the same regardless of which shared memory module contains the data to be retrieved.

 

NUMA (Non-Uniform Memory Access)

In the NUMA shared memory architecture, each processor has its own local memory module that it can access directly with a distinctive performance advantage. At the same time, it can also access any memory module belonging to another processor using a shared bus (or some other type of interconnect) as seen in the diagram below:

[Diagram: NUMA architecture with per-processor local memory and an interconnect for remote access]

Why NUMA is better than UMA

In NUMA, as the name implies, non-uniform memory access means the access time varies with the location of the data being accessed.
If data resides in local memory, access is fast.
If data resides in remote memory, access is slower.
The advantage of the NUMA architecture, as a hierarchical shared memory scheme, is its potential to improve average-case access time through the introduction of fast local memory.
[Diagram: local vs. remote memory access in NUMA]
In conclusion, NUMA stands for Non-Uniform Memory Access, which translates into a variance of memory access latencies. Both AMD Opteron and Intel Nehalem are NUMA architectures. A processor and its memory form a NUMA node. Access to memory within the same NUMA node is considered local access; access to memory belonging to another NUMA node is considered remote access.
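To make "a variance of memory access latencies" concrete, here is a toy model; the latency numbers are invented purely for illustration and will differ on real hardware:

```python
# Toy NUMA latency model (numbers are illustrative, not measurements).

LOCAL_NS = 80     # hypothetical latency for local-node access
REMOTE_NS = 140   # hypothetical latency for remote-node access

def average_latency_ns(local_fraction: float) -> float:
    """Average access latency when `local_fraction` of accesses hit local memory."""
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS

print(average_latency_ns(1.0))    # 80.0  -> everything local
print(average_latency_ns(0.5))    # 110.0 -> half the accesses go remote
print(average_latency_ns(0.25))   # 125.0 -> mostly remote
```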

How NUMA nodes get created

NUMA node creation is based on the number of sockets, and the memory for each NUMA node is calculated by dividing the total system memory by the number of NUMA nodes.
Suppose a physical system is configured with 4 sockets * 4 cores per socket and 12 GB of total memory.

In this case, total NUMA nodes created = 4
Memory allocated to each NUMA node = 12/4 = 3 GB
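A minimal sketch of that arithmetic, using the numbers from the example:

```python
# Sketch of the NUMA node arithmetic above: one node per socket,
# total memory divided evenly across nodes.

def numa_layout(sockets: int, total_memory_gb: float):
    """Return (numa_nodes, memory_per_node_gb) for a physical host."""
    nodes = sockets
    return nodes, total_memory_gb / nodes

nodes, mem_per_node = numa_layout(sockets=4, total_memory_gb=12)
print(nodes, mem_per_node)   # 4 nodes, 3.0 GB each
```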

Case Study 1: OS is not NUMA aware

A physical system is configured with 4 sockets * 4 cores per socket and 12 GB of memory.
A multi-threaded SQL application, along with some general-purpose applications, runs on the OS installed on this system.
Since the OS is not NUMA aware, in the worst case of CPU allocation the multiple threads (4 threads) of the SQL application can be scheduled on 4 different cores in 4 different NUMA nodes. In that case, a lot of data will be accessed through remote memory over the interconnect link, which increases memory latency and reduces overall application performance.
Refer to the diagram below:
[Diagram: non-NUMA-aware placement with threads spread across NUMA nodes accessing remote memory]
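Using the same kind of toy latency numbers as before (invented for illustration), here is a rough estimate of the penalty in this worst-case placement, where the application's data lives on one node but three of the four threads run on other nodes:

```python
# Rough estimate of the worst-case placement penalty (illustrative numbers).
# The SQL data lives on one NUMA node; 3 of the 4 threads run on other nodes.

LOCAL_NS = 80     # hypothetical local access latency
REMOTE_NS = 140   # hypothetical remote access latency

threads = 4
remote_threads = 3                     # scheduled on nodes that don't own the data
local_fraction = (threads - remote_threads) / threads

avg_latency = local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS
print(avg_latency)                     # 125.0 ns vs. 80 ns if everything were local
print(avg_latency / LOCAL_NS)          # ~1.56x slower memory access on average
```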

Case Study 2: OS is NUMA aware

Now, since the OS is NUMA aware and has a complete view of the physical system's NUMA nodes, it will try its best to schedule multiple threads of the same application within a single NUMA node, avoiding remote memory access and using that node's local memory as much as possible for better performance.
In this example, all 4 threads of the SQL application will be scheduled on 4 cores of a single NUMA node, as decided by the OS's NUMA-aware CPU scheduler. Since all threads access the local memory assigned to that NUMA node, no data is fetched over remote memory, which improves overall application performance.

Refer to the diagram below:

[Diagram: NUMA-aware placement with all threads on a single NUMA node accessing local memory]
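On a Linux guest you can approximate part of what a NUMA-aware scheduler does by pinning a process to the cores of one node yourself. The sketch below assumes a hypothetical layout where NUMA node 0 owns cores 0 to 3; the real mapping can be read from /sys/devices/system/node/ or numactl --hardware. Note that CPU affinity only covers the scheduling side; binding memory to the node would additionally need numactl --membind or libnuma.

```python
# Illustrative sketch (Linux only): pin the current process to the cores of
# a single NUMA node, roughly what a NUMA-aware scheduler does automatically.
# Assumes a hypothetical layout where node 0 owns cores 0-3.

import os

NODE0_CORES = {0, 1, 2, 3}             # hypothetical: cores belonging to NUMA node 0

os.sched_setaffinity(0, NODE0_CORES)   # restrict this process to node 0's cores
print(os.sched_getaffinity(0))         # confirm the new CPU affinity mask
```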

That is why NUMA plays a very important role and can seriously influence the performance of memory-intensive workloads.

I hope this article helps you understand the basics of NUMA architecture and how NUMA influences workload performance.

In my upcoming articles, I will cover a few more details about NUMA with respect to the ESXi environment, such as:

How does the ESXi NUMA scheduler work? How is pNUMA different from vNUMA? How does vCPU sizing impact the NUMA scheduler in an ESXi environment? How do you read NUMA stats using the esxtop command?

Please feel free to post your queries if you have any; I would be happy to answer them. And please don't forget to leave comments or feedback about this article.

BE SOCIABLE, KEEP SHARING, KEEP LEARNING!!!

I am a VMware Solution Architect with 10+ years of enriching experience in datacenter virtualization technologies, storage area networks, software-defined datacenter, networking, and storage.
I hold numerous certifications including RHCE, CCNA, VCP4.0, VCP5.1, VCP5.5, vCloud, and EMC certifications.
While spending countless hours exploring the product inside and out and learning everything about it, I eventually discovered my passion for teaching and helping others learn from my knowledge and experience, so I turned to being a trainer and blogger to educate everyone keen to learn virtualization.
