NUMA :-
In a Symmetric Multiprocessor (SMP) machine, all CPUs access the shared memory at the same speed.
Non-Uniform Memory Access (NUMA) architectures instead group CPUs and memory into nodes: each CPU accesses the memory local to its own node faster than memory attached to another node.
Within a node, cores typically share a last-level (L3) cache and the node's local memory. When running workloads on NUMA hosts, the vCPUs executing processes should be on the same NUMA node as the memory used by those processes. This keeps memory accesses local to the node, avoiding the limited cross-node interconnect bandwidth and the extra latency of remote accesses.
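On Linux, the NUMA topology is exposed under /sys/devices/system/node. A minimal Python sketch that lists the online nodes; the range-list format (e.g. "0-1") is the kernel's standard encoding, and the fallback branch is for hosts without this sysfs entry:

```python
from pathlib import Path

def parse_node_list(text: str) -> list[int]:
    """Parse the kernel's range-list format, e.g. '0-1,4' -> [0, 1, 4]."""
    nodes = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.extend(range(int(lo), int(hi) + 1))
        elif part:
            nodes.append(int(part))
    return nodes

online = Path("/sys/devices/system/node/online")
if online.exists():
    print("NUMA nodes:", parse_node_list(online.read_text()))
else:
    print("no NUMA sysfs information (non-Linux host)")
```

Tools such as `numactl --hardware` present the same information with per-node memory sizes and distances.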
CPU Pinning :-
In virtualization, a guest’s vCPU is typically scheduled onto any core in the system. This can lead to sub-optimal cache performance as vCPUs migrate between CPU cores within a NUMA node or, worse, between NUMA nodes.
CPU pinning restricts which physical CPUs the virtual CPUs run on, preserving cache locality and resulting in faster memory reads and writes.
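On Linux, pinning is done by setting a process's CPU affinity mask, the same mechanism used by tools like taskset and by hypervisors for vCPU threads. A minimal sketch, assuming a Linux host, that pins the current process to one CPU and then restores the original mask:

```python
import os

# Current set of CPUs this process is allowed to run on.
allowed = os.sched_getaffinity(0)
print("allowed CPUs:", sorted(allowed))

# Pin the process to a single CPU from that set...
one_cpu = {min(allowed)}
os.sched_setaffinity(0, one_cpu)
assert os.sched_getaffinity(0) == one_cpu

# ...then restore the original affinity mask.
os.sched_setaffinity(0, allowed)
```

For guests, the equivalent is done in the hypervisor layer (e.g. libvirt's vcpupin), so each vCPU thread inherits a fixed affinity.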

Huge Pages :-
Memory, both virtual and physical, is divided into fixed-size pages. The most widespread page size is 4 KB, but it is often possible to use bigger pages, called huge pages. Supported page sizes are limited by both the hardware and the kernel; on x86-64 machines, the available sizes are typically 4 KB, 2 MB, and 1 GB. Operating systems supporting huge pages allow pages of different sizes to co-exist simultaneously.
With larger pages, the operating system manages fewer pages and accesses memory pages faster, thus optimizing memory management. An additional benefit of HugePages is that the larger pages cannot be swapped out to disk.
Huge pages are a memory management technique used in Linux systems to improve performance using larger memory blocks than the default page size. They help reduce the pressure on the Translation Lookaside Buffer (TLB) and lower the overhead of managing memory in systems with large amounts of RAM.
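On Linux, the configured huge-page size and pool counters are visible in /proc/meminfo. A small sketch that parses those fields, shown here against sample content; on a real host you would pass the file's actual contents:

```python
SAMPLE = """\
MemTotal:       16384000 kB
HugePages_Total:      64
HugePages_Free:       64
Hugepagesize:       2048 kB
"""

def hugepage_info(meminfo: str) -> dict[str, int]:
    """Extract huge-page fields (values in kB for sizes, plain counts otherwise)."""
    info = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages") or key == "Hugepagesize":
            info[key] = int(rest.split()[0])
    return info

print(hugepage_info(SAMPLE))
# → {'HugePages_Total': 64, 'HugePages_Free': 64, 'Hugepagesize': 2048}
```

On a live system: `hugepage_info(open("/proc/meminfo").read())`.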
Virtual memory :-
Virtual memory provides an abstraction layer between the physical memory and the applications. It allows each process to have its own private address space, isolated from other processes’ address spaces. This ensures that one process cannot directly access the memory of another process, providing security and stability.
Virtual memory also allows a computer to use more memory than is physically installed by using disk storage as an extension of physical memory. This is achieved through swapping or paging, where portions of memory are temporarily stored on disk and loaded back into memory when needed.
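The isolation between address spaces can be observed directly: a child process that modifies a variable does not affect the parent's copy, because each process gets its own private address space. A minimal Python sketch:

```python
import multiprocessing

data = [1, 2, 3]  # lives in the parent's address space

def child_mutates():
    # Runs in a separate process with its own address space;
    # this append is invisible to the parent.
    data.append(99)

if __name__ == "__main__":
    p = multiprocessing.Process(target=child_mutates)
    p.start()
    p.join()
    print(data)  # → [1, 2, 3] — the parent's copy is unchanged
```

The same virtual address exists in both processes, but it maps to different physical frames.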
Paging :-
Paging is the mechanism that enables virtual memory to work. It divides the virtual address space into fixed-size units called pages and the physical memory into fixed-size units called frames; a page is typically the same size as a frame. The operating system maintains a mapping between virtual pages and physical frames called the page table.
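The split of a virtual address into a page number and an offset within the page follows directly from the page size. A small worked example with 4 KB pages:

```python
PAGE_SIZE = 4096  # 4 KB, the common default page size

def split_address(vaddr: int) -> tuple[int, int]:
    """Return (virtual page number, offset within the page)."""
    return vaddr // PAGE_SIZE, vaddr % PAGE_SIZE

vpn, offset = split_address(0x12345)
print(hex(vpn), hex(offset))  # → 0x12 0x345
```

The page table translates only the page number; the offset is carried over unchanged into the physical frame.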
Translation Lookaside Buffers (TLB) :-
When a CPU accesses memory using a virtual address, that virtual address must first be translated into a physical address.
The translation involves looking up the virtual address in the page table, which contains mappings between virtual pages and physical frames. However, the page table is a large, multi-level structure stored in main memory, and walking it on every memory access would be far too slow.
The CPU uses a TLB to cache recent virtual-to-physical address translations to speed up address translation. The TLB is much faster than the main memory because it’s a small, specialised cache built within the CPU.
When the CPU needs to translate a virtual address, it first checks the TLB for the translation. The CPU can quickly access the corresponding physical memory if the translation is found in the TLB (a “TLB hit”).
If the translation is not found in the TLB (a “TLB miss”), the CPU must access the page table in the main memory to perform the translation. This process is called a “page table walk” and can be time-consuming. After the translation is obtained from the page table, the CPU updates the TLB with the new translation, so future accesses to the same virtual address can benefit from the TLB cache.
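The hit/miss behaviour can be sketched with a tiny simulated TLB: a fixed-size cache of page-number → frame-number translations, backed by a (slow) page-table lookup on a miss. The capacity, the toy page table, and the LRU eviction policy here are illustrative assumptions:

```python
from collections import OrderedDict

PAGE_TABLE = {0: 7, 1: 3, 2: 9, 3: 1}  # toy page table: vpn -> frame

class TLB:
    def __init__(self, capacity: int = 2):
        self.entries = OrderedDict()  # vpn -> frame, kept in LRU order
        self.capacity = capacity
        self.hits = self.misses = 0

    def translate(self, vpn: int) -> int:
        if vpn in self.entries:          # TLB hit: fast path
            self.hits += 1
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        self.misses += 1                 # TLB miss: walk the page table
        frame = PAGE_TABLE[vpn]
        self.entries[vpn] = frame
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return frame

tlb = TLB()
for vpn in [0, 1, 0, 2, 0]:
    tlb.translate(vpn)
print(tlb.hits, tlb.misses)  # → 2 3
```

Repeated accesses to the same pages hit in the TLB; touching a new page triggers a miss and, in a real CPU, a page-table walk.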
Real-Time Kernel (RT) :-
It is an optimised kernel designed to maintain low latency, consistent response time, and determinism.
A real-time capable Linux kernel aims to provide a bounded response time to an external event.
For further reading, see:
https://ubuntu.com/blog/what-is-real-time-linux-i
PTP :-
Time synchronisation is required in any network.
PTP, or Precision Time Protocol, is another network-based time synchronisation standard, but instead of millisecond-level synchronisation, PTP networks aim for sub-microsecond, often nanosecond-level, synchronisation. For most commercial and industrial applications, NTP is more than accurate enough, but if you need tighter synchronisation and timestamping, you’ll need to migrate to PTP.
It uses hardware timestamping. PTP equipment is dedicated to one specialised purpose: keeping devices synchronised. For that reason alone, PTP networks have much sharper time resolutions. Unlike NTP, PTP devices will actually timestamp the amount of time that synchronization messages spend in each device, which accounts for device latency.
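PTP's core computation can be illustrated with the standard two-way exchange: the master sends a Sync at t1, the slave receives it at t2, the slave sends a Delay_Req at t3, and the master receives it at t4. Assuming a symmetric network path (the key PTP assumption), the offset and delay follow directly; the timestamps below are made-up nanosecond values:

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Offset of the slave clock from the master, and one-way path delay,
    assuming the forward and reverse path delays are equal."""
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay

# Example: slave clock is 500 ns ahead, true one-way delay is 100 ns.
t1 = 1_000_000
t2 = t1 + 100 + 500   # Sync arrival per slave clock: delay + offset
t3 = t2 + 50          # slave sends Delay_Req shortly after
t4 = t3 - 500 + 100   # arrival per master clock: delay - offset
print(ptp_offset_and_delay(t1, t2, t3, t4))  # → (500.0, 100.0)
```

Hardware timestamping matters because it takes t1–t4 at the NIC rather than in software, removing OS jitter from the measurement.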
DPDK :-
Kernel space :- The memory space where the core of the operating system (the kernel) executes and provides its services. It is reserved for the kernel, device drivers, and other kernel extensions.
User space :- Also known as userland, this is the memory space where all user applications run; everything other than the kernel and its extensions runs here.
By default, Linux processes packets in the kernel; as NIC (Network Interface Card) speeds increase, this puts growing pressure on the kernel to process packets faster.
Several techniques bypass the kernel to achieve better packet efficiency by processing packets in user space instead of kernel space. DPDK is one such technology.
DPDK is a set of libraries and drivers for fast packet processing, enabling users to build high-performance packet-processing applications.
Without DPDK, packet processing goes through the kernel network stack, which is interrupt-driven. Each time the NIC receives incoming packets, a kernel interrupt fires to process them, followed by a context switch from kernel space to user space. This creates delay.
With DPDK, the processing happens in user space using poll-mode drivers. These drivers poll data directly from the NIC, completely bypassing kernel space and improving the throughput rate of data.
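The interrupt-driven vs. poll-mode distinction can be caricatured in Python with a non-blocking socket: instead of sleeping until the kernel wakes it up (the interrupt-style blocking read), the receiver spins, repeatedly asking for data. This is only an analogy for DPDK's poll-mode drivers, not DPDK itself, which polls the NIC's RX rings directly in user space:

```python
import socket

# Datagram socketpair keeps message boundaries, like distinct packets.
rx, tx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
rx.setblocking(False)  # poll-mode: recv never sleeps in the kernel

tx.send(b"pkt-1")
tx.send(b"pkt-2")

received = []
polls = 0
while len(received) < 2:
    polls += 1
    try:
        received.append(rx.recv(2048))  # returns immediately if data is ready...
    except BlockingIOError:
        pass  # ...otherwise the loop just spins and tries again (busy-poll)
print(received, "after", polls, "polls")
```

Busy-polling trades CPU cycles for latency: the polling core is always running, which is why DPDK deployments usually pin dedicated cores to the poll loop.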
PCI-Passthrough :-
In non-virtualized environments, the data traffic is received by the physical NIC (pNIC) and is sent to an application in the user space via the kernel space. However, in a virtual environment or VNF network, there are pNICs, virtual NICs (vNICs), a hypervisor, and a virtual switch in between.
The hypervisor and the virtual switch take the data from the pNIC and then send it to the vNIC of the Virtual Machine or the Virtual Network Function (VNF), then to the application. The virtual layer causes virtualization overhead and additional packet processing that reduces I/O packet throughput and builds up bottlenecks.
Peripheral Component Interconnect (PCI) passthrough gives Virtual Machines or VNFs direct access to physical PCI devices, which appear and behave as if they were physically connected to the VNF. PCI passthrough can be used to map a single pNIC to a single VNF, making the VNF appear to be directly connected to the pNIC.
However, using PCI passthrough dedicates an entire pNIC to a single VNF. It cannot be shared with other Virtual Network Functions (VNFs). Therefore, it limits the number of VNFs to the number of pNICs in the system.
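In practice, PCI passthrough is configured through the hypervisor's management layer; with libvirt, for example, a hostdev element attaches a host PCI device directly to the guest. A hedged sketch of such a fragment; the PCI address (bus 0x03, slot 0x00) is a placeholder for the pNIC's actual address on the host:

```xml
<!-- Attach the host PCI device at 0000:03:00.0 directly to the guest. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```

With managed='yes', libvirt detaches the device from its host driver before starting the guest and reattaches it afterwards.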
SR-IOV :-
SR-IOV stands for “Single Root I/O Virtualization”.
The SR-IOV specification defines a standardized mechanism to virtualize PCIe devices, allowing a single PCIe Ethernet controller to appear as multiple separate PCIe devices.
By creating virtual slices of a PCIe device (virtual functions), each slice can be assigned to a single VM/VNF, removing the limitation that PCI passthrough dedicates one whole pNIC per VNF.
This can be further coupled with DPDK as part of VNF, thus taking combined advantage of DPDK and SR-IOV.
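On Linux, virtual functions (VFs) are created by writing the desired count to the device's sriov_numvfs attribute in sysfs, the same operation `echo N > .../sriov_numvfs` performs. A hedged sketch: the PCI address is a placeholder, a real write requires root and SR-IOV-capable hardware, and the sysfs_root parameter exists only so the function can be exercised without either:

```python
from pathlib import Path

def set_numvfs(pci_addr: str, count: int, sysfs_root: str = "/sys") -> None:
    """Create `count` SR-IOV virtual functions on the given PCI device.

    The kernel rejects changing a non-zero VF count directly, so the
    attribute is reset to 0 first. sysfs_root is overridable for testing.
    """
    attr = Path(sysfs_root) / "bus/pci/devices" / pci_addr / "sriov_numvfs"
    attr.write_text("0")           # reset the VF count before changing it
    attr.write_text(str(count))

# Example (requires root on a real host):
# set_numvfs("0000:03:00.0", 4)
```

Each VF then shows up as its own PCI device, which can be handed to a VM/VNF via PCI passthrough, optionally bound to a DPDK poll-mode driver inside the guest.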
