VPP/How To Optimize Performance (System Tuning)
This page describes system configuration tweaks that can help maximize the packet processing performance of VPP applications.
Contents
- 1 General Considerations
- 2 BIOS settings
- 3 Running VPP in a KVM VM
- 3.1 Disable Interrupt Balancing (irqbalance)
- 3.2 In a VM: Disable Kernel Samepage Merging (KSM)
- 3.3 In a VM: Configure KVM Parameters
- 3.4 In a VM: Remove VirtIO Balloon Driver
- 3.5 In a VM: Set CPU Affinity and NUMA Memory Policy for the VPP VM threads
- 3.6 Set CPU Affinity for VPP in the VM
- 3.7 In a VM: Don't run anything else in the VM!
- 3.8 Hyperthreading
- 4 Other
- 5 VPP configuration
- 6 References
General Considerations
WARNING: The suggestions on this page have been validated on Intel CPUs ONLY. The applicability of these suggestions to other CPU architectures (such as arm64) has not been verified. Please consider any adjustments that might be appropriate for non-Intel CPUs.
Most of the suggestions on this page apply to both virtual machines and bare-metal OS instances (by "bare metal" we mean an operating system running directly on hardware rather than in a virtual machine). Sections whose titles contain the words "In a VM" apply only to an OS running in a virtual machine.
BIOS settings
Power Management
Intel processors have a power management feature that puts the system into a power-saving mode when it is underutilized. This feature should be turned off to avoid variance in VPP application performance, and the system should be configured for maximum performance in the BIOS. The downside is that power consumption does not drop even when the host system is idle.
For maximum performance, low-power processor states (C6, C1 enhanced) should be disabled.
Turboboost / Speedstep
Speedstep is a CPU feature that dynamically adjusts the frequency of the processor to meet processing needs, decreasing the frequency under low CPU-load conditions. Turboboost overclocks a core when the demand for CPU is high. Turboboost requires that Speedstep is enabled.
While these two features are good for power saving, they can introduce variance in dataplane performance when there is a burst of packets. For consistency of behavior, these two features should be disabled.
Alternatively, if maximum performance matters more than consistency, Speedstep and Turboboost can both be enabled. BIOS changes are likely not sufficient to enable Turboboost: the host OS may also need changes to support running at higher clock speeds. The specific configuration changes required are different on Ubuntu, CentOS, RedHat, etc. Please see this link for details: Avoiding CPU speed scaling
On Ubuntu, "performance" mode should be set for all CPU cores in these files:
root# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance
<etc>
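As a rough sketch of how this check could be scripted, the helper below counts the cores whose governor is not "performance". The function name `count_non_performance` is hypothetical, and it assumes the cpufreq driver exposes the `scaling_governor` files shown above.

```shell
# Hypothetical helper: print the number of CPUs whose cpufreq governor is
# not "performance". The directory argument defaults to the real sysfs path.
count_non_performance() {
    cpu_dir="${1:-/sys/devices/system/cpu}"
    count=0
    for f in "$cpu_dir"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue      # glob did not match, or no cpufreq driver
        [ "$(cat "$f")" = "performance" ] || count=$((count + 1))
    done
    echo "$count"
}

# To switch every core to the performance governor (run as root):
#   for f in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
#       echo performance > "$f"
#   done
```

Running `count_non_performance` with no argument inspects the live system; a result of 0 means every core that exposes a governor is already in performance mode.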
The following output is from a system with an "Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz" with Turboboost enabled, showing the cores running at 2.90 GHz:
root # grep MHz /proc/cpuinfo
cpu MHz : 2900.292
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2899.902
cpu MHz : 2899.902
cpu MHz : 2899.902
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2900.000
Virtualization Extensions
Intel virtualization extensions (the "VT" setting, for VT-x), VT-d (for direct I/O), and DMA remapping (DMAR) must be turned on. VT-d enables the IOMMU virtualization capabilities that are required for PCIe passthrough. Interrupt remapping should also be enabled so that hardware interrupts can be remapped to a VM for PCIe passthrough.
On host bootup, the output of the command should look like:
$ dmesg | grep -e DMAR -e IOMMU
[ 0.000000] ACPI: DMAR 000000008f6dd000 001A8 (v01 Cisco0 CiscoUCS 00000001 MSFT 0100000D)
[ 0.056527] dmar: IOMMU 0: reg_base_addr fe710000 ver 1:0 cap c90780106f0462 ecap f020ff
[ 0.056637] IOAPIC id 8 under DRHD base 0xfe710000 IOMMU 0
[ 0.056638] IOAPIC id 9 under DRHD base 0xfe710000 IOMMU 0
We recommend disabling VT-d Coherency Support for higher performance.
Note that Intel Sandy Bridge CPUs have a limitation with their VT-d IOTLB that limits PCIe passthrough throughput. Sandy Bridge (and earlier) CPUs are not recommended if high performance is required.
CMDLINE configuration parameters
Kernel command line during startup
Set the grub configuration for those kernel parameters which can only be set via the kernel command line during startup:
GRUB_CMDLINE_LINUX="intel_iommu=on isolcpus=1-13 nohz_full=1-13 rcu_nocbs=1-13 hugepagesz=1GB hugepages=64 default_hugepagesz=1GB"
In the example command above:
- intel_iommu=on must be set for PCIe-passthrough interfaces to work in a VM.
- isolcpus=1-13 nohz_full=1-13 rcu_nocbs=1-13 are multi-core scheduler / placement configuration parameters.
- hugepagesz=1GB hugepages=64 default_hugepagesz=1GB - setting 1GB hugepages will drastically improve VPP initialization times.
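After a reboot it can be useful to confirm that the parameters actually reached the kernel. The following is a minimal sketch; the function name `missing_cmdline_params` is hypothetical, and the parameter values are the ones from the example above.

```shell
# Hypothetical helper: print each expected parameter that is absent from the
# given kernel command line file (normally /proc/cmdline).
missing_cmdline_params() {
    cmdline_file="$1"
    shift
    for want in "$@"; do
        found=no
        for have in $(cat "$cmdline_file"); do
            [ "$have" = "$want" ] && found=yes
        done
        [ "$found" = yes ] || echo "$want"
    done
}

# Example use on a live system:
#   missing_cmdline_params /proc/cmdline intel_iommu=on isolcpus=1-13 \
#       nohz_full=1-13 rcu_nocbs=1-13
# The hugepage reservation itself can be inspected with:
#   grep Huge /proc/meminfo
```

Empty output means every listed parameter is present on the running kernel's command line.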
Tickless Kernel
For high performance applications, using a tickless kernel can result in improved performance. The host kernel must have the cores operating in tickless mode and the same cores should be dedicated to the vpp application.
You can check if local timer interrupts are occurring on each core from the output of:
grep LOC /proc/interrupts
or dynamically with:
watch -n1 -d "cat /proc/interrupts | egrep 'LOC|CPU'"
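To put a number on the timer interrupt rate per core, two snapshots of the LOC line can be compared. This is only a sketch: `loc_delta` is a hypothetical name, and it assumes the columns of the LOC line correspond to cpu0, cpu1, ... in order, which holds when all CPUs are online.

```shell
# Hypothetical helper: given two files that each hold one snapshot of the
# "LOC:" line from /proc/interrupts, print "cpuN <delta>" for every CPU column.
loc_delta() {
    awk 'NR == FNR { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) before[i] = $i; next }
         { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) print "cpu" (i - 2), $i - before[i] }' "$1" "$2"
}

# Example use on a live system (10 second sampling window):
#   grep LOC: /proc/interrupts > /tmp/loc.1
#   sleep 10
#   grep LOC: /proc/interrupts > /tmp/loc.2
#   loc_delta /tmp/loc.1 /tmp/loc.2
```

On a core that is running tickless, the delta over a 10 second window should be close to 10 rather than in the thousands.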
The host kernel may have been built with the CONFIG_NO_HZ_FULL_ALL option. If so, tickless operation will happen automatically on any core on which the Linux scheduler has only one thread to run. To check for this, look for that string in your Linux kernel config file. This file may be at /boot/config-<kernel version> (determine your kernel version with "uname -a") or at /proc/config.gz.
If the kernel was built with CONFIG_NO_HZ_FULL but not CONFIG_NO_HZ_FULL_ALL, it may still be possible to run tickless by configuring it in the grub file (see "Kernel command line during startup"). Specify the same set of CPUs for both nohz_full and isolcpus.
To eliminate local timer interrupts, RCU callbacks need to be isolated as well. This is either done in the kernel config, or by the rcu_nocbs grub option.
GRUB_CMDLINE_LINUX="intel_iommu=on isolcpus=1-13 nohz_full=1-13 rcu_nocbs=1-13 hugepagesz=1GB hugepages=64 default_hugepagesz=1GB"
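The kernel config lookup described above can be sketched as a small helper. The function name `kernel_config_enabled` is hypothetical; the config file locations are the usual distribution defaults and may differ on your system.

```shell
# Hypothetical helper: succeed if OPTION is set to "y" in a kernel config
# file. Handles plain text files and gzip-compressed ones (/proc/config.gz).
kernel_config_enabled() {
    config_file="$1"
    option="$2"
    case "$config_file" in
        *.gz) gzip -dc "$config_file" ;;
        *)    cat "$config_file" ;;
    esac | grep -q "^${option}=y\$"
}

# Example use on a live system:
#   if kernel_config_enabled "/boot/config-$(uname -r)" CONFIG_NO_HZ_FULL; then
#       echo "nohz_full= is supported by this kernel"
#   fi
```

The anchored pattern avoids a false match on longer option names, so a check for CONFIG_NO_HZ_FULL does not match a CONFIG_NO_HZ_FULL_ALL line.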
In a VM: CPU Isolation on Host (isolcpus)
For optimal performance of a virtual machine, specifically for the dataplane/forwarding features, the CPUs assigned to the virtual machine should be used exclusively by the VM. One reasonable way to configure this is via cgroup configuration, where the CPUs assigned to the cgroup node for the VM are not shared with other tasks on the system. Kernel threads will still run on all the cores, so this does not give complete isolation. This configuration ensures that the host does not schedule other tasks on the same physical CPU, and thus lets the qemu thread (and, by that token, the guest) run on that core (almost) exclusively.
The host kernel threads can still be scheduled on the pcpu as mentioned earlier. To isolate the host CPUs completely, even from the kernel threads, isolcpus can be configured. The qemu threads can then be pinned to the isolated cpus. This requires grub configuration on the host and isolates them from running any load (other than the load that’s explicitly pinned to these cores).
Most deployments may not need this configuration, as it requires customized workload scheduling on the host system. This information also needs to be propagated to the virtual router/virtual machine, which needs to use the same isolated CPUs. Our recommendation is to use this mechanism only if the operator has the need, and the systems in place, to configure and manage the host at this level of detail.
Be aware that cpu cores on a socket may not be numbered contiguously. This can be checked with:
grep "physical id" /proc/cpuinfo
For example, on an HP ProLiant DL380 Gen9 with two 12-core Xeon E5-2680 v3 CPUs, the cores from different sockets are interleaved like this:
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
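Reading a long column of "physical id" values by eye is error-prone, so the mapping can be summarized per socket. The following is a sketch under the assumption that /proc/cpuinfo contains "processor" and "physical id" lines, as it does on Intel systems; `cpus_by_socket` is a hypothetical name.

```shell
# Hypothetical helper: print one line per socket listing its logical CPUs,
# e.g. "socket 0: 0 1 2". The file argument defaults to /proc/cpuinfo.
cpus_by_socket() {
    awk -F': *' '
        /^processor/   { cpu = $2 }
        /^physical id/ { list[$2] = list[$2] " " cpu }
        END { for (s in list) print "socket " s ":" list[s] }
    ' "${1:-/proc/cpuinfo}" | sort
}
```

The output makes it easier to choose an isolcpus list that keeps the VPP cores on a single socket.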
There must always be one core that is not isolated. Commonly this is cpu 0. isolcpus is configured on the kernel command line in the grub file (see "Kernel command line during startup"). The cpu list can be a combination of ranges and/or comma-separated values, such as isolcpus=1-13 or isolcpus=11,12,13,14 or isolcpus=0-5,12-17.
For example:
GRUB_CMDLINE_LINUX="intel_iommu=on isolcpus=1-13 nohz_full=1-13 hugepagesz=1GB hugepages=64 default_hugepagesz=1GB"
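When scripting around these options it helps to expand the list syntax into individual CPU numbers, for example to confirm that isolcpus and nohz_full name the same cores. The helper below is a sketch with a hypothetical name, `expand_cpu_list`; it handles only the plain range and comma forms shown above.

```shell
# Hypothetical helper: expand a CPU list such as "0-5,12-17" into
# space-separated CPU numbers on one line.
expand_cpu_list() {
    echo "$1" | tr ',' '\n' | awk -F- '
        NF == 1 { printf "%s%d", sep, $1; sep = " " }
        NF == 2 { for (c = $1 + 0; c <= $2 + 0; c++) { printf "%s%d", sep, c; sep = " " } }
        END { print "" }'
}

# expand_cpu_list 0-5,12-17   prints: 0 1 2 3 4 5 12 13 14 15 16 17
# expand_cpu_list 11,12,13,14 prints: 11 12 13 14
```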
The CONFIG_NO_HZ_FULL Linux kernel build option is used to configure a tickless kernel. The idea is to configure certain processor cores to operate in tickless mode so that they do not receive periodic timer interrupts. These cores run dedicated tasks, and no other tasks are scheduled on them, obviating the need for a scheduling tick. A CONFIG_HZ-based timer interrupt will invalidate L1 cache on the core, and this can degrade dataplane performance by a few percentage points (to be quantified, but estimated at 1-3%). Running tickless typically means getting 1 timer interrupt per second instead of 1000 per second.
Running VPP in a KVM VM
The following configuration tweaks have been used to demonstrate zero packet drops at 98-100% of maximum bandwidth while forwarding 1000-byte IPv4 packets bi-directionally between two 10GigE interfaces, using two 10GigE ports on an Ixia traffic generator.
Disable Interrupt Balancing (irqbalance)
The irqbalance daemon is enabled by default. It is designed to distribute hardware interrupts across CPUs in a multi-core system in order to increase performance. However, it can cause the CPU running the VPP VM to be stalled, causing dropped Rx packets. When irqbalance is disabled, all interrupts will be handled by cpu0, so the VPP VM (or any other service VM) should NOT run on cpu0.
Disable irqbalance by setting ENABLED="0" in the default configuration file (/etc/default/irqbalance):
#Configuration for the irqbalance daemon
#Should irqbalance be enabled?
ENABLED="0"
#Balance the IRQs only once?
ONESHOT="0"
Man page: http://manpages.ubuntu.com/manpages/precise/man1/irqbalance.1.html
In a VM: Disable Kernel Samepage Merging (KSM)
KSM is a memory-saving de-duplication feature that merges anonymous (private) pages, but not pagecache pages.
While diagnosing the vpp Rx zero packet drop issue, we noticed a correlation between the /sys/kernel/debug/kvm/pf_fixed counter being incremented and the periodic Rx packet drops. We observed that disabling KSM eliminated the incrementing of these counters. KSM is enabled in Ubuntu 14.04 server on the host OS only. It is disabled when Ubuntu 14.04 server is run in a VM.
Disable KSM by writing "0" to /sys/kernel/mm/ksm/run in the host OS:
sudo bash
echo 0 > /sys/kernel/mm/ksm/run
exit
For more information, see: http://www.linux-kvm.org/page/KSM
In a VM: Configure KVM Parameters
In order to run VPP in a VM, the following parameters must be configured on the command line invocation or in the libvirt / virsh xml domain configuration:
-cpu host : This parameter causes the VM to inherit the host OS flags. Note: libvirt 0.9.11 or greater is required for this to be included in the xml configuration.
-m 8192 : 8 GB of ram is required for optimal zero packet drop rates. TBD: Need to investigate why this is true. 4GB has Rx pkt drops even though there is only 2.2GB allocated!
-smp 2,sockets=1,cores=4,threads=2
To disable PXE boot delays, add the ",rombar=0" option to the end of each "-device" option list or add "<rom bar='off'/>" to the device xml configuration.
In a VM: Remove VirtIO Balloon Driver
Use of the VirtIO Balloon driver in the vpp VM causes Rx packet drops when the balloon driver calls mmap().
Remove the VirtIO Balloon Driver from the VM configuration:
If editing the xml configuration, remove the memballoon driver by setting the model='none':
<memballoon model='none'/>
or delete the device definition from the command line parameter list:
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
In a VM: Set CPU Affinity and NUMA Memory Policy for the VPP VM threads
CPU affinity and NUMA memory policy can be configured with libvirt.
For more information, see: https://libvirt.org/formatdomain.html#elementsNUMATuning
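For a domain that is already defined, the same pinning can be applied from the command line with virsh. The sketch below only prints the commands so they can be reviewed before being run. The function name `print_pin_commands`, the domain name `vpp-vm`, and the CPU and node numbers are all hypothetical placeholders.

```shell
# Hypothetical helper: print (do not run) the virsh commands that pin
# vCPU 0, 1, ... to the given host CPUs in order, and bind guest memory
# to a single NUMA node.
# Usage: print_pin_commands <domain> <numa-node> <host-cpu>...
print_pin_commands() {
    domain="$1"
    node="$2"
    shift 2
    vcpu=0
    for host_cpu in "$@"; do
        echo "virsh vcpupin $domain --vcpu $vcpu --cpulist $host_cpu --config"
        vcpu=$((vcpu + 1))
    done
    echo "virsh numatune $domain --mode strict --nodeset $node --config"
}

# Example: print_pin_commands vpp-vm 0 2 3
```

The host CPUs passed in should belong to the NUMA node given as the second argument, and ideally to the set isolated with isolcpus. With `--config`, the settings take effect the next time the domain is started.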
Set CPU Affinity for VPP in the VM
To prevent the Linux scheduler from relocating the vpe application to a different CPU, and to prevent interrupt handlers from running on the same CPU as vpe, set the qn application CPU affinity to cpu0 and the vpe application CPU affinity to cpu1.
If occasional packet drops are acceptable (e.g. a few hundred packets / 10s of minutes), this configuration step may be omitted.
Note: Given the KVM -smp options, there is only one NUMA node, thus no need to set NUMA memory affinity in the VM for the vpe application.
A VPP application should configure the correct cpu affinity during application initialization.
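If the affinity needs to be checked or set by hand inside the VM, the kernel reports the allowed CPUs in the process status file. This is a minimal sketch; `allowed_cpus` is a hypothetical name and `<pid>` stands for the process ID of the application.

```shell
# Hypothetical helper: print the CPU list a process is allowed to run on,
# given its status file (normally /proc/<pid>/status).
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "$1"
}

# Example use on a live system:
#   allowed_cpus /proc/<pid>/status
# To pin an already running process to cpu1 with util-linux taskset:
#   taskset -cp 1 <pid>
```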
In a VM: Don't run anything else in the VM!
As noted in the previous section, setting the CPU affinity for the vpe and qn applications in the VM is important to prevent Rx packet drops under certain circumstances. Running other applications (e.g. htop) in the VPP VM may also cause Rx packet drops.
Hyperthreading
When hyperthreading is enabled, each physical CPU core appears as two logical cores. Each logical core shares the resources (L1 and L2 cache, registers) of the physical core. This is controlled by a setting in the BIOS.
In general dataplane performance suffers when hyperthreading is enabled and so the recommendation is to disable it.
Since HT is a BIOS setting and changing it requires a reboot, a deployment will in practice choose one setting and keep it, rather than enabling or disabling HT based on the workload being run on the machine.
If HT is enabled, it is still possible to obtain the same performance as with HT disabled. To do this, isolate the extra logical cores (see CPU isolation) and do not assign any threads to them.
Other
Transparent Hugepages
The transparent hugepage (THP) feature automates the task of creating and managing hugepages. A kernel daemon process (khugepaged) runs in the background and stitches free pages together to form/free hugepages.
We recommend turning this feature off and instead allocating hugepages explicitly (this is not a strong recommendation). It is possible to preallocate hugepages and still have the THP daemon running on the host system.
To turn off THP:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
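The kernel reports the active THP mode by putting it in square brackets, for example "always madvise [never]". A small sketch for extracting it follows; `thp_mode` is a hypothetical name, and the default path used is /sys/kernel/mm/transparent_hugepage/enabled.

```shell
# Hypothetical helper: print the active THP mode, i.e. the word shown in
# square brackets in the sysfs "enabled" file.
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Example use on a live system:
#   [ "$(thp_mode)" = "never" ] || echo "THP is still active"
```

Note that a value written with echo does not survive a reboot, so the check is worth repeating after the host restarts.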
Memory locking / Swap behavior
On a heavily loaded host system, Linux will evict a process's pages to free memory. This can happen to text pages, which are backed by a physical store. If swapping is enabled, the data segments can be swapped out to the swap area on disk when the system is running low on memory. This typically happens when the system is overprovisioned, which is the typical setup on a server but uncommon on embedded systems. Swapping leads to slow and non-deterministic response times, because accessing a page that is not in memory adds latency.
Our recommendation for running NFV applications is not to overprovision the system, and specifically to avoid swapping (turn swap off). For deterministic response times, we recommend pinning qemu memory for VPP applications. Pinning/locking qemu memory ensures that the qemu process pages are always memory resident. This provides consistent response times.
The parameter to turn on locking of qemu process memory is: -realtime mlock=on
A few things need to be considered to turn on page locking. The calling process must have the process limits (prlimit) set appropriately to lock the appropriate amount/size of memory. If using virsh to start the virtual router, the process limits of libvirtd must be set appropriately.
To validate that the process memory is locked, check the value of VmLck field in /proc/<pid>/status file. The <pid> needs to be the pid of the qemu process (or pid of any of the qemu threads for the virtual router).
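The VmLck check can be scripted as follows. This is a sketch; `locked_kb` is a hypothetical name and `<pid>` stands for the qemu process ID.

```shell
# Hypothetical helper: print the amount of locked memory in kB reported in
# a process status file (normally /proc/<pid>/status).
locked_kb() {
    awk '/^VmLck:/ { print $2 }' "$1"
}

# Example use on a live system:
#   locked_kb /proc/<pid>/status
# The locked-memory limit that applies to the process can be shown with
# util-linux prlimit:
#   prlimit --memlock --pid <pid>
```

A value of 0 means nothing is locked. With locking enabled, the value is expected to be on the order of the guest memory size.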
KSM
Kernel Same-page Merging (also known as kernel shared memory and memory merging) is a kernel feature that makes it possible for a hypervisor system to share identical memory pages amongst different processes or among multiple virtual machines. While not directly linked, Kernel-based Virtual Machine (KVM) can use KSM to merge memory pages occupied by virtual machines.
KSM is a Linux kernel feature (today qemu is the only client application). KSM consumes non-trivial CPU resources on the host system in trying to optimize memory utilization. KSM attempts to merge pages at periodic intervals (typically 200 ms, but configurable via the entry in /sys/kernel/mm/ksm/sleep_millisecs).
We recommend turning this function off when running a single vpp instance.
If there are multiple vpp instances running on a system, turning on this feature will save memory at the expense of some cpu cycles.
To turn off this feature, execute:
echo 0 > /sys/kernel/mm/ksm/run
If it's not practical to turn off ksm, we recommend turning off ksm across numa nodes:
echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
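To confirm the resulting KSM state, the relevant sysfs entries can be printed together. The sketch below uses a hypothetical function name, `ksm_summary`; entries that do not exist on a given kernel (for example merge_across_nodes on a non-NUMA build) are skipped.

```shell
# Hypothetical helper: print "name=value" for the main KSM sysfs entries.
# The directory argument defaults to the real sysfs path.
ksm_summary() {
    ksm_dir="${1:-/sys/kernel/mm/ksm}"
    for name in run sleep_millisecs pages_shared merge_across_nodes; do
        if [ -r "$ksm_dir/$name" ]; then
            echo "$name=$(cat "$ksm_dir/$name")"
        fi
    done
}

# When not in a root shell, note that "sudo echo 0 > file" does not work,
# because the redirection is performed by the unprivileged shell. Use:
#   echo 0 | sudo tee /sys/kernel/mm/ksm/run
```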
In a VM: Pass host CPU flags to the guest
Pass the host CPU configuration to the virtual router. This is specifically important so that the guest can see whether the host CPU supports 1GB huge pages (the pdpe1gb flag in /proc/cpuinfo). This is done using the -cpu host option on the qemu command line.
VPP configuration
Multithreading
In any environment where high throughput performance is a requirement, it is suggested to run VPP in multithreaded mode.
If running in the default single-threaded configuration, the same thread that handles packet forwarding also performs administrative tasks such as responding to API calls or collecting statistics. These tasks may consume different amounts of time depending on NIC make and model, NIC placement, and the number of NICs configured for use in VPP, which allows external factors to impact forwarding performance. Therefore, even if the required performance target can be achieved by a single CPU core, running VPP in a "one main thread plus one worker thread" configuration will help to alleviate the impact external factors can have, and allow the one worker thread to deliver better and more consistent forwarding performance.
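A "one main thread plus one worker thread" setup is expressed in the cpu section of the VPP startup configuration. The sketch below prints such a section; the function name `print_vpp_cpu_stanza` and the core numbers 1 and 2 are assumptions chosen for illustration and should be replaced with cores that are isolated on your system.

```shell
# Hypothetical helper: print a startup.conf "cpu" section that places the
# main thread on one core and the worker thread(s) on the given core list.
# Usage: print_vpp_cpu_stanza <main-core> <worker-core-list>
print_vpp_cpu_stanza() {
    printf 'cpu {\n  main-core %s\n  corelist-workers %s\n}\n' "$1" "$2"
}

# Example: print_vpp_cpu_stanza 1 2
# After restarting VPP with the new configuration, thread placement can be
# checked with:
#   vppctl show threads
```

Keeping the main core separate from the worker core means API and statistics work does not compete with the forwarding thread for the same CPU.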
References
- kernel.org hugetlbpage doc
- hugetlbfs man page
- kernel.org kernel-per-CPU-kthreads doc
- kernel.org cgroup memory.txt
- kernel.org cgroups.txt
- kernel.org cgroup cpusets.txt
- kernel.org cgroup hugetlb.txt
- kernel.org cgroup devices.txt
- kvm VFIO (.pdf format)
- kernel.org vfio.txt
- red hat cpu/irq.html
- tickless kernel
- NO_HZ kernel operation
- kernel.org sched-domains.txt
- dpdk overview
- NO_HZ "full god mode"
- lwn.net transparent hugepages issue
- kernel.org kernel-per-CPU-kthreads.txt
- lwn.net hugetlbfs
- greenhost.nl multi-queue NICs with SMP on Linux
- irqbalance man page
- kernel.org IRQ-affinity.txt
- kernel.org network scaling.txt