VPP/How To Optimize Performance (System Tuning)

From fd.io
Revision as of 15:38, 18 March 2016 by Ckoester (Talk | contribs)

This page describes system configuration tweaks that can help maximize the packet processing performance of VPP applications.

General Considerations

WARNING: The suggestions on this page have been validated on Intel CPUs ONLY. The applicability of these suggestions to other CPU architectures (such as arm64) has not been verified. Please consider any adjustments that might be appropriate for non-Intel CPUs.

Most of the suggestions on this page apply to both virtual machines and Bare Metal OS instances (by "Bare Metal" we mean an instance of an operating system running directly on hardware and not on a virtual machine). Sections whose titles contain the words "In a VM" apply only to an OS running on a virtual machine.

BIOS settings

Power Management

Intel processors have a power management feature that puts the system into a power-saving mode when it is underutilized. This feature should be turned off in the BIOS, and the system configured for maximum performance, to avoid variance in VPP application performance. The downside is that power consumption does not drop even when the host system is idle.

For maximum performance, low-power processor states (C6, C1 enhanced) should be disabled.

Turboboost / Speedstep

Speedstep is a CPU feature that dynamically adjusts the processor frequency to meet processing needs, decreasing the frequency under low CPU load. Turboboost overclocks a core when the demand for CPU is high. Turboboost requires Speedstep to be enabled.

While these two features are good for power saving, they can introduce variance in dataplane performance when there is a burst of packets. For consistency of behavior, both features should be disabled.

If maximum performance matters more than consistency, Speedstep and Turboboost can both be enabled. BIOS changes alone are likely not sufficient to enable Turboboost; the host OS may also need changes to support running at higher clock speeds. The specific configuration changes required differ between Ubuntu, CentOS, RedHat, etc. Please see this link for details: Avoiding CPU speed scaling

On Ubuntu, "performance" mode should be set for all CPU cores in these files:

root# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance
<etc>
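
The check above can be scripted. The following is a minimal sketch, assuming the standard Linux cpufreq sysfs layout shown in the listing; the function names and the optional directory argument are illustrative and are not part of any VPP tooling.

```shell
# List every core whose scaling governor is not "performance".
# The optional argument (an assumption, for testing) overrides the sysfs CPU directory.
check_governors() {
    cpu_dir="${1:-/sys/devices/system/cpu}"
    status=0
    for f in "$cpu_dir"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        gov=$(cat "$f")
        if [ "$gov" != "performance" ]; then
            echo "$f: $gov"
            status=1
        fi
    done
    return "$status"
}

# Write "performance" to every core's governor file (requires root on a real system).
set_performance_governor() {
    cpu_dir="${1:-/sys/devices/system/cpu}"
    for f in "$cpu_dir"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] && echo performance > "$f"
    done
    return 0
}
```

Running check_governors with no arguments prints nothing and returns 0 when every core is already in performance mode. A value written through sysfs does not survive a reboot, so a persistent setting still needs the distribution-specific configuration.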

The following output is from a system with an "Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz" with Turboboost enabled, showing the cores running at 2.90GHz:

root # grep MHz /proc/cpuinfo
cpu MHz         : 2900.292
cpu MHz         : 2900.000
cpu MHz         : 2900.000
cpu MHz         : 2900.000
cpu MHz         : 2899.902
cpu MHz         : 2899.902
cpu MHz         : 2899.902
cpu MHz         : 2900.000
cpu MHz         : 2900.000
cpu MHz         : 2900.000
cpu MHz         : 2900.000
cpu MHz         : 2900.000

Virtualization Extensions

Intel virtualization extensions (VT-x), VT-d (for directed I/O), and DMA remapping (DMAR) must be turned on. VT-d enables the IOMMU virtualization capabilities that are required for PCIe passthrough. Interrupt remapping should also be enabled so that hardware interrupts can be remapped to a VM for PCIe passthrough.

After the host boots, the output of the following command should look like this:

$ dmesg | grep -e DMAR -e IOMMU 
[    0.000000] ACPI: DMAR 000000008f6dd000 001A8 (v01 Cisco0 CiscoUCS 00000001 MSFT 0100000D)
[    0.056527] dmar: IOMMU 0: reg_base_addr fe710000 ver 1:0 cap c90780106f0462 ecap f020ff
[    0.056637] IOAPIC id 8 under DRHD base  0xfe710000 IOMMU 0
[    0.056638] IOAPIC id 9 under DRHD base  0xfe710000 IOMMU 0

We recommend disabling VT-d Coherency Support for higher performance.

Note that Intel Sandy Bridge CPUs have a limitation with their VT-d IOTLB that limits PCIe passthrough throughput. Sandy Bridge (and earlier) CPUs are not recommended if high performance is required.

CMDLINE configuration parameters

Kernel command line during startup

Set the grub configuration for those kernel parameters which can only be set via the kernel command line during startup:

GRUB_CMDLINE_LINUX="intel_iommu=on isolcpus=1-13 nohz_full=1-13 rcu_nocbs=1-13 hugepagesz=1GB hugepages=64 default_hugepagesz=1GB"

In the example command above:

  • intel_iommu=on must be set for PCIe-passthrough interfaces to work in a VM.
  • isolcpus=1-13 nohz_full=1-13 rcu_nocbs=1-13 are multi-core scheduler / placement configuration parameters.
  • hugepagesz=1GB hugepages=64 default_hugepagesz=1GB - setting 1GB hugepages will drastically improve VPP initialization times.
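
After rebooting, it is worth confirming that the parameters actually reached the kernel. The sketch below compares a list of expected parameters against /proc/cmdline; the function name and the CMDLINE_FILE override are assumptions made for illustration. On Ubuntu the GRUB_CMDLINE_LINUX line normally lives in /etc/default/grub and takes effect after running update-grub and rebooting; other distributions may differ.

```shell
# Report any expected kernel parameter that is absent from the running command line.
# CMDLINE_FILE is a hypothetical override used only to make the sketch testable.
check_cmdline() {
    cmdline_file="${CMDLINE_FILE:-/proc/cmdline}"
    status=0
    for param in "$@"; do
        # Split the command line into one parameter per line and match exactly.
        if ! tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"; then
            echo "missing: $param"
            status=1
        fi
    done
    return "$status"
}
# Example: check_cmdline intel_iommu=on isolcpus=1-13 nohz_full=1-13 rcu_nocbs=1-13
```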


Tickless Kernel

For high performance applications, using a tickless kernel can result in improved performance. The host kernel must have the cores operating in tickless mode and the same cores should be dedicated to the vpp application.

You can check if local timer interrupts are occurring on each core from the output of:

grep LOC /proc/interrupts

or dynamically with:

watch -n1 -d "cat /proc/interrupts | egrep 'LOC|CPU'"
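
For scripted checks, the per-CPU counts can be extracted from the LOC line. This is a minimal sketch, assuming the usual /proc/interrupts layout of one numeric column per CPU; the function name and the optional file argument are illustrative.

```shell
# Print the local timer interrupt (LOC) count for each CPU.
# The optional file argument is an assumption that makes the sketch testable.
loc_counts() {
    awk '$1 == "LOC:" {
        # Numeric columns follow the label, one per CPU, before the description text.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
            print "cpu" (i - 2) ": " $i
    }' "${1:-/proc/interrupts}"
}
```

Calling loc_counts twice, one second apart, and comparing the two results shows how many timer interrupts each core received in that second.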

The host kernel may have been built with the CONFIG_NO_HZ_FULL_ALL option. If so, tickless operation will happen automatically on any core on which the Linux scheduler has only one thread to run. To check for this, look for that string in your Linux kernel config file. This file may be at /boot/config-<kernel version> (determine your kernel version with "uname -a") or at /proc/config.gz.
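
The lookup described above can be sketched as follows. The function name is illustrative, and the /boot/config-<kernel release> path is an assumption that holds on Ubuntu-style systems but may not hold elsewhere.

```shell
# Print the CONFIG_NO_HZ* and CONFIG_RCU_NOCB* lines from a kernel config file.
# With no argument, try /boot/config-<running kernel> and then /proc/config.gz.
nohz_config() {
    cfg="${1:-/boot/config-$(uname -r)}"
    if [ -r "$cfg" ]; then
        grep -E '^CONFIG_(NO_HZ|RCU_NOCB)' "$cfg"
    elif [ -r /proc/config.gz ]; then
        zcat /proc/config.gz | grep -E '^CONFIG_(NO_HZ|RCU_NOCB)'
    else
        echo "no kernel config found" >&2
        return 1
    fi
}
```

Only options that are set are printed; an option that is absent from the output was either not enabled or not present in that kernel version.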

If the kernel was not built with CONFIG_NO_HZ_FULL_ALL, it may still be possible to run tickless by configuring it in the grub file (see the "Kernel command line during startup" section). Specify the same set of CPUs for both nohz_full and isolcpus.

To eliminate local timer interrupts, RCU callbacks need to be isolated as well. This is done either in the kernel config or with the rcu_nocbs grub option.

GRUB_CMDLINE_LINUX="intel_iommu=on isolcpus=1-13 nohz_full=1-13 rcu_nocbs=1-13 hugepagesz=1GB hugepages=64 default_hugepagesz=1GB"

In a VM: CPU Isolation on Host (isolcpus)

For optimal performance of a virtual machine, specifically for the dataplane/forwarding features, the CPUs assigned to the virtual machine should be used exclusively by the VM. One reasonable way to configure this is via cgroup configuration, where the CPUs assigned to the cgroup node for the VM are not shared with other tasks on the system. Host kernel threads will still run on all the cores, so this does not give complete isolation. This configuration ensures that the host does not schedule other tasks on the same physical CPU, and thus lets the qemu thread (and, by that token, the guest) run on that core (almost) exclusively.

As mentioned above, host kernel threads can still be scheduled on the physical CPU. To isolate the host CPUs completely, even from kernel threads, isolcpus can be configured. The qemu threads can then be pinned to the isolated CPUs. This requires grub configuration on the host and prevents the isolated CPUs from running any load other than the load that is explicitly pinned to them.

Most deployments may not need this configuration, as it requires customized workload scheduling on the host system. This information also needs to be propagated to the virtual router/virtual machine, which needs to use the same isolated CPUs. We recommend this mechanism only if the operator has the need, and the systems in place, to configure and manage the host at this level of detail.

Be aware that cpu cores on a socket may not be numbered contiguously. This can be checked with:

grep "physical id" /proc/cpuinfo

For example, on an HP ProLiant DL380 Gen9 with two 12-core E5-2680 v3 CPUs, the cores from different sockets are interleaved like this:

physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1
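
Because the raw listing is hard to read on larger machines, it can help to group the logical CPU numbers by socket before choosing an isolcpus list. The following is a minimal sketch that assumes the usual "processor" and "physical id" fields of /proc/cpuinfo on x86; the function name and the optional file argument are illustrative.

```shell
# Group logical CPU numbers by physical socket, reading /proc/cpuinfo-style input.
cpus_by_socket() {
    awk -F': *' '
        /^processor/   { cpu = $2 }
        /^physical id/ { list[$2] = list[$2] (list[$2] == "" ? "" : ",") cpu }
        END { for (s in list) print "socket " s ": " list[s] }
    ' "${1:-/proc/cpuinfo}" | sort
}
```

Each output line names one socket followed by the comma-separated logical CPUs that belong to it.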

There must always be one core that is not isolated. Commonly this is cpu 0. isolcpus is configured in the grub file (see the "Kernel command line during startup" section). The CPU list can be a combination of ranges and/or comma-separated values, such as isolcpus=1-13 or isolcpus=11,12,13,14 or isolcpus=0-5,12-17.

For example:

GRUB_CMDLINE_LINUX="intel_iommu=on isolcpus=1-13 nohz_full=1-13 hugepagesz=1GB hugepages=64 default_hugepagesz=1GB"

The CONFIG_NO_HZ_FULL Linux kernel build option is used to configure a tickless kernel. The idea is to configure certain processor cores to operate in tickless mode so that these cores do not receive periodic interrupts. These cores run dedicated tasks, and no other tasks are scheduled on them, obviating the need to send a scheduling tick. A CONFIG_HZ based timer interrupt will invalidate the L1 cache on the core, and this can degrade dataplane performance by a few percentage points (to be quantified, but estimated to be 1-3%). Running tickless typically means getting 1 timer interrupt/sec instead of 1000/sec.

Running VPP in a KVM VM

The following configuration tweaks have been used to demonstrate zero-packet-drop forwarding at 98-100% of maximum bandwidth, forwarding 1000-byte IPv4 packets bi-directionally from one 10GigE interface to another, using 2 10GigE ports on an Ixia traffic generator.

Disable Interrupt Balancing (irqbalance)

The irqbalance daemon is enabled by default. It is designed to distribute hardware interrupts across CPUs in a multi-core system in order to increase performance. However, it can cause the CPU running the vpp VM to be stalled, causing dropped Rx packets. When irqbalance is disabled, all interrupts will be handled by cpu0, so the vpp VM (or any other service VM) should NOT run on cpu0.

Disable irqbalance by setting ENABLED="0" in the default configuration file (/etc/default/irqbalance):

#Configuration for the irqbalance daemon

#Should irqbalance be enabled?
ENABLED="0"
#Balance the IRQs only once?
ONESHOT="0"

Man page: http://manpages.ubuntu.com/manpages/precise/man1/irqbalance.1.html
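
The edit to the configuration file can be scripted. This is a minimal sketch, assuming the Ubuntu-style /etc/default/irqbalance file shown above; the function name and the optional path argument are illustrative.

```shell
# Set ENABLED="0" in the irqbalance defaults file, leaving other lines untouched.
# The optional argument (an assumption, for testing) overrides the file path.
disable_irqbalance() {
    conf="${1:-/etc/default/irqbalance}"
    scratch=$(mktemp) || return 1
    # Rewrite through a scratch file so the original keeps its owner and mode.
    sed 's/^ENABLED=.*/ENABLED="0"/' "$conf" > "$scratch" && cat "$scratch" > "$conf"
    rc=$?
    rm -f "$scratch"
    return "$rc"
}
# The running daemon must also be stopped, for example on Ubuntu 14.04:
#   sudo service irqbalance stop
```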

In a VM: Disable Kernel Samepage Merging (KSM)

KSM is a memory-saving de-duplication feature that merges anonymous (private) pages (but not pagecache pages).

While diagnosing the vpp Rx zero packet drop issue, we noticed a correlation between the /sys/kernel/debug/kvm/pf_fixed counter being incremented and the periodic Rx packet drops. We observed that disabling KSM eliminated the incrementing of these counters. KSM is enabled in Ubuntu 14.04 server on the host OS only. It is disabled when Ubuntu 14.04 server is run in a VM.

Disable KSM by writing "0" to /sys/kernel/mm/ksm/run in the host OS:

sudo bash
echo 0 > /sys/kernel/mm/ksm/run
exit

For more information, see: http://www.linux-kvm.org/page/KSM

In a VM: Configure KVM Parameters

In order to run VPP in a VM, the following parameters must be configured on the command line invocation or in the libvirt / virsh xml domain configuration:

-cpu host  :  This parameter causes the VM to inherit the host OS flags.  
Note: libvirt 0.9.11 or greater is required for this to be included in the xml configuration.

-m 8192    :  8 GB of ram is required for optimal zero packet drop rates.  
             TBD: Need to investigate why this is true.  4GB has Rx pkt drops even though there is only 2.2GB allocated!

-smp 2,sockets=1,cores=4,threads=2

To disable PXE boot delays, add the ",rombar=0" option to the end of each "-device" option list or 
add "<rom bar='off'/>" to the device xml configuration.


In a VM: Remove VirtIO Balloon Driver

Use of the VirtIO Balloon driver in the vpp VM causes Rx packet drops when the balloon driver calls mmap().

Remove the VirtIO Balloon Driver from the VM configuration:

If editing the xml configuration, remove the memballoon driver by setting the model='none':

  <memballoon model='none'/>

or delete the device definition from the command line parameter list:

 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

In a VM: Set CPU Affinity and NUMA Memory Policy for the VPP VM threads

CPU Affinity and NUMA Memory Policy can be configured with libvirt.

For more information, see: https://libvirt.org/formatdomain.html#elementsNUMATuning
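
As an illustration, vCPU pinning can also be applied from the command line with virsh vcpupin. The sketch below is a hypothetical wrapper: the function name, the domain name "vpp-vm" and the VIRSH override are assumptions, and the host CPU numbers are placeholders for cores that were isolated on the host.

```shell
# Pin each vCPU of a libvirt domain to one host CPU, in order:
# vCPU 0 to the first CPU listed, vCPU 1 to the second, and so on.
# VIRSH is a hypothetical override; set VIRSH=echo to preview without running virsh.
pin_vcpus() {
    domain="$1"; shift
    vcpu=0
    for host_cpu in "$@"; do
        ${VIRSH:-virsh} vcpupin "$domain" "$vcpu" "$host_cpu" || return 1
        vcpu=$((vcpu + 1))
    done
}
# Example (hypothetical domain name): pin_vcpus vpp-vm 2 3
```

NUMA memory policy can be set in a similar way with virsh numatune; the libvirt documentation linked above describes the available modes.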


Set CPU Affinity for VPP in the VM

In order to prevent the linux scheduler from relocating the vpe application to a different CPU and in order to prevent interrupt handlers from running on the same cpu as vpe, the qn application cpu affinity shall be set to cpu0 and the vpe application cpu affinity set to cpu1.

If occasional packet drops are acceptable (e.g. a few hundred packets / 10s of minutes), this configuration step may be omitted.

Note: Given the KVM -smp options, there is only one NUMA node, thus no need to set NUMA memory affinity in the VM for the vpe application.

A VPP application should configure the correct cpu affinity during application initialization.
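
To confirm that an application ended up with the intended affinity, the kernel's view can be read back from its status file. This is a minimal sketch; the function name and the optional /proc root argument are illustrative, and the taskset command in the comment is the util-linux tool rather than anything VPP provides.

```shell
# Show which CPUs a process may run on, from the Cpus_allowed_list field of its status file.
# The second argument (an assumption, for testing) overrides the /proc root.
show_affinity() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "${2:-/proc}/$1/status"
}
# To pin an already running process to cpu1 with util-linux taskset:
#   taskset -cp 1 <pid>
```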

In a VM: Don't run anything else in the VM!

As noted in the previous section, setting the CPU affinity for the vpe and qn applications in the VM is important to prevent Rx packet drops under the right circumstances. Running other applications (e.g. htop) in the vpp VM may also cause Rx packet drops.

Hyperthreading

When hyperthreading is enabled, each physical CPU core appears as two logical cores. The logical cores share the resources (L1 and L2 cache, registers) of the physical core. This is controlled by a setting in the BIOS.

In general, dataplane performance suffers when hyperthreading is enabled, so the recommendation is to disable it.

Since HT is a BIOS setting and changing it requires a reboot, a deployment will in practice choose one setting and keep it, rather than enabling or disabling HT based on the workload being run on the machine.

If HT is enabled, it is still possible to obtain the same performance as with HT disabled. To do this, isolate the extra logical cores (see CPU isolation) and do not assign any threads to them.
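
To find out which logical cores share a physical core, the kernel's topology files can be consulted. The following is a minimal sketch, assuming the standard sysfs topology layout; the function name and the optional directory argument are illustrative.

```shell
# Print each distinct set of sibling logical cores, one set per line.
# Every line describes one physical core; the kernel may write a set as a
# comma list (for example 0,12) or as a range (for example 0-1).
list_thread_siblings() {
    cpu_dir="${1:-/sys/devices/system/cpu}"
    for f in "$cpu_dir"/cpu[0-9]*/topology/thread_siblings_list; do
        [ -r "$f" ] || continue
        cat "$f"
    done | sort -u
}
```

For each line, one logical core would carry a thread and the others would be isolated and left idle.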

Other

Transparent Hugepages

The transparent hugepage (THP) feature automates the task of creating and managing hugepages. A kernel daemon process (khugepaged) runs in the background and stitches free pages together to form/free hugepages.

We recommend turning this feature off and instead allocating hugepages explicitly (this is not a strong recommendation). It is possible to preallocate hugepages and still have the THP daemon running on the host system.

To turn off THP:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
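
The current mode can be read back from the same sysfs file, which lists all modes and marks the active one with square brackets. This is a minimal sketch; the function name and the optional file argument are illustrative, and the default path assumes the mainline kernel's transparent_hugepage directory.

```shell
# Print the active THP mode, which the kernel marks with square brackets,
# for example "always madvise [never]".
thp_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}
```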

Memory locking / Swap behavior

On a heavily loaded host system, Linux will evict a process's pages to free memory. This can happen to text pages, which are backed by a physical store. If swapping is enabled, data segments can also be swapped out to the swap area on disk when the system is running low on memory. This typically happens when the system is overprovisioned. This is the typical setup on a server, but uncommon on embedded systems. Swapping leads to slow and non-deterministic response times, because accessing a page that is not in memory adds latency.

Our recommendation for running NFV applications is not to overprovision the system and, specifically, to avoid swapping (turn swap off). For deterministic response times, we recommend pinning qemu memory for VPP applications. Pinning/locking qemu memory ensures that the qemu process pages are always memory resident, which provides consistent response times.

The parameter to turn on locking of qemu process memory is: -realtime mlock=on

A few things need to be considered to turn on page locking. The calling process must have the process limits (prlimit) set appropriately to lock the appropriate amount/size of memory. If using virsh to start the virtual router, the process limits of libvirtd must be set appropriately.

To validate that the process memory is locked, check the value of VmLck field in /proc/<pid>/status file. The <pid> needs to be the pid of the qemu process (or pid of any of the qemu threads for the virtual router).
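
The VmLck check can be sketched as a small helper. The function name and the optional /proc root argument are illustrative.

```shell
# Print the amount of locked memory (VmLck) for a process, with its unit.
# The second argument (an assumption, for testing) overrides the /proc root.
locked_memory() {
    awk '/^VmLck:/ { print $2, $3 }' "${2:-/proc}/$1/status"
}
# Example: locked_memory <pid of the qemu process>
```

A value of 0 kB means that none of the process memory is locked.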

KSM

Kernel Same-page Merging (also known as kernel shared memory and memory merging) is a kernel feature that makes it possible for a hypervisor system to share identical memory pages amongst different processes or among multiple virtual machines. While not directly linked, Kernel-based Virtual Machine (KVM) can use KSM to merge memory pages occupied by virtual machines.

KSM is a Linux kernel feature (today qemu is its only client application). KSM consumes non-trivial CPU resources on the host system in trying to optimize memory utilization. Also, KSM attempts to merge pages at periodic intervals (typically 200 ms, but configurable via /sys/kernel/mm/ksm/sleep_millisecs).

We recommend turning this function off when running a single vpp instance.

If there are multiple vpp instances running on a system, turning on this feature will save memory at the expense of some cpu cycles.

To turn off this feature, execute:

echo 0 > /sys/kernel/mm/ksm/run

If it's not practical to turn off ksm, we recommend turning off ksm across numa nodes:

echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
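
To review the current KSM settings in one place, the relevant sysfs entries can be printed together. This is a minimal sketch; the function name and the optional directory argument are illustrative, and entries that do not exist on a given kernel are skipped.

```shell
# Print the KSM settings discussed above, one name=value pair per line.
# The optional argument (an assumption, for testing) overrides the sysfs directory.
ksm_status() {
    ksm_dir="${1:-/sys/kernel/mm/ksm}"
    for name in run sleep_millisecs merge_across_nodes pages_shared; do
        [ -r "$ksm_dir/$name" ] && echo "$name=$(cat "$ksm_dir/$name")"
    done
    return 0
}
```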

In a VM: Pass host CPU flags to the guest

Pass the host CPU configuration to the sunstone virtual router. This is especially important so that the guest can see whether the host CPU supports 1GB huge pages (pdpe1gb flag in /proc/cpuinfo). This is done using the -cpu host flag on the qemu command line.

VPP configuration

Multithreading

In any environment where high throughput performance is a requirement, it is suggested to run VPP in multithreaded mode.

If running in the default single-threaded configuration, the same thread that handles packet forwarding will also perform administrative tasks such as responding to API calls or collecting statistics. These tasks may consume different amounts of time depending on NIC make and model, NIC placement, and the number of NICs configured for use in VPP, which allows external factors to impact forwarding performance. Therefore, even if the required performance target can be achieved by a single CPU core, running VPP in a "one main thread plus one worker thread" configuration will help to alleviate the impact of external factors, and allow the one worker thread to deliver better and more consistent forwarding performance.
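
As a rough illustration of the "one main thread plus one worker thread" layout, the helper below prints a cpu stanza of the kind used in the VPP startup configuration file. The function name is hypothetical, the core numbers are placeholders, and the main-core and corelist-workers keywords should be checked against the startup configuration documentation for the VPP version in use.

```shell
# Print a startup configuration "cpu" stanza: main thread on one core, one worker on another.
# Core numbers are placeholders (assumptions), defaulting to 0 and 1.
vpp_cpu_stanza() {
    main_core="${1:-0}"
    worker_core="${2:-1}"
    cat <<EOF
cpu {
  main-core $main_core
  corelist-workers $worker_core
}
EOF
}
# Example: vpp_cpu_stanza 1 2
```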
