VPP/Missing Prefetches
From fd.io
Introduction
VPP graph nodes make extensive use of explicit prefetching to cover dependent read latency. In the simplest dual-loop case, each iteration processes two packets and prefetches the buffer headers and (typically) one cache line of packet data for the two packets that follow. The rest of this page shows what happens if we disable the prefetch block.
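To make the dual-loop pattern concrete, here is a minimal sketch in plain C. It is not VPP source: sketch_buffer_t, process_one and process_dual_loop are hypothetical names invented for illustration, and the GCC/Clang builtin __builtin_prefetch stands in for whatever prefetch helpers a real graph node uses. It assumes a buffer whose header and first line of packet data sit in different cache lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; not VPP's vlib_buffer_t. */
typedef struct {
    uint32_t flags;      /* header metadata read for every packet */
    uint8_t  pad[60];    /* pads the header out to 64 bytes */
    uint8_t  data[64];   /* first cache line of packet data */
} sketch_buffer_t;

/* Per-packet work: reads both the header and the packet data. */
static inline uint32_t process_one(const sketch_buffer_t *b)
{
    return b->flags + b->data[0];
}

/* Dual loop: process packets i and i+1 while the prefetches for
 * packets i+2 and i+3 are in flight, then finish with a single loop. */
uint32_t process_dual_loop(sketch_buffer_t **bufs, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;

    assert(bufs != NULL || n == 0);

    while (n - i >= 4) {
        /* Prefetch block: headers and one line of data for the next pair.
         * Removing these four lines changes timing only, not the result. */
        __builtin_prefetch(bufs[i + 2], 0, 3);
        __builtin_prefetch(bufs[i + 3], 0, 3);
        __builtin_prefetch(bufs[i + 2]->data, 0, 3);
        __builtin_prefetch(bufs[i + 3]->data, 0, 3);

        sum += process_one(bufs[i]);
        sum += process_one(bufs[i + 1]);
        i += 2;
    }

    /* Single loop for the remaining packets (fewer than four left). */
    while (i < n) {
        sum += process_one(bufs[i]);
        i++;
    }
    return sum;
}
```

The prefetch block is purely a performance hint, so deleting it leaves the function's output unchanged. That is what makes the experiment on this page possible: the same node can be measured with and without the block.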
Baseline
Single-core, 13 MPPS offered load, i40e NICs, ~13 MPPS in+out:
vpp# show run
Name                              Clocks    Vectors/Call
FortyGigabitEthernet84/0/1-out    9.08e0    50.09
FortyGigabitEthernet84/0/1-tx     3.84e1    50.09
dpdk-input                        7.45e1    50.09
interface-output                  1.08e1    50.09
ip4-input-no-checksum             3.92e1    50.09
ip4-lookup                        3.88e1    50.09
ip4-rewrite-transit               3.43e1    50.09
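One way to read the table is to add up the Clocks column, on the assumption that each value is the average clock count per packet spent in that node. The sketch below does only that arithmetic; baseline_clocks and total_clocks_per_packet are hypothetical names, and the values are copied from the output above in row order. Under that assumption the baseline path costs about 245 clocks per packet across the seven nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Clocks column from the baseline "show run" output, in row order. */
const double baseline_clocks[7] = {
    9.08,  /* FortyGigabitEthernet84/0/1-out */
    38.4,  /* FortyGigabitEthernet84/0/1-tx  */
    74.5,  /* dpdk-input                     */
    10.8,  /* interface-output               */
    39.2,  /* ip4-input-no-checksum          */
    38.8,  /* ip4-lookup                     */
    34.3   /* ip4-rewrite-transit            */
};

/* Sum per-node clocks to estimate the total cost of one packet. */
double total_clocks_per_packet(const double *clocks, size_t n)
{
    double total = 0.0;

    assert(clocks != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        total += clocks[i];
    return total;
}
```

Calling total_clocks_per_packet(baseline_clocks, 7) gives the baseline figure. Running the same sum over a later "show run" capture gives a single number to compare against it.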
Baseline "perf top" function-level profile:
21.47%  libvnet.so.0.0.0  [.] ip4_input_no_checksum_avx2
13.73%  vpp               [.] i40e_recv_scattered_pkts_vec
13.42%  libvnet.so.0.0.0  [.] ip4_lookup_avx2
12.53%  libvnet.so.0.0.0  [.] ip4_rewrite_transit_avx2
 9.88%  vpp               [.] i40e_xmit_pkts_vec
 9.44%  libvnet.so.0.0.0  [.] dpdk_input_avx2
 4.85%  libvnet.so.0.0.0  [.] dpdk_interface_tx_avx2
 3.25%  libvnet.so.0.0.0  [.] vnet_per_buffer_interface_output_a
 2.68%  libvnet.so.0.0.0  [.] vnet_interface_output_node_no_flat
 2.26%  libvlib.so.0.0.0  [.] dispatch_node
 1.30%  libvlib.so.0.0.0  [.] vlib_put_next_frame
 1.04%  vpp               [.] rte_delay_us_block