VPP/Missing Prefetches
Introduction
VPP graph nodes make extensive use of explicit prefetching to cover dependent read latency. In the simplest dual-loop case, we prefetch the buffer headers and (typically) one cache line's worth of packet data for the next iteration's buffers. The rest of this page shows what happens when the prefetch block is disabled.
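To make the pattern concrete, here is a schematic sketch of the dual-loop shape. It is not the literal ip4-input source; the surrounding loop and variable names (n_left_from, n_left_to_next) follow the usual VPP node shape rather than being copied from the node discussed below. While packets 0 and 1 are processed, the headers and first cache line of packets 2 and 3 are prefetched, so they are (usually) cache-resident by the next iteration.

  while (n_left_from >= 4 && n_left_to_next >= 2)
    {
      vlib_buffer_t *p0, *p1;
      ip4_header_t *ip0, *ip1;

      /* Prefetch the NEXT iteration's buffers (indices 2 and 3). */
      {
        vlib_buffer_t *p2, *p3;

        p2 = vlib_get_buffer (vm, from[2]);
        p3 = vlib_get_buffer (vm, from[3]);

        /* Buffer headers first... */
        vlib_prefetch_buffer_header (p2, LOAD);
        vlib_prefetch_buffer_header (p3, LOAD);

        /* ...then one cache line's worth of packet data. */
        CLIB_PREFETCH (p2->data, CLIB_CACHE_LINE_BYTES, LOAD);
        CLIB_PREFETCH (p3->data, CLIB_CACHE_LINE_BYTES, LOAD);
      }

      /* Meanwhile, work on packets 0 and 1, which were prefetched
         during the previous iteration and are (usually) already in cache. */
      p0 = vlib_get_buffer (vm, from[0]);
      p1 = vlib_get_buffer (vm, from[1]);

      ip0 = vlib_buffer_get_current (p0);
      ip1 = vlib_buffer_get_current (p1);

      /* ... per-packet work on ip0 / ip1, next-node enqueue omitted ... */

      from += 2;
      n_left_from -= 2;
      n_left_to_next -= 2;
    }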
Baseline
Single-core, 13 MPPS offered load, i40e NICs, ~13 MPPS in+out:
vpp# show run
 Name                              Clocks      Vectors/Call
 FortyGigabitEthernet84/0/1-out    9.08e0      50.09
 FortyGigabitEthernet84/0/1-tx     3.84e1      50.09
 dpdk-input                        7.45e1      50.09
 interface-output                  1.08e1      50.09
 ip4-input-no-checksum             3.92e1      50.09
 ip4-lookup                        3.88e1      50.09
 ip4-rewrite-transit               3.43e1      50.09
The key statistic to note here: ip4-input-no-checksum costs ~39 clocks per packet.
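As a rough sanity check (arithmetic derived from the table above, assuming the Clocks column is clocks per packet, not part of the original measurement): the per-node costs sum to about 9.1 + 38.4 + 74.5 + 10.8 + 39.2 + 38.8 + 34.3 ≈ 245 clocks per packet, so forwarding ~13 MPPS on one core consumes roughly 245 × 13e6 ≈ 3.2e9 clocks per second, i.e. essentially the full budget of a ~3.2 GHz core.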
Baseline "perf top" function-level profile:
 14.21%  libvnet.so.0.0.0  [.] ip4_input_no_checksum_avx2
 14.14%  libvnet.so.0.0.0  [.] ip4_lookup_avx2
 14.10%  vpp               [.] i40e_recv_scattered_pkts_vec
 12.64%  libvnet.so.0.0.0  [.] ip4_rewrite_transit_avx2
 10.60%  libvnet.so.0.0.0  [.] dpdk_input_avx2
  9.70%  vpp               [.] i40e_xmit_pkts_vec
  4.88%  libvnet.so.0.0.0  [.] dpdk_interface_tx_avx2
  3.67%  libvlib.so.0.0.0  [.] dispatch_node
  3.25%  libvnet.so.0.0.0  [.] vnet_per_buffer_interface_output_avx2
  2.96%  libvnet.so.0.0.0  [.] vnet_interface_output_node_no_flatten
  1.85%  libvlib.so.0.0.0  [.] vlib_put_next_frame
  1.80%  libvlib.so.0.0.0  [.] vlib_get_next_frame_internal
  1.12%  vpp               [.] rte_delay_us_block
Turn off the dual-loop prefetch block in ip4_input_inline(...)
  /* Prefetch next iteration. */
  if (0)
    {
      vlib_buffer_t * p2, * p3;

      p2 = vlib_get_buffer (vm, from[2]);
      p3 = vlib_get_buffer (vm, from[3]);

      vlib_prefetch_buffer_header (p2, LOAD);
      vlib_prefetch_buffer_header (p3, LOAD);

      CLIB_PREFETCH (p2->data, sizeof (ip0[0]), LOAD);
      CLIB_PREFETCH (p3->data, sizeof (ip1[0]), LOAD);
    }
This is a fairly harsh demonstration, but it clearly shows the "missing prefetch, fix me" signature:
 Name                              Clocks      Vectors/Call
 FortyGigabitEthernet84/0/1-out    7.91e0      76.97
 FortyGigabitEthernet84/0/1-tx     3.76e1      76.97
 dpdk-input                        6.62e1      76.97
 interface-output                  9.91e0      76.97
 ip4-input-no-checksum             5.53e1      76.97
 ip4-lookup                        3.49e1      76.97
 ip4-rewrite-transit               3.32e1      76.97
This single change increases the cost of ip4-input-no-checksum to 55 clocks/pkt (from 39 clocks/pkt), and the node jumps to the top of the "perf top" summary:
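In relative terms (arithmetic from the two tables, not part of the original text), that is (55 − 39) / 39 ≈ 41% more clocks spent in this one node. Note also that Vectors/Call grows from ~50 to ~77: with the node running slower, more packets accumulate per dispatch, which is itself a typical symptom of a hot spot.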
 21.47%  libvnet.so.0.0.0  [.] ip4_input_no_checksum_avx2
 13.73%  vpp               [.] i40e_recv_scattered_pkts_vec
 13.42%  libvnet.so.0.0.0  [.] ip4_lookup_avx2
 12.53%  libvnet.so.0.0.0  [.] ip4_rewrite_transit_avx2
The "perf top" detailed function profile shows a gross stall (32% of the function runtime) at the first use of packet data:
       │        /* Check bounds. */
       │        ASSERT ((signed) b->current_data >= (signed) -VLIB_BUFFER_PRE_DAT
       │        return b->data + b->current_data;
  0.77 │      movswq (%rbx),%rax
       │        p1 = vlib_get_buffer (vm, pi1);
       │
       │        ip0 = vlib_buffer_get_current (p0);
       │        ip1 = vlib_buffer_get_current (p1);
       │
       │        sw_if_index0 = vnet_buffer (p0)->sw_if_index[VLIB_RX];
  0.06 │      mov    0x20(%rbx),%r11d
       │        sw_if_index1 = vnet_buffer (p1)->sw_if_index[VLIB_RX];
  0.20 │      mov    0x20(%rbp),%r10d
  0.03 │      lea    0x100(%rbx,%rax,1),%rdx
  0.80 │      movswq 0x0(%rbp),%rax
       │
       │        arc0 = ip4_address_is_multicast (&ip0->dst_address) ? lm-
  0.23 │      movzbl 0x10(%rdx),%edi
 32.64 │      lea    0x100(%rbp,%rax,1),%rax
       │      and    $0xfffffff0,%edi
  0.84 │      cmp    $0xe0,%dil
       │        arc1 = ip4_address_is_multicast (&ip1->dst_address) ? lm-
  0.81 │      movzbl 0x10(%rax),%edi
       │
       │        vnet_buffer (p0)->ip.adj_index[VLIB_RX] = ~0;
  5.32 │      movl   $0xffffffff,0x28(%rbx)
       │        ip1 = vlib_buffer_get_current (p1);
       │
       │        sw_if_index0 = vnet_buffer (p0)->sw_if_index[VLIB_RX];
       │        sw_if_index1 = vnet_buffer (p1)->sw_if_index[VLIB_RX];
       │
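As a cross-check (arithmetic from the numbers above, not part of the original text): 32% of the ~55 clocks/pkt now charged to ip4-input-no-checksum is roughly 0.32 × 55 ≈ 18 clocks/pkt concentrated around that single stalled instruction, which accounts for essentially all of the ~16 clock/pkt regression measured above.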