VPP/Missing Prefetches

Introduction

VPP graph nodes make extensive use of explicit prefetching to cover dependent read latency. In the simplest dual-loop case, we prefetch the buffer headers and (typically) one cache line's worth of packet data for the next iteration's packets. The rest of this page shows what happens if we disable that prefetch block.
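
To make the pattern concrete, here is a minimal sketch of a dual-loop body with the usual prefetch block. This is a hypothetical skeleton, not the actual ip4-input source; the loop bounds and housekeeping are simplified, but the prefetch block itself matches the one shown later on this page.

/* Dual loop: process packets 0 and 1, prefetch packets 2 and 3. */
while (n_left_from >= 4 && n_left_to_next >= 2)
  {
    vlib_buffer_t * p0, * p1;
    ip4_header_t * ip0, * ip1;

    /* Prefetch next iteration: buffer headers plus the first cache line
       of packet data, so the next pass's dependent reads hit warm lines. */
    {
      vlib_buffer_t * p2, * p3;

      p2 = vlib_get_buffer (vm, from[2]);
      p3 = vlib_get_buffer (vm, from[3]);

      vlib_prefetch_buffer_header (p2, LOAD);
      vlib_prefetch_buffer_header (p3, LOAD);

      CLIB_PREFETCH (p2->data, sizeof (ip0[0]), LOAD);
      CLIB_PREFETCH (p3->data, sizeof (ip1[0]), LOAD);
    }

    /* This iteration's packets: their headers and first data line were
       prefetched on the previous pass, so these reads should not stall. */
    p0 = vlib_get_buffer (vm, from[0]);
    p1 = vlib_get_buffer (vm, from[1]);

    ip0 = vlib_buffer_get_current (p0);
    ip1 = vlib_buffer_get_current (p1);

    /* ... per-packet work, next-node selection, enqueue ... */

    from += 2;
    n_left_from -= 2;
    n_left_to_next -= 2;
  }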

Baseline

Single-core, 13 MPPS offered load, i40e NICs, ~13 MPPS in+out:

vpp# show run
             Name                 Clocks       Vectors/Call  
FortyGigabitEthernet84/0/1-out         9.08e0           50.09
FortyGigabitEthernet84/0/1-tx          3.84e1           50.09
dpdk-input                             7.45e1           50.09
interface-output                       1.08e1           50.09
ip4-input-no-checksum                  3.92e1           50.09
ip4-lookup                             3.88e1           50.09
ip4-rewrite-transit                    3.43e1           50.09

The key statistic to note here: ip4-input-no-checksum costs roughly 39 clocks per packet (3.92e1 in the Clocks column).

Baseline "perf top" function-level profile:

 14.21%  libvnet.so.0.0.0           [.] ip4_input_no_checksum_avx2
 14.14%  libvnet.so.0.0.0           [.] ip4_lookup_avx2
 14.10%  vpp                        [.] i40e_recv_scattered_pkts_vec
 12.64%  libvnet.so.0.0.0           [.] ip4_rewrite_transit_avx2
 10.60%  libvnet.so.0.0.0           [.] dpdk_input_avx2
  9.70%  vpp                        [.] i40e_xmit_pkts_vec
  4.88%  libvnet.so.0.0.0           [.] dpdk_interface_tx_avx2
  3.67%  libvlib.so.0.0.0           [.] dispatch_node
  3.25%  libvnet.so.0.0.0           [.] vnet_per_buffer_interface_output_avx2
  2.96%  libvnet.so.0.0.0           [.] vnet_interface_output_node_no_flatten
  1.85%  libvlib.so.0.0.0           [.] vlib_put_next_frame
  1.80%  libvlib.so.0.0.0           [.] vlib_get_next_frame_internal
  1.12%  vpp                        [.] rte_delay_us_block

Turn off the dual-loop prefetch block in ip4_input_inline(...)

/* Prefetch next iteration. */
if (0)
  {
     vlib_buffer_t * p2, * p3;

     p2 = vlib_get_buffer (vm, from[2]);
     p3 = vlib_get_buffer (vm, from[3]);

     vlib_prefetch_buffer_header (p2, LOAD);
     vlib_prefetch_buffer_header (p3, LOAD);

     CLIB_PREFETCH (p2->data, sizeof (ip0[0]), LOAD);
     CLIB_PREFETCH (p3->data, sizeof (ip1[0]), LOAD);
  }
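
Both prefetch macros expand to ordinary software prefetch hints. Roughly (this is a paraphrase of their effect, not the exact vppinfra definitions; LOAD selects a read prefetch):

/* Approximate effect of the two macros (paraphrase):
 *
 *   vlib_prefetch_buffer_header (p2, LOAD)
 *     prefetch the cache line holding the vlib_buffer_t metadata
 *     (current_data, sw_if_index, ...)
 *
 *   CLIB_PREFETCH (p2->data, sizeof (ip0[0]), LOAD)
 *     prefetch enough cache lines starting at p2->data to cover an
 *     ip4_header_t, i.e. the first line of packet data
 *
 * Both boil down to compiler prefetch hints along the lines of: */
__builtin_prefetch (p2, 0 /* read */, 3 /* keep in all cache levels */);
__builtin_prefetch (p2->data, 0, 3);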

This is a fairly harsh demonstration, but it clearly shows the "missing prefetch, fix me" signature:

             Name               Clocks       Vectors/Call  
FortyGigabitEthernet84/0/1-out       7.91e0           76.97
FortyGigabitEthernet84/0/1-tx        3.76e1           76.97
dpdk-input                           6.62e1           76.97
interface-output                     9.91e0           76.97
ip4-input-no-checksum                5.53e1           76.97
ip4-lookup                           3.49e1           76.97
ip4-rewrite-transit                  3.32e1           76.97

This single change increases ip4-input-no-checksum from 39 to 55 clocks/pkt; the average vector size also rises from ~50 to ~77 packets per frame, a typical side effect of higher per-packet cost. ip4-input-no-checksum jumps to the top of the "perf top" summary:

 21.47%  libvnet.so.0.0.0               [.] ip4_input_no_checksum_avx2
 13.73%  vpp                            [.] i40e_recv_scattered_pkts_vec
 13.42%  libvnet.so.0.0.0               [.] ip4_lookup_avx2
 12.53%  libvnet.so.0.0.0               [.] ip4_rewrite_transit_avx2

The "perf top" detailed function profile shows a gross stall (32% of the function runtime) at the first use of packet data:

      │       /* Check bounds. */
      │       ASSERT ((signed) b->current_data >= (signed) -VLIB_BUFFER_PRE_DAT
      │       return b->data + b->current_data;
 0.77 │       movswq (%rbx),%rax
      │               p1 = vlib_get_buffer (vm, pi1);
      │
      │               ip0 = vlib_buffer_get_current (p0);
      │               ip1 = vlib_buffer_get_current (p1);
      │
      │               sw_if_index0 = vnet_buffer (p0)->sw_if_index[VLIB_RX];
 0.06 │       mov    0x20(%rbx),%r11d
      │               sw_if_index1 = vnet_buffer (p1)->sw_if_index[VLIB_RX];
 0.20 │       mov    0x20(%rbp),%r10d
 0.03 │       lea    0x100(%rbx,%rax,1),%rdx
 0.80 │       movswq 0x0(%rbp),%rax
      │
      │               arc0 = ip4_address_is_multicast (&ip0->dst_address) ? lm-
 0.23 │       movzbl 0x10(%rdx),%edi
32.64 │       lea    0x100(%rbp,%rax,1),%rax
      │       and    $0xfffffff0,%edi
 0.84 │       cmp    $0xe0,%dil
      │               arc1 = ip4_address_is_multicast (&ip1->dst_address) ? lm-
 0.81 │       movzbl 0x10(%rax),%edi
      │
      │               vnet_buffer (p0)->ip.adj_index[VLIB_RX] = ~0;
 5.32 │       movl   $0xffffffff,0x28(%rbx)
      │               ip1 = vlib_buffer_get_current (p1);
      │
      │               sw_if_index0 = vnet_buffer (p0)->sw_if_index[VLIB_RX];
      │               sw_if_index1 = vnet_buffer (p1)->sw_if_index[VLIB_RX];
      │
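
Mapping the annotation back to the source (a paraphrase of the code lines shown in the listing, not new logic): the movswq instructions read current_data from the buffer headers, the lea computes ip = b->data + b->current_data, and the movzbl instructions are the first touch of packet data, reading dst_address for the multicast test. With the prefetch block disabled, both the buffer-header line and the first packet-data line are cold, so the miss latency piles up on the dependent lea/movzbl sequence around the 32% line:

/* Dependent-read chain behind the stall (paraphrased from the listing): */
ip0 = vlib_buffer_get_current (p0);   /* p0->data + p0->current_data:
                                         header read (movswq), then the
                                         address computation (lea) */
ip1 = vlib_buffer_get_current (p1);

sw_if_index0 = vnet_buffer (p0)->sw_if_index[VLIB_RX];
sw_if_index1 = vnet_buffer (p1)->sw_if_index[VLIB_RX];

/* First touch of packet data: reading dst_address (offset 0x10 into the
   IP header, hence movzbl 0x10(...)) pulls in the cache line the disabled
   CLIB_PREFETCH would have warmed.
 *
 *   arc0 = ip4_address_is_multicast (&ip0->dst_address) ? ... : ...;
 *   arc1 = ip4_address_is_multicast (&ip1->dst_address) ? ... : ...;
 *   (the ternary operands are truncated in the annotation above) */

Removing the if (0) guard restores the prefetches and, with them, the baseline behavior shown at the top of this page.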