VPP/Missing Prefetches

From fd.io
Revision as of 16:53, 30 November 2016

Introduction

VPP graph nodes make extensive use of explicit prefetching to cover dependent read latency. In the simplest dual-loop case, we prefetch the next iteration's buffer headers and (typically) one cache line's worth of packet data. The rest of this page shows what happens if we disable the prefetch block.
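The dual-loop pattern described above can be sketched in standalone C. This is not the vlib code itself: buf_t, process(), and the use of __builtin_prefetch in place of vlib_prefetch_buffer_header/CLIB_PREFETCH are illustrative stand-ins, kept self-contained so the shape of the loop is clear.

```c
#include <stdint.h>

/* Toy stand-in for vlib_buffer_t: a small header plus packet data. */
typedef struct
{
  uint32_t flags;
  uint8_t data[64];
} buf_t;

/* Dual loop: handle two buffers per iteration, prefetching the pair
   two slots ahead so their headers and first data line are (ideally)
   already in cache by the time the loop reaches them. */
static uint32_t
process (buf_t **from, int n_left)
{
  uint32_t sum = 0;

  while (n_left >= 4)
    {
      /* Prefetch next iteration: headers first, then one line of data. */
      __builtin_prefetch (from[2]);
      __builtin_prefetch (from[3]);
      __builtin_prefetch (from[2]->data);
      __builtin_prefetch (from[3]->data);

      /* "Work" on the current pair. */
      sum += from[0]->data[0];
      sum += from[1]->data[0];

      from += 2;
      n_left -= 2;
    }

  /* Single loop mops up the remainder; no prefetch needed here. */
  while (n_left > 0)
    {
      sum += from[0]->data[0];
      from += 1;
      n_left -= 1;
    }
  return sum;
}
```

In the real ip4_input_inline the prefetch distance, the LOAD hint, and the span of data prefetched are tuned to the node's access pattern; the experiment below simply compiles that block out.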

Baseline

Single-core, 13 MPPS offered load, i40e NICs, ~13 MPPS in+out:

vpp# show run
             Name                 Clocks       Vectors/Call  
FortyGigabitEthernet84/0/1-out         9.08e0           50.09
FortyGigabitEthernet84/0/1-tx          3.84e1           50.09
dpdk-input                             7.45e1           50.09
interface-output                       1.08e1           50.09
ip4-input-no-checksum                  3.92e1           50.09
ip4-lookup                             3.88e1           50.09
ip4-rewrite-transit                    3.43e1           50.09

Baseline "perf top" function-level profile:

 14.21%  libvnet.so.0.0.0           [.] ip4_input_no_checksum_avx2
 14.14%  libvnet.so.0.0.0           [.] ip4_lookup_avx2
 14.10%  vpp                        [.] i40e_recv_scattered_pkts_vec
 12.64%  libvnet.so.0.0.0           [.] ip4_rewrite_transit_avx2
 10.60%  libvnet.so.0.0.0           [.] dpdk_input_avx2
  9.70%  vpp                        [.] i40e_xmit_pkts_vec
  4.88%  libvnet.so.0.0.0           [.] dpdk_interface_tx_avx2
  3.67%  libvlib.so.0.0.0           [.] dispatch_node
  3.25%  libvnet.so.0.0.0           [.] vnet_per_buffer_interface_output_avx2
  2.96%  libvnet.so.0.0.0           [.] vnet_interface_output_node_no_flatten
  1.85%  libvlib.so.0.0.0           [.] vlib_put_next_frame
  1.80%  libvlib.so.0.0.0           [.] vlib_get_next_frame_internal
  1.12%  vpp                        [.] rte_delay_us_block

Turn off the dual-loop prefetch block in ip4_input_inline(...)

/* Prefetch next iteration. */
if (0)  /* prefetch block disabled for this experiment */
  {
    vlib_buffer_t * p2, * p3;

    p2 = vlib_get_buffer (vm, from[2]);
    p3 = vlib_get_buffer (vm, from[3]);

    vlib_prefetch_buffer_header (p2, LOAD);
    vlib_prefetch_buffer_header (p3, LOAD);

    /* ip0/ip1 are the ip4_header_t pointers for the current pair;
       this pulls in one header's worth of packet data for the next pair. */
    CLIB_PREFETCH (p2->data, sizeof (ip0[0]), LOAD);
    CLIB_PREFETCH (p3->data, sizeof (ip1[0]), LOAD);
  }

This is a fairly harsh demonstration, but it clearly shows the "missing prefetch, fix me" signature: ip4-input-no-checksum jumps from 3.92e1 to 5.53e1 clocks, and the average vector size rises from ~50 to ~77 vectors/call as the thread falls behind the offered load:

             Name               Clocks       Vectors/Call  
FortyGigabitEthernet84/0/1-out       7.91e0           76.97
FortyGigabitEthernet84/0/1-tx        3.76e1           76.97
dpdk-input                           6.62e1           76.97
interface-output                     9.91e0           76.97
ip4-input-no-checksum                5.53e1           76.97
ip4-lookup                           3.49e1           76.97
ip4-rewrite-transit                  3.32e1           76.97