Using VPP as a VXLAN Tunnel Terminator

This page describes the support in the VPP platform for a Virtual eXtensible LAN (VXLAN).

Introduction

A VXLAN provides the features needed to allow L2 bridge domains (BDs) to span multiple servers. This is done by building an L2 overlay on top of an L3 network underlay using VXLAN tunnels.

This makes it possible for servers to be co-located in the same data center or be separated geographically as long as they are reachable through the underlay L3 network.

This kind of L2 overlay bridge domain can be referred to as a VXLAN segment.


Features

This implementation of support for VXLAN in the VPP engine includes the following features:

  • Makes use of the existing VPP L2 bridging and cross-connect functionality.
  • Allows creation of VXLAN tunnels per RFC 7348 to extend an L2 network over an L3 underlay.
  • Provides unicast mode, where packet replication is done at the head end toward remote VTEPs.
  • Supports Split Horizon Group (SHG) numbering in packet replication.
  • Supports interoperation with a Bridge Virtual Interface (BVI) to allow inter-VXLAN or inter-VLAN packet forwarding via routing.
  • Supports a VXLAN to VLAN gateway.
  • Supports ARP request termination.

VXLAN as defined in RFC 7348 allows VPP bridge domains on multiple servers to be interconnected via VXLAN tunnels so that they behave as a single bridge domain, which can also be called a VXLAN segment. Thus, VMs on multiple servers connected to this bridge domain (or VXLAN segment) can communicate with each other via layer 2 networking.

The functions supported by VPP for VXLAN are described in the following sub-sections.


VXLAN Tunnel Encap and Decap

VXLAN Headers

The VXLAN tunnel encap includes IP, UDP and VXLAN headers as follows:

   Outer IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |Protocl=17(UDP)|   Header Checksum             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Outer Source IPv4 Address               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Outer Destination IPv4 Address              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Outer UDP Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Source Port         |       Dest Port = VXLAN Port  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           UDP Length          |        UDP Checksum           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   VXLAN Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|R|R|R|I|R|R|R|            Reserved                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                VXLAN Network Identifier (VNI) |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Inner Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             Inner Destination MAC Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Inner Destination MAC Address | Inner Source MAC Address      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Inner Source MAC Address                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |OptnlEthtype = C-Tag 802.1Q    | Inner.VLAN Tag Information    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Payload:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Ethertype of Original Payload |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                  Original Ethernet Payload    |
   |                                                               |
   |(Note that the original Ethernet Frame's FCS is not included)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


The headers are used as follows:

  • In the IP header, the DIP is the destination server VTEP and the SIP is the local server VTEP. After VXLAN header encap, the DIP is used by the encap server to forward the packet to the destination server via L3 routing, assuming routes to all destination VTEP IPs are already set up correctly.
  • At the destination, the DIP is local and the UDP destination port number lets the server identify the packet as a VXLAN packet and start VXLAN decap processing.
  • VXLAN decap processing uses the VNI from the VXLAN header to determine how to perform L2 forwarding of the inner Ethernet frame. The VNI is typically associated with a BD in the server where the inner frame should be forwarded.


UDP Port Numbering for VXLAN Header

The UDP destination port number in the VXLAN encap must be 4789, as assigned by IANA for VXLAN. The UDP source port value in the VXLAN encap should be set according to a flow hash of the payload Ethernet frame to help with ECMP load balancing as the packet is forwarded in the underlay network. The hash is a normal 5-tuple hash for IPv4/IPv6 packets and a 3-tuple SMAC/DMAC/Ethertype hash for other packet types.


VTEPs and VXLAN Tunnel Creation

Create VXLAN Tunnel with VTEPs

VTEPs (VXLAN Tunnel End Points) are specified via VXLAN tunnel creation – the source and destination IP addresses of each VXLAN tunnel are the local server VTEP address and the destination server VTEP address. The VNI value used for the VXLAN tunnel is also specified on VXLAN tunnel creation. Once a VXLAN tunnel is created, it behaves like a VPP interface and is not yet associated with any BD.
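
For illustration, a minimal sketch of tunnel creation from the VPP debug CLI, assuming a local VTEP of 10.0.3.1, a remote VTEP of 10.0.3.3 and VNI 13 (all placeholder values; the exact CLI options may vary between VPP versions):

    create vxlan tunnel src 10.0.3.1 dst 10.0.3.3 vni 13
    show vxlan tunnel

The new tunnel shows up as an interface (e.g. vxlan_tunnel0) which is not yet part of any BD.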

Associate VXLAN Tunnel with BD

Once a VXLAN tunnel interface is created, it can be added to a bridge domain (BD) as a bridge port by specifying its BDID, just as a local Ethernet interface can be added to a BD. As a VXLAN tunnel is added to a BD, the VNI used for creating the VXLAN tunnel becomes mapped to the BDID. It is good practice to allocate the same value for both the VNI and the BDID for all VXLAN tunnels on the same BD or VXLAN segment across all servers to prevent confusion.
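
Continuing the sketch above, the tunnel interface (assumed here to be named vxlan_tunnel0) might be added to BD 13 as follows, with the BDID deliberately chosen to match the VNI:

    set interface l2 bridge vxlan_tunnel0 13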

Connecting VXLAN Tunnels among Multiple Servers

To set up a VXLAN segment or BD over multiple servers, it is recommended that a VPP BD with the same BDID be created on each server and that a full mesh of VXLAN tunnels among all servers be created to link up this BD on each server. In other words, on each server with this BD, a VXLAN tunnel with its VNI set to the same value as the BDID should preferably be created for each of the other servers and added to the BD. Making all BDIDs and VNIs the same value makes VXLAN segment connectivity much more apparent and less confusing.
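
As a hedged example, on a server with VTEP 10.0.3.1 that shares BD 13 with two other servers (VTEPs 10.0.3.2 and 10.0.3.3, all placeholder addresses), the full mesh from this server's point of view might be created as:

    create vxlan tunnel src 10.0.3.1 dst 10.0.3.2 vni 13
    create vxlan tunnel src 10.0.3.1 dst 10.0.3.3 vni 13

Each resulting tunnel interface is then added to BD 13 as shown in the previous section, with a non-zero split horizon group as discussed below.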


VXLAN Flooding

VXLAN Unicast Mode with Head-End Replication

As a VXLAN tunnel is just a bridge port, the normal packet replication to all bridge ports in a BD happens naturally. This behavior matches VXLAN unicast mode operation, where head-end packet replication is used to flood packets to remote VTEPs. VXLAN multicast mode, which uses IP multicast for flooding to remote VTEPs, is NOT supported.

Split Horizon Group

As VXLAN tunnels are added to a BD, they must all be configured with the same non-zero Split Horizon Group (SHG) number. Otherwise, flooded packets may loop among servers on the same VXLAN segment because the VXLAN tunnels are fully meshed among the servers.
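
A minimal sketch of adding the two mesh tunnels from the earlier example to BD 13 with SHG 1 (the trailing number on each command is the SHG; the interface names are assumptions):

    set interface l2 bridge vxlan_tunnel0 13 1
    set interface l2 bridge vxlan_tunnel1 13 1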


VXLAN over IPv6

VXLAN tunnel over an IPv6 underlay network is not supported in this implementation. Currently, only IPv4 underlay is supported.

The payload of a VXLAN tunnel is an Ethernet frame, which may contain IPv4, IPv6 or other protocol packets.


VXLAN Tunnel Input and Output with Stats

The VXLAN tunneling implementation supports full interface TX/RX stats with packet and byte counters as follows:

  • Packets received from a VXLAN tunnel interface have their IP/UDP/VXLAN headers removed (decap) before being L2 forwarded, so the packet size shown in the RX stats is the size after decap, excluding the IP/UDP/VXLAN headers.
  • Packets output to a VXLAN tunnel interface have the IP/UDP/VXLAN headers added (encap) before being L3 forwarded, so the packet size shown in the TX stats is the size after encap, including the IP/UDP/VXLAN headers.
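
These per-tunnel counters can be inspected with the usual interface show command, for example (tunnel interface name assumed):

    show interface vxlan_tunnel0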

Most features that can be applied to an L2 interface, such as VLAN tag manipulation or input ACLs, can also be applied to a VXLAN tunnel interface.

VXLAN to VLAN Gateway Support

The VXLAN to VLAN gateway function can be performed in VPP via the bridge port VLAN tag manipulation function. If an Ethernet port is connected to a VLAN and this port is in promiscuous mode, one can create a sub-interface on this port for the VLAN and add this sub-interface to the VXLAN segment or BD with the proper VLAN tag pop/push/rewrite operation to perform the VXLAN to VLAN gateway function.

For the case where no other ports are on the VXLAN BD except the VXLAN tunnel and the VLAN sub-interface, one can simply cross connect the VXLAN tunnel to the VLAN sub-interface with the proper VLAN tag manipulation to obtain an extremely efficient VXLAN to VLAN gateway.
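
A hedged sketch of the bridged gateway variant, assuming VLAN 100 on port GigabitEthernet2/2/0 and BD 13 (names, IDs and the exact tag-rewrite options are placeholders that may vary by VPP version):

    create sub-interfaces GigabitEthernet2/2/0 100
    set interface l2 tag-rewrite GigabitEthernet2/2/0.100 pop 1
    set interface l2 bridge GigabitEthernet2/2/0.100 13

For the two-port case described above, the last command would instead be replaced by a pair of cross connects between GigabitEthernet2/2/0.100 and the VXLAN tunnel, one per direction, as shown in the Cross Connect section below.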

BVI Support

A bridge domain (BD) can have one L3 Bridge Virtual Interface (BVI) together with multiple VXLAN tunnel and Ethernet bridge ports. All three types of bridge ports (BVI, VXLAN and Ethernet) must interoperate properly to forward traffic to each other. BVIs allow VMs on separate BDs or VXLAN segments to reach each other via IRB (Integrated Routing and Bridging).

The BVI for a BD is set up by creating a loopback interface and then adding it to the BD as its BVI interface. A BD in a VPP can have only one BVI. A VXLAN segment, however, consists of multiple BDs from multiple VPPs and may therefore have multiple BVIs, with one BVI on the BD of each VPP.
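
A minimal sketch of BVI setup for BD 13, assuming the loopback CLI described later on this page and placeholder names and addresses (the created interface is assumed to be named loop0):

    loopback create-interface mac 1a:2b:3c:00:00:01
    set interface l2 bridge loop0 13 bvi
    set interface ip address loop0 192.168.13.1/24
    set interface state loop0 up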


ARP Request Termination

With VXLAN BD MACs provisioned statically and MAC learning disabled, VPP can perform L2 unicast forwarding very efficiently since unknown unicast flooding is not necessary. With IP4 unicast traffic, however, there is still one source of broadcast traffic due to ARP requests from tenant VMs to find MACs for IP addresses.

In order to minimize flooding of ARP requests to the whole VXLAN segment, VPP allows the control plane to provision any VPP BD with IP addresses and their associated MACs. Thereafter, VPP can use the IP and MAC information it has for the BD to terminate ARP requests and generate the appropriate ARP responses to the requesting tenant VMs. If no suitable IP/MAC information is found, the ARP request cannot be terminated and will still be flooded to the VXLAN segment.
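
As a sketch, ARP termination might be enabled on BD 13 and an IP/MAC binding provisioned for a tenant VM as follows (the addresses are placeholders):

    set bridge-domain arp term 13
    set bridge-domain arp entry 13 192.168.13.2 00:11:22:33:44:55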


Cross Connect instead of Bridging

As an optimization, the VXLAN tunnel interface can also be cross connected to an L2 interface if there are only two bridge ports in a BD, namely the VXLAN tunnel and an Ethernet port. The cross connect optimization can improve packet forwarding performance because the bridging overhead of MAC learning and lookup is avoided.
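
Since a cross connect is configured per receive interface, both directions need to be set. A hedged sketch with placeholder interface names:

    set interface l2 xconnect GigabitEthernet2/2/0 vxlan_tunnel0
    set interface l2 xconnect vxlan_tunnel0 GigabitEthernet2/2/0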

Instead of using cross connect, it is still possible to optimize BD forwarding with 2 bridge ports if both MAC learning and MAC lookup are disabled. The ability to disable learning is supported in VPP while MAC lookup cannot be disabled at present.

Another usage can be to cross connect two VXLAN tunnels to perform VNI stitching.


Architecture

The following sections explain the design approach behind the various aspects of the VPP VXLAN tunneling implementation.


VXLAN Tunnels

In order to support VXLAN on VPP, the design is to provide the ability to create VXLAN tunnel interfaces which can be added to a bridge domain as bridge ports to participate in L2 forwarding. Thus, the existing VPP L2 bridging functionality naturally interacts with VXLAN tunnels as normal bridge ports, performing learning, flooding and unicast forwarding transparently.

It is the VXLAN tunnel code which handles the encap and decap of forwarded packets as follows:

  • OUTPUT – as a packet is L2 forwarded in a BD to a VXLAN tunnel, the L2 output for that packet naturally sends it to the VXLAN output node vxlan-encap. This node looks up the tunnel control block of the VXLAN tunnel to find the encap string and the VRF for IP forwarding. The node then puts the encap string on the packet, sets up the VRF in the packet context and sends the packet to the ip4-lookup node to perform forwarding.
  • INPUT – on receiving a VXLAN encap packet with a DIP that is a local IP (in fact, VPP’s VTEP IP) address, the packet gets to the ip4-local node for processing. The ip4-local node then sends the packet to the vxlan-input node because the UDP destination port number is the one for VXLAN (4789). The vxlan-input node uses the SIP and VNI of the outer header to look up the VXLAN tunnel control block and obtain its sw_if_index, removes the VXLAN header from the packet, sets the sw_if_index of the VXLAN tunnel as the input sw_if_index, and finally passes the packet to the l2-input node for L2 forwarding.

Note that the vxlan-input node does not map the VNI to a BDID but rather to the sw_if_index of the VXLAN tunnel, which is set as the input interface of the decap’ed Ethernet payload. Thereafter, the L2 forwarding of the Ethernet packet proceeds according to whether it is in a BD or cross connected to another L2 interface. This flexibility even allows two VXLAN tunnels to be cross connected to each other for VNI stitching, which may be useful in the future.

The following two diagrams show the relationship of the newly added VXLAN encap node and input(decap) node with the other VPP nodes while forwarding packets, assuming the VXLAN tunnel is configured as an interface on a BD.

Diagram of VPP graph node path to VXLAN encap

VPP graph node path for VXLAN decap


ARP Request Termination

The ARP termination feature is added to the L2 feature list, with a bit allocated in the feature bitmap corresponding to a new graph node, arp-term-l2bd. The arp-term-l2bd node is added to the ARP handling module of VPP to process ARP request packets in the L2 forwarding path. The feature bit is chosen so that arp-term-l2bd is called just before the l2-flood node.

A new API/CLI that allows the user to provide IP and MAC address bindings for a specified BD is also added to the L2 BD handling module of VPP. Each MAC address is stored in a hash table mac_by_ip4 for the BD, using the IP address as the key.

The classify-and-dispatch path of the l2-input node of VPP is enhanced to decide whether the ARP termination bit should be enabled for the incoming packet. This check is done in the broadcast MAC classification path only, so normal unicast and multicast packet performance is not affected. If a packet is a broadcast ARP packet and ARP termination is enabled for the BD, the ARP termination bit in the feature bitmap is set for this packet. Thereafter, the packet goes through L2 feature processing in its normal order and hits the ARP termination node. For other packets, or if ARP termination is not enabled in the BD, the ARP termination bit of the packet is cleared so no extra processing overhead is added.

The new graph node arp-term-l2bd looks up mac_by_ip4 using the ARP request IP as the key. If a MAC entry is found, an ARP reply is sent back on the input interface. If no entry is found, the ARP request packet is flooded following the normal L2 broadcast forwarding path.


Loopback Interface MAC

The loopback interface creation CLI/API is enhanced to let the user specify a MAC address on creation. A new API is also added to allow deletion of a loopback interface. Thus, a loopback interface can be configured as the BVI for a BD with the desired MAC address. If the VPP-assigned MAC address is used for the loopback/BVI, there is a potential for MAC address conflicts when VXLAN tunnels are used to connect multiple BDs, each with its own BVI, among servers.


Software Modules

The VXLAN tunnel module follows the most up-to-date approach of providing tunnels on VPP, with a C source file for each of the encap, decap and creation/deletion functions in the VPP workspace directory.

open-vpp/vnet/vnet/vxlan:

  • encap.c – implements the vxlan-encap node to process packets output to a VXLAN tunnel.
  • decap.c – implements the vxlan-input node to process packets received on a VXLAN tunnel.
  • vxlan.c – implements the VXLAN tunnel create/delete functions to create or delete VXLAN tunnel interfaces and their associated data structures. The debug CLI to create and delete VXLAN tunnels is also provided by this file.


Other header and miscellaneous files present in this directory are as follows:

  • vxlan.h – VXLAN tunnel related data structures, function prototypes and inline functions.
  • vxlan_packet.h – VXLAN tunnel header related definitions.
  • vxlan_error.def – text for VXLAN tunnel related global counters.

The files modified for ARP termination and MAC/IP binding notification are as follows:

  • open-vpp/vnet/vnet/l2/l2_input.c/h
  • open-vpp/vnet/vnet/l2/l2_bd.c/h
  • open-vpp/vnet/vnet/ethernet/arp.c/h


The files modified to allow MAC address to be specified for loopback interface are as follows:

  • open-vpp/vnet/vnet/ethernet/ethernet.c/h
  • open-vpp/vnet/vnet/ethernet/interface.c



API

Bridge Domain Creation

The following is the VPP API message definition for creating and deleting bridge domains:

typedef VL_API_PACKED(struct _vl_api_bridge_domain_add_del {
    u16 _vl_msg_id;
    u32 client_index;
    u32 context;
    u32 bd_id;
    u8 flood;
    u8 uu_flood;
    u8 forward;
    u8 learn;
    u8 arp_term;
    u8 is_add;
}) vl_api_bridge_domain_add_del_t;

typedef VL_API_PACKED(struct _vl_api_bridge_domain_add_del_reply {
    u16 _vl_msg_id;
    u32 context;
    i32 retval;
}) vl_api_bridge_domain_add_del_reply_t;


In the bridge domain create/delete message, the fields flood, uu_flood, forward, learn and arp_term specify whether each of the corresponding features is enabled, with 0 indicating disabled and non-zero indicating enabled. The field is_add is set to 1 to add or modify a new or existing bridge domain, and set to 0 to delete the bridge domain specified by the bd_id value.
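
Each of these per-BD features also has a debug CLI counterpart; for example, assuming a bridge domain 13 already exists, the flags might be toggled roughly as follows (exact commands may vary by VPP version):

    set bridge-domain learn 13 disable
    set bridge-domain uu-flood 13 disable
    set bridge-domain flood 13
    set bridge-domain arp term 13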


VXLAN Tunnel Creation

The following is the VPP API message definition for creating and deleting VXLAN tunnels:

typedef VL_API_PACKED(struct _vl_api_vxlan_add_del_tunnel {
    u16 _vl_msg_id;
    u32 client_index;
    u32 context;
    u8 is_add;
    u32 src_address;
    u32 dst_address;
    u32 encap_vrf_id;
    u32 decap_next_index;
    u32 vni;
}) vl_api_vxlan_add_del_tunnel_t;

typedef VL_API_PACKED(struct _vl_api_vxlan_add_del_tunnel_reply {
    u16 _vl_msg_id;
    u32 context;
    i32 retval;
    u32 sw_if_index;
}) vl_api_vxlan_add_del_tunnel_reply_t;


In the tunnel create/delete message, the field decap_next_index should be set to the next-node index for the l2-input node. The next index defaults to that of the l2-input node if the value specified for decap_next_index in the message is ~0 (0xffffffff). The field decap_next_index can also be set to the index of the ip4-input or ip6-input node if the payload is an IPv4/IPv6 packet without an L2 header, but this is not the normal usage of a VXLAN tunnel.

In the reply message, the sw_if_index of the created VXLAN tunnel is returned, which can then be used to add the tunnel to a BD or cross connect it to another L2 interface.
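
For reference, the debug CLI provided by vxlan.c exposes roughly the same parameters; a hedged sketch of creating and then deleting a tunnel (placeholder addresses and IDs, option names may vary by version):

    create vxlan tunnel src 10.0.3.1 dst 10.0.3.3 vni 13 encap-vrf-id 0 decap-next l2
    create vxlan tunnel src 10.0.3.1 dst 10.0.3.3 vni 13 del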


ARP Termination

The following set bridge flags API message is enhanced to allow ARP termination to be enabled or disabled on a BD, with a new bit allocated:

typedef VL_API_PACKED(struct _vl_api_bridge_flags {
    u16 _vl_msg_id;
    u32 client_index;
    u32 context;
    u32 bd_id;
    u8  is_set;
    u32 feature_bitmap;
}) vl_api_bridge_flags_t;

typedef VL_API_PACKED(struct _vl_api_bridge_flags_reply {
    u16 _vl_msg_id;
    u32 context;
    i32 retval;
    u32 resulting_feature_bitmap;
}) vl_api_bridge_flags_reply_t;


The bits allocated for BD features in feature_bitmap are in open-vpp/vnet/vnet/l2/l2_bd.h:

 #define L2_LEARN    (1<<0)
 #define L2_FWD      (1<<1)
 #define L2_FLOOD    (1<<2)
 #define L2_UU_FLOOD (1<<3)
 #define L2_ARP_TERM (1<<4)

The following is the API message for adding and deleting IP and MAC entries into BDs to support ARP request termination:

typedef VL_API_PACKED(struct _vl_api_bd_ip_mac_add_del {
    u16 _vl_msg_id;
    u32 client_index;
    u32 context;
    u32 bd_id;
    u8 is_add;
    u8 is_ipv6;
    u8 ip_address[16];
    u8 mac_address[6];
}) vl_api_bd_ip_mac_add_del_t;

typedef VL_API_PACKED(struct _vl_api_bd_ip_mac_add_del_reply {
    u16 _vl_msg_id;
    u32 context;
    i32 retval;
}) vl_api_bd_ip_mac_add_del_reply_t;


The API is designed to be generic so that it can support IPv6 as well. For now, the API call will fail if is_ipv6 is set, as IPv6 neighbor discovery termination is not yet supported.
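
The corresponding debug CLI can be used to add and remove an entry; a sketch with placeholder values (BD 13 maps to the bd_id field, and the trailing del corresponds to is_add = 0):

    set bridge-domain arp entry 13 192.168.13.2 00:11:22:33:44:55
    set bridge-domain arp entry 13 192.168.13.2 00:11:22:33:44:55 del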


Loopback Interface Creation

The following sample shows the API message for creating and deleting loopback interfaces:

typedef VL_API_PACKED(struct _vl_api_create_loopback {
    u16 _vl_msg_id;
    u32 client_index;
    u32 context;
    u8  mac_address[6];
}) vl_api_create_loopback_t;

typedef VL_API_PACKED(struct _vl_api_create_loopback_reply {
    u16 _vl_msg_id;
    u32 context;
    u32 sw_if_index;
    i32 retval;
}) vl_api_create_loopback_reply_t;

typedef VL_API_PACKED(struct _vl_api_delete_loopback {
    u16 _vl_msg_id;
    u32 client_index;
    u32 context;
    u32 sw_if_index;
}) vl_api_delete_loopback_t;

typedef VL_API_PACKED(struct _vl_api_delete_loopback_reply {
    u16 _vl_msg_id;
    u32 context;
    i32 retval;
}) vl_api_delete_loopback_reply_t;


If the value 0 is used for mac_address, then a default MAC address of dead:0000:000n will be used, where n is the loopback instance number.

The sw_if_index field is used to specify the loopback interface to delete.

Note that BD membership and IP addresses of a loopback interface must be cleared before deleting it. Otherwise, VPP may become unstable or even crash. Similarly, the sw_if_index used to delete a loopback interface must be the correct one, or the resulting VPP behavior is unpredictable.
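
A hedged sketch of the corresponding debug CLI, assuming the created interface is named loop0 and that its BD membership and IP addresses have already been cleared as noted above (the delete syntax in particular is an assumption and may vary by VPP version):

    loopback create-interface mac de:ad:be:ef:00:01
    loopback delete-interface intfc loop0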