In the previous post, I covered some of the basic concepts behind network overlays, primarily highlighting the need to move into a more robust, L3 based, network environments. In this post I would like to cover network overlays in more detail, going over the different encapsulation options and highlighting some of the key points to consider when deploying an overlay-based solution.

Underlying fabric considerations

While network overlays give you the impression that networks are suddenly all virtualized, we still need to consider the physical underlying network. No matter what overlay solution you might pick, it’s still going to be the job of the underlying transport network to switch or route the traffic from source to destination (and vice versa).

Like any other network design, there are several options to choose from when building the underlying network. Before picking up a solution, it’s important to analyze the requirements - namely the scale, amount of virtual machines (VMs), size of the network as well as the amount of traffic. Yes, there are some fancy network fabric solutions out there from any of your favorite vendors, but simple L3 Clos network will do just fine. The big news here is that the underlying network should no longer be a L2 bridged network, but can be configured as a L3 routed network. Clos topology with ECMP routing can provide efficient non-blocking forwarding with a quick convergence time in a case of a failure. Known protocols such as OSPF, IS-IS, and BGP, with the addition of a protocol like BFD, can provide a good standard-based foundation for such a network. One thing I do want to highlight when it comes to the underlying network, is the requirement to support Jumbo frames. No matter what overlay encapsulation you may choose to implement, extra bytes of header will be added to the frames, resulting in a need for high MTU support from the physical network.

For the virtualization/cloud admin, with overlay networks, the data network used to carry the overly traffic is no longer a special network that requires careful VLAN configuration. It is now just one more infrastructure network used to provide simple TCP/IP connectivity.


When it comes to the overlay data-plane encapsulation, the amount of discussions, comparisons and debate out there is amazing. There are several options and standards available, all of them have the same goal: provide an emulated L2 networks over IP infrastructure. The main difference between them is the encapsulation format itself and their approach to the control plane - which is essentially the way to obtain MAC-to-IP mapping information for the tunnel end-points.

It all started with the well-known Generic Routing Encapsulation (GRE) protocol that was rebranded as NVGRE. GRE is a simple point-to-point tunneling protocol which is being used in todays networks to solve various design challenges and therfore is well understood by many network engineers. With NVGRE, the inner frame is being encapsulated with GRE encapsulation as specified in RFC 2784 and RFC 2890. The Key field (32 bits) in the GRE header is used to carry the Tenant Network Identifier (TNI) and is used to isolate different logical segments. One thing to note about GRE is the fact that it uses IP protocol number 47 for communication, i.e., it does not use TCP or UDP - which make it hard to create header entropy. Header entropy is something that you really want to have if you are using a flow-based ECMP network to carry the overlay traffic. Interesting enough, the authors of NVGRE do not cover the control plane part but only the data-plane considerations.

Other option would be Virtual Extensible LAN (VXLAN). Unlike NVGRE, VXLAN is a new protocol that was designed to solve the overlay networks use case. It uses UDP for communication (port 4789) and a 24-bit segment ID known as the VXLAN network identifier (VNID). With VXLAN, a hash of the inner frame’s header is used as the VXLAN source UDP port. As a result, a VXLAN flow can be unique, with the IP addresses and UDP ports combination in its outer header while traversing the underlay physical network. Therefore, the hashed source UDP port introduces a desirable level of entropy for ECMP load balancing. When it comes to the control plane, VXLAN does not provide any solution, but instead relies on flooding emulated with IP multicast. The original standard recommends to create an IP multicast group per VNI to handle broadcast traffic within a segment. This requires support for IP multicast on the underlying physical network as well as proper configuration and maintenance of the various multicast trees. This approach may work for small scale environments, but for large environments with good number of logical VXLAN segments this is probably not a good idea. It also important to note here that while IP multicast is a clever way to handle IP traffic, it is not commonly implemented in Data Center networks today, and the requirement to deploy an IP multicast network (which can be fairly complex) just to introduce VXLAN is not something that is accepted in most cases. These days, it is common to see “unicast mode” VXLAN implementations that do not require any kind of multicast support.

You may also have heard about Stateless Transport Tunneling Protocol (STT) which was originally introduced by Nicira (now VMware NSX). The main reason I decided to mention STT here is one of its benefits: the ability to leverage TCP offloading capabilities from existing physical NICs, resulting in improved performance. STT uses a header that looks just like the TCP header to the NIC. The NIC is thus able to perform Large Segment Offload (LSO) on what it thinks is a simple TCP datagram. That said, new generation NICs also offer offload capabilities for NVGRE and VXLAN, so this is not a unique benefit of STT anymore.

Last but not least, I would also like to introduce Geneve: Generic Network Virtualization Encapsulation, which looks to take a more holistic view of tunneling. From a first look, Geneve looks pretty much similar to VXLAN. It uses a UDP-based header and a 24 bit Virtual Network Identifier. So what is unique about Geneve? The fact that it uses an extendable header format, similar to (long-living) protocols such as BGP, LLDP, and IS-IS. The idea is that Geneve can evolve over time with new capabilities, not by revising the base protocol, but by adding new optional capabilities. The protocol has a set of fixed header, parameters and values, but then leave room for non-defined optional fields. New fields can be added to the protocol by simply defining and publishing them. The protocol is created in such a way that implementations know there may be optional fields that they may or may not understand. Although the protocol is new, there is some work to enable Open vSwitch support as well as NIC vendors announcing support for offloading capabilities.

I also want to leave room here for some other protocols that can be used as an encapsulation option. There is nothing wrong with MPLS for example, other than the fact that it requires to be enabled throughout the underlying transport network.

So should I pick a winner? probably not. As you can see you have got some options to choose from, but let’s make it clear: all protocols discussed above are ignoring the real problems (hint: control-plane operations) and providing a nice framework for data-plane encapsulation, which is just part of the deal. If I need to pick one, I would say that it looks like VXLAN and Geneve are here to stay (but we should let the market decide).

Tunnel End Point

I have already mentioned the term tunnel end-point, sometimes refer to as VTEP, earlier. But what is this end-point, and more importantly, where is it located? The function of VTEP is to encapsulate the VM traffic within an IP header to send across the underlying IP network. With the most common implementations, the VMs are unaware of the VTEP. They just send untagged or VLAN-tagged traffic that needs to classified and associated with a VTEP. Initial designs (which are still the most common ones) implemented the VTEP functionality within the hypervisor which houses the VMs, usually in the software vSwitch. While this is a valid solution that is probably here to stay, it also worth mentioning an alternative design in which the VTEP functionality is implemented in hardware, for e.g., within a top-of-rack (ToR) switch. This makes sense is some environments, especially where performance and throughput is critical.

Control plane or flooding

Probably the most interesting question to ask when picking an overlay network solution is what’s going on with the control plane and how the network is going to handle Broadcast, Unknown unicast and Multicast traffic (sometimes refer to as BUM traffic). I am not to going to provide easy answers here, simply because of the fact that there are plenty of solutions out there, each addresses this problem differently. I just want to emphasize that the protocol you are going to use to form the overly network (e.g., NVGRE, VXLAN, or what have you) is essentially taking care only for the data-plane encapsulation. For control plane you will need to rely either on flooding (basically continue to learn MAC addresses via the “flood and learn” method to ensure that the packet reaches all the other tunnel end-points), or consulting some sort of database which includes the MAC to IP bindings in the network (e.g., an SDN controller).

Connectivity with the outside world

Another factor to consider is the connectivity with the outside world - or how can a VM within an overlay network communicate with a device resides outside of the network. No matter how much overlays would be popular throughout the network, there are still going to be devices inside and outside of the Data Center that speaks only native IP or understand just 802.1Q VLANs. In order to communicate with those the overlay packet will need to get into some kind of a gateway that is capable of bridging or routing the traffic correctly. This gateway should handle the encapsulation/decapsulation function and provide the required connectivity. As with the control plane considerations, this part is not really covered in any of the encapsulation standards. Common ways to solve this challenge is by using virtual gateways, essentially logical routers/switches implemented in software (take a look on Neutron’s l3-agent to see how OpenStack handle this), or by introducing dedicated physical gateway devices.

Are overlays the only option?

I would like to summarize this post by emphasizing that overlays are an exciting technology which probably makes sense in certain environments. As you saw, an overly-based solution needs to be carefully designed, and as always depends on your business and network requirements. I also would like to emphasize that overlays are not the only option to scale-out networking, and I have seen some cool proposals lately which are probably deserve their own post.