Published: 2022-12-21
Updated: 2023-09-06
Virtual eXtensible LAN (VXLAN) is a new and popular tunneling technology that allows a network to tunnel Ethernet frames over a routed network. This solves many scalability problems, especially for large-scale datacenter networks. The aim of this article is to introduce VXLAN, look at why it was developed and take a quick under-the-hood look at how it works. I plan on creating additional posts where I explore specific VXLAN features, of which there are many. It is impossible to cover it all in one go.
The main purpose of VXLAN is to replace the VLAN tag, so this article will spend some time talking about VLANs and why they exist. If you feel confident in your knowledge of how a switch performs MAC-learning and VLAN-tagging then you may skip the VLAN chapter.
Another main purpose of VXLAN is to not rely on Spanning Tree Protocol (STP) to keep the network free from loops. Since Ethernet has no concept of loop detection and relies on flooding unknown traffic, STP must be deployed in switched network topologies to keep them loop-free. STP achieves this by blocking inter-switch links that would otherwise cause a loop. The STP tradeoff is that these blocked links remain unutilized. When using VXLAN, inter-switch links are routed links, which means they can all be active and forwarding. This will be discussed in greater detail below.
We will also cover the topics of Ingress Replication and attempt to demystify the terms Underlay and Overlay.
Let's get started!
VLANs allow a local area network to be separated into multiple virtual local area networks. This is a strong security tool, as placing devices in separate network segments makes them unable to communicate without passing through a router or firewall. If VLANs did not exist, each network segment would require its own set of switches, making it difficult and resource-inefficient to deploy a new isolated segment in your network. Thanks to VLANs and the act of VLAN-tagging Ethernet frames when forwarding, a switch can be split into 4094 virtual segments. As devices can only communicate with devices on the same VLAN, creating new isolated segments becomes trivial.
VLAN tags are added by attaching an 802.1Q extension to the Ethernet header, adding four more bytes to the Ethernet frame. Out of the 4 bytes added by the 802.1Q extension, only 12 bits are allocated for VLAN IDs. This means that only 4096 values are possible (0-4095). With VLAN 0 and 4095 being reserved values, only VLANs 1-4094 are usable.
The image below shows an example of a local network separated into three Virtual LANs: Blue, Green and Orange. PC11, PC21 and PC31 can communicate directly since they are all members of the same VLAN. If PC11 wants to communicate with PC12, the packet must be sent to R1 for Inter-VLAN routing.
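To make the bit layout concrete, here is a minimal Python sketch of how the 4-byte 802.1Q tag could be built and parsed. The function names are my own and purely illustrative. The tag consists of a 16-bit TPID (0x8100) followed by 16 bits of Tag Control Information, of which the lowest 12 bits carry the VLAN ID.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q-tagged frame


def build_dot1q_tag(vlan_id: int, priority: int = 0, dei: int = 0) -> bytes:
    """Return the 4-byte 802.1Q tag: 16-bit TPID followed by 16-bit TCI."""
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= priority <= 7 or dei not in (0, 1):
        raise ValueError("priority is 3 bits, DEI is 1 bit")
    # TCI layout: 3 bits priority, 1 bit DEI, 12 bits VLAN ID
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_vlan_id(tag: bytes) -> int:
    """Extract the 12-bit VLAN ID from a 4-byte 802.1Q tag."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return tci & 0x0FFF


tag = build_dot1q_tag(vlan_id=10, priority=5)
usable_vlans = range(1, 4095)  # 0 and 4095 are reserved

print(tag.hex(), parse_vlan_id(tag), len(usable_vlans))
```

The 12-bit mask `0x0FFF` is what limits the VLAN ID space to 4096 values.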
For this network to function correctly, the SW1-3 switches and R1 must all support 802.1Q. Part of the functionality includes tagging and untagging Ethernet frames as they are forwarded. For the switches it also means creating separate MAC-address tables, one for each VLAN.
Furthermore, each switchport that connects to a PC is assigned to a single VLAN, called an Access Port in Cisco terminology. In this example, the PC11 switchport on SW1 is assigned to the Blue VLAN, the PC12 switchport is assigned to the Green VLAN, etc. This allows PC11 and PC21 to communicate directly, but forces PC11-PC12 communication to be routed via R1, our Inter-VLAN router.
Finally, the links between the switches (and R1) must be configured to allow VLAN-tagged traffic, in Cisco terms called a Trunk port. It is only when the traffic is forwarded out on a Trunk port that the VLAN tag is added to the Ethernet frame. This is necessary to tell the next device what VLAN the frame belongs to.
Let's look at how VLANs and inter-VLAN routing are used when forwarding traffic between switches and routers in a switched topology. In this example PC11 sends a ping to PC32. Because PC11 and PC32 are in different subnets, PC11 must send the traffic to its default gateway (R1), which performs inter-VLAN routing, forwarding the packet between vlan BLUE and GREEN.
Our packet walk starts when SW1 receives the frame from PC11. SW1 finds a matching entry for R1 in its VLAN Blue MAC-address table, so the frame is to be sent out on the SW1-R1 Trunk port. SW1 therefore adds the VLAN tag Blue when forwarding the frame to R1.
R1 receives the frame on its VLAN Blue subinterface and sees the destination MAC in the frame is its own, so the frame is accepted for further processing. The Ethernet header is at this point irrelevant so R1 removes it and then examines the IP header, finding that the destination IP is PC32. R1 performs a routing lookup and determines the outgoing interface towards PC32 to be the VLAN Green subinterface. R1 then slaps on a new Ethernet header setting itself as the source MAC and PC32 as the destination MAC. Finally it adds VLAN tag Green and forwards the frame to SW1.
SW1 receives the frame and by looking at the VLAN tag it knows the frame belongs to VLAN Green. SW1 then checks the destination MAC against its VLAN Green MAC-address table, learning that the packet should be forwarded to SW3. Since the SW1-SW3 link is a Trunk port, the VLAN tag is kept and the frame is sent to SW3.
SW3 receives the frame from SW1 and uses the VLAN tag to associate the frame to VLAN Green. It finds the outgoing port to PC32 and notices that it is an access port, so the VLAN tag is removed and the frame is forwarded.
This concludes our packetwalk where we examined how VLAN tags are added and removed depending on the type of interface the frame is to be forwarded out on. Additionally, we have explained how a router strips and adds Ethernet headers as the IP packet is routed between network segments.
The VLAN chapter has reached its end and we can continue our journey of learning VXLAN. Coming up next: a discussion on the Spanning-Tree Protocol, what problems it has and how we can use VXLAN to solve those problems.
Why do we need STP in an Ethernet switched network? The simple reason is that Ethernet has no built-in loop prevention mechanism. The IP header has a TTL field that is decremented every time the packet passes through a router, and the packet is discarded once TTL reaches 0. Ethernet has no such field or mechanism. The diagram below shows how STP has blocked the SW2-SW3 link to keep the topology loop-free (see massive red cross):
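The tagging rules from the packetwalk can be sketched in a few lines of Python. This is a simplified model, not how any real switch is implemented: the `Frame` and `Switch` classes, the port names and the MAC placeholders are all made up for illustration. It shows a per-VLAN MAC-address table, MAC-learning on ingress, and the tag being kept on trunk ports and removed on access ports.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Frame:
    dst_mac: str
    src_mac: str
    vlan_tag: Optional[str] = None  # None means the frame is untagged


@dataclass
class Switch:
    port_mode: dict    # port name -> "access" or "trunk"
    access_vlan: dict  # access port name -> VLAN it is assigned to
    mac_tables: dict = field(default_factory=dict)  # VLAN -> {MAC: port}

    def receive(self, frame: Frame, in_port: str):
        # Trunk ports read the VLAN from the tag, access ports from config.
        if self.port_mode[in_port] == "trunk":
            vlan = frame.vlan_tag
        else:
            vlan = self.access_vlan[in_port]
        table = self.mac_tables.setdefault(vlan, {})
        table[frame.src_mac] = in_port  # MAC-learning on the source address
        out_port = table.get(frame.dst_mac)
        if out_port is None:
            return None  # unknown destination: a real switch would flood
        # Tag is present on trunk egress and absent on access egress.
        tag = vlan if self.port_mode[out_port] == "trunk" else None
        return out_port, Frame(frame.dst_mac, frame.src_mac, tag)


# Last hop of the walk: SW3 receives a Green-tagged frame from SW1.
sw3 = Switch(
    port_mode={"to_SW1": "trunk", "to_PC32": "access"},
    access_vlan={"to_PC32": "Green"},
    mac_tables={"Green": {"pc32-mac": "to_PC32"}},
)
out_port, out_frame = sw3.receive(Frame("pc32-mac", "r1-mac", "Green"), "to_SW1")
print(out_port, out_frame.vlan_tag)
```

Because SW3 learned the source MAC of R1 on the trunk port, a reply from PC32 would be looked up in the same Green table and leave tagged via `to_SW1`.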
If we did not have STP running then inter-switch links would be forwarding. This would mean that any Broadcast, Multicast or Unknown Unicast packet (BUM traffic) flooded by SW1 would eventually come back to SW1 (SW1->SW3->SW2->SW1) and create an endless loop. Each looped packet will gradually increase the load on all links and network devices until something breaks, taking down the network.
So by enabling STP and making SW1 the STP Root Bridge, SW2 or SW3 decides that the SW2-SW3 link forms a loop and promptly blocks it. No packet flooded by SW1 can come back to SW1, so no loop can form. If the SW1-SW2 or SW1-SW3 link fails, the SW2-SW3 link is reenabled to keep network traffic flowing.
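To illustrate the difference, here is a small Python simulation of a single broadcast flooded by SW1, with and without the SW2-SW3 link. It is a toy model with assumptions of my own: every switch floods a received frame out on all links except the one it arrived on, all frames move one hop per round, and the `flood` function simply counts transmissions per round.

```python
def flood(links, origin, max_rounds=10):
    """Return the number of frame transmissions in each round of flooding."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each in-flight frame is tracked as (sender, receiver).
    in_flight = [(origin, n) for n in sorted(neighbors.get(origin, ()))]
    per_round = []
    for _ in range(max_rounds):
        if not in_flight:
            break  # flooding has died out on its own
        per_round.append(len(in_flight))
        next_flight = []
        for sender, receiver in in_flight:
            for n in sorted(neighbors[receiver]):
                if n != sender:  # never flood back out the ingress link
                    next_flight.append((receiver, n))
        in_flight = next_flight
    return per_round


triangle = [("SW1", "SW2"), ("SW1", "SW3"), ("SW2", "SW3")]
with_stp = [("SW1", "SW2"), ("SW1", "SW3")]  # SW2-SW3 blocked by STP

print(flood(triangle, "SW1"))  # still circulating when max_rounds is hit
print(flood(with_stp, "SW1"))  # stops after the first round
```

In the triangle the frame copies never stop circulating, and every new broadcast adds more copies on top. With the SW2-SW3 link blocked, flooding ends after one round.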
A major drawback of STP is that network bandwidth is reduced. The SW2-SW3 link could be forwarding traffic but it must be blocked to avoid loops, forcing traffic between SW2 and SW3 to pass through SW1.
Additionally, if the SW1-SW3 link is heavily utilized, adding another link between the two switches (SW1-SW3a, SW1-SW3b) would create a new loop in the network. This is because any flooded packet received on SW1-SW3a would be sent back out via SW1-SW3b. Because of this, STP would have to block the SW1-SW3b link, which takes away the whole point of adding another link.
To fix this one typically configures a Port-Channel/Link-Aggregation (LAG) interface, bundling the two physical links into one logical interface. LAGs are out of scope for this article, but this solves the SW1-SW3 bandwidth problem. The SW2-SW3 link would however remain unused. Let's see what VXLAN can do to fix this problem.
Now that we have reviewed how VLAN tagging works and learned the limitations of STP, we can finally focus on VXLAN. VXLAN is both a tunneling/encapsulation technology and a flood-and-learn forwarding mechanism similar to Ethernet switching. We will explore each subject below.
When an Ethernet frame enters a VXLAN tunnel, the Ethernet frame is encapsulated inside a VXLAN header which in turn is encapsulated inside a UDP packet. Because we are using UDP, we also need a new IP header and Ethernet header. This adds 50 bytes of overhead to the original Ethernet frame (a 14-byte Ethernet header, a 20-byte IPv4 header, an 8-byte UDP header and an 8-byte VXLAN header). The source IP-address in the outer IP header is a loopback address, called the VXLAN Tunnel Endpoint (VTEP) address, configured on the switch that generated the VXLAN packet. The destination IP is the VTEP Loopback address of the switch that is to receive the VXLAN packet.
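The 50 bytes can be verified with simple arithmetic. The sketch below assumes an IPv4 underlay with no IP options and no 802.1Q tag on the outer Ethernet header; an IPv6 underlay or a tagged outer frame would give a larger number.

```python
# Header sizes in bytes (assumption: IPv4 underlay, untagged outer frame)
OUTER_ETHERNET_HEADER = 14
OUTER_IPV4_HEADER = 20
UDP_HEADER = 8
VXLAN_HEADER = 8

vxlan_overhead = (
    OUTER_ETHERNET_HEADER + OUTER_IPV4_HEADER + UDP_HEADER + VXLAN_HEADER
)

# Example: a full-size untagged Ethernet frame (1500-byte payload plus
# 14-byte header, FCS not counted) after VXLAN encapsulation.
original_frame = 1514
encapsulated_frame = original_frame + vxlan_overhead

print(vxlan_overhead, encapsulated_frame)
```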
The default UDP destination port when sending the VXLAN packet is 4789. The source-port is often hashed based on the contents of the original Ethernet frame, allowing ECMP to forward the tunneled packets over multiple paths.
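As an illustration of the source-port hashing, here is a hedged Python sketch. Real switches do this in hardware with vendor-specific hash functions and input fields; using CRC32 over the inner MAC addresses is purely my assumption for the example. The 49152-65535 range is the dynamic/private port range that RFC 7348 recommends for the source port.

```python
import zlib

VXLAN_DST_PORT = 4789
PORT_RANGE_START = 49152
PORT_RANGE_END = 65535


def vxlan_source_port(inner_dst_mac: bytes, inner_src_mac: bytes) -> int:
    """Derive a UDP source port from fields of the inner Ethernet frame."""
    flow_hash = zlib.crc32(inner_dst_mac + inner_src_mac)
    port_count = PORT_RANGE_END - PORT_RANGE_START + 1
    return PORT_RANGE_START + flow_hash % port_count


r1_mac = bytes.fromhex("020000000001")    # made-up MAC addresses
pc32_mac = bytes.fromhex("020000000032")

port = vxlan_source_port(pc32_mac, r1_mac)
print(port, VXLAN_DST_PORT)
```

Frames belonging to the same inner flow always get the same source port and therefore follow the same path, while different flows are likely to get different ports and can be spread across ECMP paths by the underlay routers.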
The VXLAN header contains the Virtual Network Identifier (VNI) field. The VNI tells the receiving VTEP what VLAN or VRF the encapsulated packet belongs to. The VNI field is 24 bits long, meaning it can hold roughly 16.7 million different values (2^24 = 16,777,216). This is vastly more than the 4094 usable VLAN IDs allowed by 802.1Q.
Let's examine this image. The left side displays the original Ethernet Frame. It has an Ethernet header, an IP header, and a TCP header. The rest of the packet contains the application payload, in this case an HTTP packet. This is the frame generated by an end device.
The right side shows what the packet looks like after the original Ethernet frame was VXLAN-encapsulated. A VXLAN header was added, inside of which we can find the VNI field. Outside of that we have the UDP header where port 4789 signals that it is a VXLAN packet. Then we have a new IP header where the source IP is the VTEP that did the VXLAN encapsulation (10.0.0.1) and the destination IP is the remote VTEP that should receive the VXLAN packet (10.0.0.2). Lastly, the IP packet is encapsulated inside an Ethernet frame so that it can be sent out on the local Ethernet segment.
When the destination switch (10.0.0.2) receives the packet, it sees that the destination IP is a locally configured IP-address, so it is the intended recipient. The switch further inspects the packet and sees that the UDP header has destination port 4789, so the packet should be sent to the internal VXLAN process for decapsulation. The VNI is read from the VXLAN header, and the original Ethernet frame is decapsulated and forwarded out on the VLAN (or VRF) indicated by the VNI. I will get into more detail below.
The strength of VXLAN comes from the fact that the Ethernet frame can be routed anywhere we want to send it, as any intermediate router only sees traffic coming from 10.0.0.1 going to 10.0.0.2. As long as these two IP-addresses are able to communicate, the network in between can be any size, shape or form. Only 10.0.0.1 and 10.0.0.2 need to be VXLAN-capable as they must perform encapsulation and decapsulation. This is different from MPLS where every node along the way must know how to label-switch based on MPLS labels. With routed links, there is no need for STP. Active-active loadbalancing is easily implemented as ECMP is a natural part of modern routing protocols.
VXLAN does have some drawbacks, though. The 50 bytes of overhead is one. Also, VXLAN packets should not be fragmented: RFC 7348 states that VTEPs must not fragment VXLAN packets, and a receiving VTEP may discard fragments created along the way. You must therefore ensure the network path can support the extra size of your VXLAN packets. Another drawback of VXLAN is the need for Ingress Replication, which I will cover further down.
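For readers who like to see the bits, below is a minimal Python sketch that packs and unpacks the 8-byte VXLAN header as defined in RFC 7348: 8 bits of flags (where 0x08 marks the VNI as valid), 24 reserved bits, the 24-bit VNI and 8 more reserved bits. The function names and the VNI value 10010 are illustrative choices of mine.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348
MAX_VNI = 2**24 - 1


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for the given VNI."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits)
    # Word 2: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Read the VNI from the first 8 bytes of a VXLAN header."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word2 >> 8


header = pack_vxlan_header(10010)
print(header.hex(), unpack_vni(header), MAX_VNI + 1)
```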
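A quick way to sanity-check a path is to compute the IP MTU the underlay must support. The sketch below rests on a few assumptions: an IPv4 underlay, an inner frame without an 802.1Q tag, and MTU measured as the size of the IP packet (vendors differ in how they count interface MTU). The function names and link MTU values are made up for illustration.

```python
INNER_ETHERNET_HEADER = 14
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4_HEADER = 20


def required_underlay_ip_mtu(host_ip_mtu: int) -> int:
    """Smallest underlay IP MTU that carries a full-size host packet."""
    inner_frame = host_ip_mtu + INNER_ETHERNET_HEADER
    return inner_frame + VXLAN_HEADER + UDP_HEADER + OUTER_IPV4_HEADER


def path_supports_vxlan(host_ip_mtu: int, link_ip_mtus: list) -> bool:
    """True if every underlay link can carry the encapsulated packet."""
    needed = required_underlay_ip_mtu(host_ip_mtu)
    return all(mtu >= needed for mtu in link_ip_mtus)


print(required_underlay_ip_mtu(1500))            # 1550
print(path_supports_vxlan(1500, [9000, 1500]))   # False: one link too small
print(path_supports_vxlan(1500, [9000, 9000]))   # True
```

Note that the outer Ethernet header is not part of the IP MTU, which is why the result is 1550 and not 1500 + 14 + 50.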
As shown in the diagram below, the inter-switch links are now routed links with OSPF enabled. As part of the added VXLAN configuration on the switches, each switch has a Loopback0 interface configured for VTEP connectivity. SW1 has its loopback interface configured with IP-address 10.0.0.1/32. This prefix is advertised by OSPF to allow SW2 and SW3 to learn how to reach 10.0.0.1.
Let's do a new packetwalk between PC11 and PC32 now that VXLAN has been enabled:
Initial SW1 behavior is unchanged as it receives a frame from PC11 on vlan Blue. It sends the traffic to R1 via a trunk port, so the vlan Blue tag is inserted before forwarding.
R1 behavior is unchanged as it performs inter-VLAN routing, stripping the now obsolete vlan Blue Ethernet header and adding a new vlan Green Ethernet header, setting itself as source MAC and PC32 as destination MAC.
SW1 now receives the packet on vlan Green and via MAC-address table lookup sees that PC32 is reachable via VTEP 10.0.0.3, VNI Green. So SW1 VXLAN-encapsulates the Ethernet frame and prepares to forward to SW3. But before it does so, SW1 removes the vlan Green tag from the original Ethernet frame. This is an important detail with VXLAN, because the network identifier is now carried by the VNI value in the VXLAN header. The VXLAN packet is then sent to SW3.
SW3 receives the VXLAN packet and decapsulates the original Ethernet frame, which according to the VNI belongs to vlan Green. SW3 performs a MAC-address table lookup for PC32 and then forwards the original frame. SW3 also performs MAC-learning while VXLAN-decapsulating, allowing it to learn that the MAC-address of R1 is reachable via VTEP 10.0.0.1 for vlan Green. This helps SW3 know where to forward any ping reply from PC32 to PC11.
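The MAC-learning described in this walk can be modeled as a MAC-address table whose entries point either to a local port or to a remote VTEP. The Python sketch below is a conceptual model only; the class name, method names, port name and MAC placeholders are invented for the example.

```python
class VtepMacTable:
    """MAC-address table where a MAC maps to a local port or a remote VTEP."""

    def __init__(self):
        self.entries = {}  # (vlan, mac) -> ("port", name) or ("vtep", ip)

    def learn_local(self, vlan, src_mac, in_port):
        # Frame arrived on a normal switchport.
        self.entries[(vlan, src_mac)] = ("port", in_port)

    def learn_remote(self, vlan, inner_src_mac, outer_src_ip):
        # Frame arrived VXLAN-encapsulated: the outer source IP tells us
        # which VTEP the inner source MAC sits behind.
        self.entries[(vlan, inner_src_mac)] = ("vtep", outer_src_ip)

    def lookup(self, vlan, dst_mac):
        return self.entries.get((vlan, dst_mac))  # None means flood


sw3_table = VtepMacTable()
sw3_table.learn_local("Green", "pc32-mac", "Eth1")

# SW3 decapsulates the ping from SW1 (outer source IP 10.0.0.1).
sw3_table.learn_remote("Green", "r1-mac", "10.0.0.1")

# The ping reply from PC32 is addressed to R1's MAC.
print(sw3_table.lookup("Green", "r1-mac"))
```

When the lookup returns a `("vtep", ip)` entry, the switch knows it must VXLAN-encapsulate the frame and send it to that IP-address instead of forwarding it out on a local port.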
One key benefit of SW1 removing the VLAN tag before VXLAN-encapsulating the packet is that SW1 and SW3 may map VNI Green to different local VLANs. It is fully possible for SW1 to map VNI Green to vlan 10 locally and for SW3 to map VNI Green to vlan 30.
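Here is a small Python sketch of that idea. The VNI value 10010 and the dictionary-based frame representation are assumptions made purely for illustration. SW1 strips the local VLAN before encapsulating, and SW3 assigns its own local VLAN based on the VNI when decapsulating.

```python
VNI_GREEN = 10010  # hypothetical VNI value for the Green segment

# Each switch has its own local VLAN-to-VNI mapping.
SW1_VLAN_TO_VNI = {10: VNI_GREEN}
SW3_VLAN_TO_VNI = {30: VNI_GREEN}


def encapsulate(frame: dict, vlan_to_vni: dict) -> dict:
    """Strip the local VLAN and carry the segment identity as a VNI."""
    vni = vlan_to_vni[frame["vlan"]]
    inner = {key: value for key, value in frame.items() if key != "vlan"}
    return {"vni": vni, "inner": inner}


def decapsulate(packet: dict, vlan_to_vni: dict) -> dict:
    """Map the VNI back to whatever VLAN the local switch uses for it."""
    vni_to_vlan = {vni: vlan for vlan, vni in vlan_to_vni.items()}
    frame = dict(packet["inner"])
    frame["vlan"] = vni_to_vlan[packet["vni"]]
    return frame


frame_on_sw1 = {"dst": "pc32-mac", "src": "r1-mac", "vlan": 10}
packet = encapsulate(frame_on_sw1, SW1_VLAN_TO_VNI)
frame_on_sw3 = decapsulate(packet, SW3_VLAN_TO_VNI)

print(packet["vni"], frame_on_sw3["vlan"])
```

The frame leaves SW1 from vlan 10 and arrives on SW3 in vlan 30, yet both switches agree it belongs to the Green segment because the VNI is the same.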
One thing we haven't yet discussed is how VXLAN handles BUM-traffic flooding. In an Ethernet switched network, a switch will flood a BUM frame on all other ports in the same VLAN. The frame is not flooded out on the same port it was received on, as that would break Ethernet MAC-learning as well as potentially create a loop.
How does a switch flood a BUM packet when VXLAN is in use? The answer is called the VXLAN Flood List and is configured either globally or per VLAN. For example, the vlan BLUE flood-list on SW1 would be "10.0.0.2, 10.0.0.3". The vlan BLUE flood-list on SW2 would be "10.0.0.1, 10.0.0.3" and so on. When SW1 receives a BUM packet on a local port on vlan BLUE, it will:
1. flood the frame out on all other local ports in that VLAN
2. VXLAN-encapsulate the frame and generate a copy to every destination VTEP in the flood list.
One copy is sent to SW2 (10.0.0.2) and one copy is sent to SW3 (10.0.0.3). Whenever SW2 and SW3 receive their copy, they flood the decapsulated frame out on all local ports in that VLAN. The same split-horizon rule applies here: a packet received on the "VXLAN" port is not flooded back out on the VXLAN port as it would break MAC-learning and create a loop.
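The flooding rules above can be summarized in a short Python sketch. It is a simplified model: the `flood_bum` function and the port names are hypothetical, and a real switch does this replication in its forwarding hardware.

```python
SW1_BLUE_FLOOD_LIST = ["10.0.0.2", "10.0.0.3"]
SW1_BLUE_LOCAL_PORTS = ["Eth1", "Eth2", "Eth3"]  # made-up port names


def flood_bum(local_ports, flood_list, in_port=None, from_vxlan=False):
    """Return (local ports to flood on, remote VTEPs to send a copy to)."""
    # Never flood back out on the port the frame was received on.
    local_targets = [port for port in local_ports if port != in_port]
    # Split horizon: a frame that arrived via VXLAN is not re-encapsulated.
    remote_targets = [] if from_vxlan else list(flood_list)
    return local_targets, remote_targets


# BUM frame received on local port Eth1: flood locally and replicate.
print(flood_bum(SW1_BLUE_LOCAL_PORTS, SW1_BLUE_FLOOD_LIST, in_port="Eth1"))

# BUM frame received from a remote VTEP: flood on local ports only.
print(flood_bum(SW1_BLUE_LOCAL_PORTS, SW1_BLUE_FLOOD_LIST, from_vxlan=True))
```

The number of VXLAN copies generated by the ingress switch equals the length of the flood list, which is why the list size matters for scalability.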
This process of creating multiple copies is called Ingress Replication (IR) or Head-end Replication (HER). This is one of the drawbacks of VXLAN compared to native Ethernet switching, because unicast routing has no flooding mechanism. It is up to the ingress VTEP to generate VXLAN-encapsulated copies of the frame and send one copy to each VTEP in the flood list. This process can scale poorly, as flooding a single BUM packet takes a lot more effort when there are 100 VTEPs in the flood list. Different vendors and ASICs have different limitations, so this is something you may need to keep in mind if you decide to go for a VXLAN design.
One solution for IR scalability would be to use a multicast group address as the flood destination instead of a list of unicast VTEP addresses, and rely on multicast routing in the underlay so that routers perform the replication as the packet travels down the multicast tree. This adds its own complexity, however, and may not be supported by all vendors.
Popular terms used in VPN technologies nowadays are "Underlay" and "Overlay". The term underlay refers to the physical topology of the network, while the term overlay refers to the virtual topology created inside each VPN. As this is popular terminology, I'll make an attempt at explaining it using the diagram below:
The underlay portion shows what the actual physical topology looks like in our VXLAN-enabled network. There are a couple of routers in the middle, routing traffic between the VXLAN-enabled SW1-SW4 switches.
The blue overlay describes what the virtual topology looks like for the blue VPN. From the customer perspective it looks like SW1-SW3 are directly connected, just like any Ethernet switched network.
The red overlay describes the virtual topology for the red VPN.
As traffic is tunneled across the overlay, devices in the underlay are hidden from customer view. Additionally, the "core" routers R1 and R2 are only forwarding packets between VXLAN VTEPs. They don't have to learn where each PC in each VPN is, because that's not relevant to the job R1 and R2 perform. This is similar to the BGP-Free Core concept often used in MPLS network designs, and it allows R1 and R2 to keep a much reduced network state. This means that smaller and cheaper boxes can be purchased.
This underlay/overlay separation can be useful for the service provider or datacenter operator running the VXLAN network, as the customers inside each overlay are oblivious to the physical topology of the underlay network. This adds security as it is harder for a customer to map the ISP/DC network or access any devices therein.
Another benefit is that changes in the underlay do not affect the overlay. Sure, if R2 goes down in this example SW4 will be isolated, but SW1, SW2 and SW3 can continue communicating via R1 without issue, so the blue overlay topology remains unaffected.
You have reached the end of this article! This has been an introduction to the VXLAN technology and how it can solve issues with VLAN-based ethernet-switched networks that rely on STP for loop avoidance. We have also discussed how VXLAN tunneling works, what flood lists are, what Ingress Replication is and ended with an explanation of the underlay and overlay terms.
We have only scratched the surface of what VXLAN can do. I have created additional posts to explain specific VXLAN features here: