EVPN Multihoming - MLAG vs ESI

Published: 2023-02-19

This article compares MLAG and ESI for EVPN multihoming. Multihoming is the practice of connecting a device to multiple points in the network. For example, a server may connect to two switches for redundancy, maintaining network connectivity if one of the switches fails. This is more resilient than a server being singlehomed to one point in the network.

With EVPN and VXLAN there are two ways to achieve multihoming. The first is built into the EVPN protocol using Ethernet Segment Identifiers (ESI). By uniquely identifying multihomed Ethernet segments, two or more switches can learn that they provide redundant connectivity to the same Ethernet segment. This segment can connect to a downstream router, switch or server. The uplinks of the downstream device are bundled into a Port-Channel (LAG), creating one logical interface.

The other technology is Multi-Chassis Link Aggregation (MLAG), which enables two physical switches to operate as a single logical switch. This technology is usually vendor-proprietary and its implementation details are often a well-guarded secret.

The goal of this article is to compare these two technologies to help you decide which one you would prefer when designing your network. Each technology comes with its own benefits and drawbacks; I hope to cover most of them below.

I will be using Arista vEOS images in this article to build the lab topology, so this means that we will be limited to learning about the Arista MLAG implementation. Many vendors have MLAG implementations, Cisco Nexus virtual Port-Channel (vPC) being one example, but I will focus on Arista in this article.


Lab topology

This is the lab topology that we will be configuring in this article. The left side contains two MLAG-pairs, LE03a/b and LE04a/b. Connected to LE03 we have SRV31 and SRV32. Connected to LE04 are SRV41 and SRV42. We will focus on this part of the topology in the MLAG chapter.

On the right we have three standalone switches (LE05, LE06 and LE07) that provide EVPN Multihoming using the ESI method. Each switch connects to one singlehomed server: SRV51, SRV61 and SRV71, respectively. There are also four multihomed servers, for example SRV561 and SRV562, which are connected to LE05 and LE06.

In the middle we have two spine switches providing inter-leaf connectivity. Each switch, spine or leaf, has a router-ID in the 10.0.0.XX/32 format where XX is the node ID. This router-ID is configured as an IP-address on the Loopback0 interface and will be used for BGP EVPN adjacencies. LE03a/b and LE04a/b each have a shared IP-address configured on Loopback1: 10.0.0.3/32 and 10.0.0.4/32, respectively. I will cover why Loopback1 is necessary later in the article. Spines and leaves run OSPF as IGP to advertise their loopback-prefixes.

Spine configuration

Since the spine configurations remain unchanged throughout the article, I display them here once. The spines act as BGP Route-Reflectors, reflecting EVPN routes between leaves. The spines do not run any VXLAN features; their main job is to forward packets between leaves as quickly as possible.

service routing protocols model multi-agent
!
interface Ethernet1
   description "LE03a"
   no switchport
   ip address 10.1.31.1/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet2
   description "LE03b"
   no switchport
   ip address 10.1.32.1/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet3
   description "LE04a"
   no switchport
   ip address 10.1.41.1/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet4
   description "LE04b"
   no switchport
   ip address 10.1.42.1/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet5
   description "LE05"
   no switchport
   ip address 10.1.5.1/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet6
   description "LE06"
   no switchport
   ip address 10.1.6.1/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet7
   description "LE07"
   no switchport
   ip address 10.1.7.1/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Loopback0
   ip address 10.0.0.1/32
!
ip routing
!
router bgp 65000
   neighbor EVPN peer group
   neighbor EVPN remote-as 65000
   neighbor EVPN update-source Loopback0
   neighbor EVPN route-reflector-client
   neighbor EVPN timers 5 15
   neighbor EVPN send-community
   neighbor 10.0.0.5 peer group EVPN
   neighbor 10.0.0.6 peer group EVPN
   neighbor 10.0.0.7 peer group EVPN
   neighbor 10.0.0.31 peer group EVPN
   neighbor 10.0.0.32 peer group EVPN
   neighbor 10.0.0.41 peer group EVPN
   neighbor 10.0.0.42 peer group EVPN
   !
   address-family evpn
      neighbor EVPN activate
   !
   address-family ipv4
      no neighbor EVPN activate
!
router ospf 1
   redistribute connected

service routing protocols model multi-agent
!
interface Ethernet1
   description "LE03a"
   no switchport
   ip address 10.2.31.2/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet2
   description "LE03b"
   no switchport
   ip address 10.2.32.2/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet3
   description "LE04a"
   no switchport
   ip address 10.2.41.2/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet4
   description "LE04b"
   no switchport
   ip address 10.2.42.2/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet5
   description "LE05"
   no switchport
   ip address 10.2.5.2/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet6
   description "LE06"
   no switchport
   ip address 10.2.6.2/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet7
   description "LE07"
   no switchport
   ip address 10.2.7.2/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Loopback0
   ip address 10.0.0.2/32
!
ip routing
!
router bgp 65000
   neighbor EVPN peer group
   neighbor EVPN remote-as 65000
   neighbor EVPN update-source Loopback0
   neighbor EVPN route-reflector-client
   neighbor EVPN timers 5 15
   neighbor EVPN send-community
   neighbor 10.0.0.5 peer group EVPN
   neighbor 10.0.0.6 peer group EVPN
   neighbor 10.0.0.7 peer group EVPN
   neighbor 10.0.0.31 peer group EVPN
   neighbor 10.0.0.32 peer group EVPN
   neighbor 10.0.0.41 peer group EVPN
   neighbor 10.0.0.42 peer group EVPN
   !
   address-family evpn
      neighbor EVPN activate
   !
   address-family ipv4
      no neighbor EVPN activate
!
router ospf 1
   redistribute connected


MLAG

This is the first multihoming solution we will be looking at in this article. The goal of MLAG is to turn two physical switches into one virtual switch. An MLAG can only consist of two switches. They must be the same model and run the same software version. MLAG revolves around generating a virtual System MAC-address, also known as System ID. This system MAC-address is used by many parts of the switches as we will demonstrate below:

# "the physical system ID"
LE03a#show version
System MAC address: 5001.0000.003a

# "the virtual system ID generated by MLAG"
LE03a#show mlag
state      :              Active
system-id  :      5001.0000.3333

LE03a#show spanning-tree
MST0
  Spanning tree enabled protocol mstp
  Root ID    Priority    32768
             Address     5001.0000.3333
             This bridge is the root

LE03a#show lacp internal
LACP System-identifier: 8000,5001.0000.003a
MLAG System-identifier: 8000,5001.0000.3333

# "the physical system ID"
LE03b#show version
System MAC address: 5001.0000.003b

# "the virtual system ID generated by MLAG"
LE03b#show mlag
state      :              Active
system-id  :      5001.0000.3333

LE03b#show spanning-tree
MST0
  Spanning tree enabled protocol mstp
  Root ID    Priority    32768
             Address     5001.0000.3333
             This bridge is the root

LE03b#show lacp internal
LACP System-identifier: 8000,5001.0000.003b
MLAG System-identifier: 8000,5001.0000.3333

The output above shows that the physical system MAC-address of LE03a is 5001.0000.003a and that of LE03b is 5001.0000.003b, but the virtual system MAC-address generated by MLAG is 5001.0000.3333 on both. Let's look at some of the places where the virtual MAC-address is used:

  1. Spanning-Tree Protocol (STP)

    STP uses the virtual MAC-address as its Bridge ID. As both switches generate the same Bridge ID, they can both be assigned as Root Bridge for the topology. This is fine as MLAG has a proprietary magic sauce to keep the topology loop-free.

  2. Link Aggregation Control Protocol (LACP)

    This is a useful protocol for negotiating and maintaining Port-Channels as it, among other things, sends keepalive messages to verify that every physical link in the LAG is healthy. Another part is establishing who is at the other end of each physical link in the bundle, an important detail when MLAG is used.
    If MLAG were not configured, LE03a and LE03b would send different System IDs in their LACPDUs to the downstream server. This stops the server from enabling all links in its Port-Channel, as connecting to multiple switches could cause a network loop. When MLAG is enabled on the LE03-switches, they send the same System ID in their LACPDUs and the server can confidently enable all links in the bundle.
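The System ID comparison can be sketched in a few lines of Python. This is a simplified model, not how any particular server OS implements LACP: the function name `bundle_links` and the interface names `eth0`/`eth1` are made up for illustration, and real LACP also compares keys and port state. The System IDs are the ones from the `show lacp internal` output above.

```python
# Simplified model (assumption): a server only bundles links whose LACP partner
# advertises the same System ID as the first link it negotiated with.
def bundle_links(partner_system_ids):
    """Return the server links that can join one Port-Channel."""
    if not partner_system_ids:
        return []
    reference = next(iter(partner_system_ids.values()))
    return [link for link, system_id in partner_system_ids.items()
            if system_id == reference]

# Without MLAG each switch advertises its physical System ID.
without_mlag = {"eth0": "8000,5001.0000.003a", "eth1": "8000,5001.0000.003b"}
# With MLAG both switches advertise the shared virtual System ID.
with_mlag = {"eth0": "8000,5001.0000.3333", "eth1": "8000,5001.0000.3333"}

print(bundle_links(without_mlag))  # only the first link is bundled
print(bundle_links(with_mlag))     # both links are bundled
```

In this model the only thing that changes between the two cases is the System ID seen on the second link, which is exactly what MLAG rewrites.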

Since LE01 and LE02 are not using EVPN ESI or MLAG, they cannot provide the same active-active forwarding functionality. Any connected server must use active-passive forwarding where only a single link (in green) is active at a time. As one can imagine, this wastes potential network resources as a server connected with a 10G link to each switch has a potential total bandwidth of 20G, but can only use 10G due to STP blocking all but one link.

All links turn green when MLAG is configured, enabling full link utilization.

There are probably more virtual system ID use cases than those I've mentioned here. The goal is to have any device communicating with the LE03a/b switches believe they are talking to one switch, not two.

Note: Any routed interface on the LE03-switches will continue using the physical system MAC-address. This ensures that replies to traffic originated by LE03a are not returned to LE03b. If I created an SVI on LE03a and pinged one of the server IP-addresses, the traffic would be sourced from 5001.0000.003a, ensuring that the return traffic comes back to LE03a.

MLAG and Packet Duplication

While MLAG on its own is a great technology, we need to take a moment to examine how it operates together with VXLAN. One problem that has to be solved is how to avoid packet duplication. Imagine a server on SW1 sending an ARP broadcast frame. When VXLAN-flooding, SW1 sends one copy to LE03a (10.0.0.31) and one copy to LE03b (10.0.0.32). Both LE03 switches flood their received copy out on their local switchports, causing packet duplication.

To avoid this problem, we configure Anycast IP-address 10.0.0.3/32 on Loopback1 on both LE03 switches. We then alter the VXLAN flood-list on SW1 to include [10.0.0.3] instead of [10.0.0.31, 10.0.0.32]. When SW1 performs its VXLAN-flooding, it sends one copy destined for 10.0.0.3 to SP01. SP01 has two paths to the destination and this time it sends the packet to LE03b, which receives it and floods it on all local switchports. Since no copy was sent to LE03a, no packet duplication was created. Problem solved!
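To make the difference between the two flood-lists concrete, here is a small Python model. It is only a sketch: the `VTEP_OWNERS` table and the function `receiving_switches` are names I made up, and the hash is a stand-in for whatever ECMP decision the spine makes, not the real algorithm.

```python
import hashlib

# Hypothetical ownership table: which switches answer for each VTEP address.
VTEP_OWNERS = {
    "10.0.0.31": ["LE03a"],
    "10.0.0.32": ["LE03b"],
    "10.0.0.3": ["LE03a", "LE03b"],  # Anycast address on Loopback1
}

def receiving_switches(flood_list, flow="arp-from-sw1"):
    """Return the switch that receives each replicated copy.

    Ingress replication sends one unicast copy per flood-list entry, and the
    underlay delivers each copy to exactly one owner of the destination IP.
    """
    receivers = []
    for vtep in flood_list:
        owners = VTEP_OWNERS[vtep]
        # Stand-in for the spine's ECMP hash (assumption, not the real one).
        digest = hashlib.sha256(f"{flow}->{vtep}".encode()).digest()
        receivers.append(owners[digest[0] % len(owners)])
    return receivers

per_switch = receiving_switches(["10.0.0.31", "10.0.0.32"])
anycast = receiving_switches(["10.0.0.3"])
print(len(per_switch), "copies reach the MLAG-pair:", per_switch)
print(len(anycast), "copy reaches the MLAG-pair:", anycast)
```

Every switch in the returned list floods its copy to the multihomed server, so the length of the list is the number of duplicates the server sees.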

MLAG configuration

It's time to review the MLAG configuration that was applied to LE03 and LE04 MLAG-pairs. I will share the full configuration for completeness, but the important parts will be examined further below.

service routing protocols model multi-agent
!
spanning-tree mode mstp
spanning-tree mst 0 priority 4096
no spanning-tree vlan-id 4094
!
vlan 10
   name VLAN10
!
vlan 4094
   name MLAG
   trunk group PEER-LINK
!
interface Port-Channel3
   description "PEER-LINK"
   switchport mode trunk
   switchport trunk group PEER-LINK
!
interface Port-Channel31
   description "SRV31"
   switchport mode trunk
   mlag 31
!
interface Port-Channel32
   description "SRV32"
   switchport mode trunk
   mlag 32
!
interface Ethernet1
   description "SP01"
   no switchport
   ip address 10.1.31.3/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet2
   description "SP02"
   no switchport
   ip address 10.2.31.3/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet3
   description "LE03b"
   channel-group 3 mode active
!
interface Ethernet31
   description "SRV31"
   channel-group 31 mode active
!
interface Ethernet32
   description "SRV32"
   channel-group 32 mode active
!
interface Loopback0
   ip address 10.0.0.31/32
!
interface Loopback1
   description "VXLAN SOURCE-INTERFACE"
   ip address 10.0.0.3/32
!
interface Vlan4094
   no autostate
   ip address 10.0.3.1/30
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 10 vni 10
!
ip routing
!
mlag configuration
   domain-id LE03
   local-interface Vlan4094
   peer-address 10.0.3.2
   peer-link Port-Channel3
   reload-delay mlag 60
   reload-delay non-mlag 30
!
router bgp 65000
   neighbor EVPN peer group
   neighbor EVPN remote-as 65000
   neighbor EVPN update-source Loopback0
   neighbor EVPN send-community
   neighbor 10.0.0.1 peer group EVPN
   neighbor 10.0.0.2 peer group EVPN
   !
   vlan 10
      rd 65000:10
      route-target both 65000:10
      redistribute learned
   !
   address-family evpn
      neighbor EVPN activate
   !
   address-family ipv4
      no neighbor EVPN activate
!
router ospf 1
   redistribute connected

service routing protocols model multi-agent
!
spanning-tree mode mstp
spanning-tree mst 0 priority 4096
no spanning-tree vlan-id 4094
!
vlan 10
   name VLAN10
!
vlan 4094
   name MLAG
   trunk group PEER-LINK
!
interface Port-Channel3
   description "PEER-LINK"
   switchport mode trunk
   switchport trunk group PEER-LINK
!
interface Port-Channel31
   description "SRV31"
   switchport mode trunk
   mlag 31
!
interface Port-Channel32
   description "SRV32"
   switchport mode trunk
   mlag 32
!
interface Ethernet1
   description "SP01"
   no switchport
   ip address 10.1.32.3/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet2
   description "SP02"
   no switchport
   ip address 10.2.32.3/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet3
   description "LE03a"
   channel-group 3 mode active
!
interface Ethernet31
   description "SRV31"
   channel-group 31 mode active
!
interface Ethernet32
   description "SRV32"
   channel-group 32 mode active
!
interface Loopback0
   ip address 10.0.0.32/32
!
interface Loopback1
   description "VXLAN SOURCE-INTERFACE"
   ip address 10.0.0.3/32
!
interface Vlan4094
   no autostate
   ip address 10.0.3.2/30
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 10 vni 10
!
ip routing
!
mlag configuration
   domain-id LE03
   local-interface Vlan4094
   peer-address 10.0.3.1
   peer-link Port-Channel3
   reload-delay mlag 60
   reload-delay non-mlag 30
!
router bgp 65000
   neighbor EVPN peer group
   neighbor EVPN remote-as 65000
   neighbor EVPN update-source Loopback0
   neighbor EVPN send-community
   neighbor 10.0.0.1 peer group EVPN
   neighbor 10.0.0.2 peer group EVPN
   !
   vlan 10
      rd 65000:10
      route-target both 65000:10
      redistribute learned
   !
   address-family evpn
      neighbor EVPN activate
   !
   address-family ipv4
      no neighbor EVPN activate
!
router ospf 1
   redistribute connected

service routing protocols model multi-agent
!
spanning-tree mode mstp
spanning-tree mst 0 priority 4096
no spanning-tree vlan-id 4094
!
vlan 10
   name VLAN10
!
vlan 4094
   name MLAG
   trunk group PEER-LINK
!
interface Port-Channel3
   description "PEER-LINK"
   switchport mode trunk
   switchport trunk group PEER-LINK
!
interface Port-Channel41
   description "SRV41"
   switchport mode trunk
   mlag 41
!
interface Port-Channel42
   description "SRV42"
   switchport mode trunk
   mlag 42
!
interface Ethernet1
   description "SP01"
   no switchport
   ip address 10.1.41.4/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet2
   description "SP02"
   no switchport
   ip address 10.2.41.4/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet3
   description "LE04b"
   channel-group 3 mode active
!
interface Ethernet41
   description "SRV41"
   channel-group 41 mode active
!
interface Ethernet42
   description "SRV42"
   channel-group 42 mode active
!
interface Loopback0
   ip address 10.0.0.41/32
!
interface Loopback1
   description "VXLAN SOURCE-INTERFACE"
   ip address 10.0.0.4/32
!
interface Vlan4094
   no autostate
   ip address 10.0.4.1/30
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 10 vni 10
!
ip routing
!
mlag configuration
   domain-id LE04
   local-interface Vlan4094
   peer-address 10.0.4.2
   peer-link Port-Channel3
   reload-delay mlag 60
   reload-delay non-mlag 30
!
router bgp 65000
   neighbor EVPN peer group
   neighbor EVPN remote-as 65000
   neighbor EVPN update-source Loopback0
   neighbor EVPN send-community
   neighbor 10.0.0.1 peer group EVPN
   neighbor 10.0.0.2 peer group EVPN
   !
   vlan 10
      rd 65000:10
      route-target both 65000:10
      redistribute learned
   !
   address-family evpn
      neighbor EVPN activate
   !
   address-family ipv4
      no neighbor EVPN activate
!
router ospf 1
   redistribute connected

service routing protocols model multi-agent
!
spanning-tree mode mstp
spanning-tree mst 0 priority 4096
no spanning-tree vlan-id 4094
!
vlan 10
   name VLAN10
!
vlan 4094
   name MLAG
   trunk group PEER-LINK
!
interface Port-Channel3
   description "PEER-LINK"
   switchport mode trunk
   switchport trunk group PEER-LINK
!
interface Port-Channel41
   description "SRV41"
   switchport mode trunk
   mlag 41
!
interface Port-Channel42
   description "SRV42"
   switchport mode trunk
   mlag 42
!
interface Ethernet1
   description "SP01"
   no switchport
   ip address 10.1.42.4/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet2
   description "SP02"
   no switchport
   ip address 10.2.42.4/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet3
   description "LE04a"
   channel-group 3 mode active
!
interface Ethernet41
   description "SRV41"
   channel-group 41 mode active
!
interface Ethernet42
   description "SRV42"
   channel-group 42 mode active
!
interface Loopback0
   ip address 10.0.0.42/32
!
interface Loopback1
   description "VXLAN SOURCE-INTERFACE"
   ip address 10.0.0.4/32
!
interface Vlan4094
   no autostate
   ip address 10.0.4.2/30
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 10 vni 10
!
ip routing
!
mlag configuration
   domain-id LE04
   local-interface Vlan4094
   peer-address 10.0.4.1
   peer-link Port-Channel3
   reload-delay mlag 60
   reload-delay non-mlag 30
!
router bgp 65000
   neighbor EVPN peer group
   neighbor EVPN remote-as 65000
   neighbor EVPN update-source Loopback0
   neighbor EVPN send-community
   neighbor 10.0.0.1 peer group EVPN
   neighbor 10.0.0.2 peer group EVPN
   !
   vlan 10
      rd 65000:10
      route-target both 65000:10
      redistribute learned
   !
   address-family evpn
      neighbor EVPN activate
   !
   address-family ipv4
      no neighbor EVPN activate
!
router ospf 1
   redistribute connected

As there is quite a lot going on in the configuration above, I will go through it step by step below. The following excerpt shows only the parts of the LE03a configuration necessary for setting up the MLAG-pair:

no spanning-tree vlan-id 4094
!
vlan 4094
   name MLAG
   trunk group PEER-LINK
!
interface Port-Channel3
   description "PEER-LINK"
   switchport mode trunk
   switchport trunk group PEER-LINK
!
interface Ethernet3
   description "LE03b"
   channel-group 3 mode active
!
interface Vlan4094
   description "LE03a-LE03b routed link"
   ip address 10.0.3.1/30
   no autostate
!
mlag configuration
   domain-id LE03
   local-interface Vlan4094
   peer-address 10.0.3.2
   peer-link Port-Channel3
   reload-delay non-mlag 30
   reload-delay mlag 60

MLAG Peer Link and MAC-learning problems

Arista MLAG is built around designating a physical peer-link interface between the two MLAG-switches. The peer-link is used for MLAG synchronization and MAC-learning. The peer-link can be a single Ethernet interface but the recommended setup is using a Port-Channel with multiple links to minimize the risk of the peer-link ever going down. In this configuration I'm using Port-Channel3 (Ethernet3) as my MLAG peer-link.

The peer-link is a switched trunk interface with all VLANs allowed. This is important as it facilitates MAC-learning for single-homed devices. For example, LE03b would not be able to learn the MAC-address of a device connected to LE03a, unless LE03a was able to flood broadcast frames from that device over the peer-link.

The peer-link being a switched trunk is extra important when VXLAN is used as LE03a has no other way to reliably flood the frame to LE03b. If we imagine that no peer-link existed between LE03a and LE03b, any broadcast frame received by LE03a from a downstream device would have to be VXLAN-flooded. Since both LE03-switches share the same VXLAN anycast IP-address (10.0.0.3), the VXLAN packet would have to be sent from 10.0.0.3 to 10.0.0.3. When SP01 or SP02 receive the packet they do a routing-lookup for 10.0.0.3 and find that there are two paths, one via LE03a and one via LE03b. There is therefore a 50% chance that the packet is sent back to LE03a instead of to LE03b, so VXLAN flooding does not work.

But if we're using EVPN, can't LE03b learn the MAC-address from LE03a via BGP? The answer is unfortunately no. We will see below that any EVPN route advertised by LE03a/b will have BGP next-hop 10.0.0.3 set, which is the MLAG Anycast IP-address. When LE03b receives the route from LE03a, BGP marks the route as invalid because the next-hop is a locally configured IP-address, and installing such a route could create a routing loop. So any EVPN route advertised by LE03a is ignored by LE03b, and vice versa.

LE03a#show bgp evpn route-type mac-ip detail
BGP routing table information for VRF default
Router identifier 10.0.0.31, local AS number 65000
BGP routing table entry for mac-ip 5001.0000.1234, Route Distinguisher: 65000:10
 Paths: 2 available
  Local
    10.0.0.3 from 10.0.0.1 (10.0.0.1)
      Origin IGP, metric -, localpref 100, weight 0, invalid, internal
      Originator: 10.0.0.32, Cluster list: 10.0.0.1
      Extended Community: Route-Target-AS:65000:10 TunnelEncap:tunnelTypeVxlan
      VNI: 10 ESI: 0000:0000:0000:0000:0000

Looking at the output above, we can see that LE03a received a route from LE03b (Originator: 10.0.0.32) for MAC-address 5001.0000.1234. The route is marked as invalid. Even though the output doesn't say why, it's because the nexthop (10.0.0.3) is a locally configured IP-address on LE03a (Loopback1).
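The check can be modelled like this in Python. Since the EOS output does not state the reason, this is only a sketch of my explanation above; the function name `is_path_usable` is made up, and the address list is taken from the LE03a configuration.

```python
import ipaddress

def is_path_usable(next_hop, local_interfaces):
    """Reject a BGP path whose next-hop is one of this switch's own addresses."""
    local_ips = {ipaddress.ip_interface(addr).ip for addr in local_interfaces}
    return ipaddress.ip_address(next_hop) not in local_ips

# Interface addresses configured on LE03a (from the configuration above).
le03a_interfaces = [
    "10.0.0.31/32",  # Loopback0
    "10.0.0.3/32",   # Loopback1, shared with LE03b
    "10.0.3.1/30",   # Vlan4094
    "10.1.31.3/28",  # Ethernet1 to SP01
    "10.2.31.3/28",  # Ethernet2 to SP02
]

print(is_path_usable("10.0.0.3", le03a_interfaces))  # route from LE03b
print(is_path_usable("10.0.0.4", le03a_interfaces))  # route from LE04a/b
```

The route from LE03b fails the check because 10.0.0.3 is configured on Loopback1, while routes from the LE04 pair pass.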

MLAG Vlan 4094

Another part of the MLAG configuration is VLAN 4094 and its associated VLAN-interface. This VLAN and SVI are dedicated to MLAG communication, giving the switches in the MLAG-pair a routed point-to-point link for MLAG control traffic. By configuring trunk group PEER-LINK on VLAN 4094 and Port-Channel3, the VLAN is guaranteed to only exist on the peer-link and not leak to any other switched interfaces.

Because this VLAN is used for mission-critical MLAG communication, the commands no autostate and no spanning-tree vlan-id 4094 are entered to ensure that the interface always stays active. While not strictly necessary, I have enabled OSPF on the Vlan4094 SVI, giving the switches a backup path in the unlikely event of one of the switches losing connectivity to both spines.

MLAG Reload Delay

The final step in our MLAG configuration is setting two reload-delay values: 30 seconds for non-mlag and 60 seconds for mlag interfaces. The purpose of these commands is to keep interfaces down while the switch is still loading after boot, so that it does not receive traffic from downstream devices before it is ready to start forwarding.

With this configuration, a switch will behave like this when it comes online after booting:

  1. After 0 seconds:

    As soon as the interfaces are ready, the peer-link comes up. MLAG can negotiate and synchronize network state. Because I enabled OSPF on the Vlan4094 interface, OSPF will come up allowing spine BGP adjacencies to establish.

  2. After 30 seconds:

    Any non-mlag interface will come up. These are typically spine uplinks and interfaces to single-homed devices. Spine OSPF adjacencies are established.

  3. After 60 seconds:

    The mlag interfaces come up. These are typically Port-Channels to downstream servers, signaling that the switch is ready to forward traffic.
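The three steps above can be sketched as a small timeline in Python. This is only an illustration of the ordering, not of how EOS implements it; the function name `interface_up_times` and the interface classification are my own, and the default values are the ones from the lab configuration.

```python
def interface_up_times(interfaces, reload_delay_mlag=60, reload_delay_non_mlag=30):
    """Return the number of seconds after boot at which each interface is released.

    interfaces maps an interface name to its kind:
    'peer-link', 'non-mlag' or 'mlag'.
    """
    delay = {
        "peer-link": 0,                    # comes up as soon as it is ready
        "non-mlag": reload_delay_non_mlag,
        "mlag": reload_delay_mlag,
    }
    return {name: delay[kind] for name, kind in interfaces.items()}

# Interfaces on LE03a, classified by hand (assumption for this sketch).
le03a = {
    "Port-Channel3": "peer-link",
    "Ethernet1": "non-mlag",
    "Ethernet2": "non-mlag",
    "Port-Channel31": "mlag",
    "Port-Channel32": "mlag",
}

timeline = interface_up_times(le03a)
for name, seconds in sorted(timeline.items(), key=lambda item: item[1]):
    print(f"t+{seconds:>3}s  {name}")
```

Passing `reload_delay_mlag=300` instead shows how far the server-facing Port-Channels lag behind the uplinks with more conservative timers.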

In reality these reload-delay values are usually much higher. Some Arista hardware platforms require 300 seconds before MLAG-interfaces come up, others need 600 seconds or more. In my tiny virtual lab I'm not too keen on waiting 5-10 minutes, so I set aggressive timers.

MLAG Port-Channel

Whenever you configure a multihomed Port-Channel, you need to assign an MLAG ID. For example, I assigned Port-Channel31 with the mlag 31 command. We must use the same ID on both switches in the MLAG-pair, as this information is used to identify the Port-Channel as an MLAG and not a normal LAG interface.
Once identified, secret MLAG sauce is used to synchronize Port-Channel state and MAC-addresses between the two switches. Once this is configured, you can see a "PeerEthernet" interface in the output, shown below:

LE03a#sh run int po31
interface Port-Channel31
   switchport mode trunk
   mlag 31

LE03a#show port-channel
Port-Channel3:
  Active Ports: "Ethernet3"

Port-Channel31:
  Active Ports: "Ethernet31" "PeerEthernet31"

Port-Channel32:
  Active Ports: "Ethernet32"
  Configured, but inactive ports:
       Port             Reason
    ------------------- -------------------------
    "PeerEthernet32"    waiting for LACP response

LE03b#sh run int po31
interface Port-Channel31
   switchport mode trunk
   mlag 31

LE03b#show port-channel
Port-Channel3:
  Active Ports: "Ethernet3"

Port-Channel31:
  Active Ports: "Ethernet31" "PeerEthernet31"

Port-Channel32:
  Active Ports: "PeerEthernet32"
  Configured, but inactive ports:
       Port         Reason
    --------------- -------------------------
    "Ethernet32"    waiting for LACP response

Examining the output above, we can see that Port-Channel3 is not an MLAG, because it has no PeerEthernet interface. Port-Channel31 has two member interfaces, Ethernet31 and PeerEthernet31, so it is an MLAG with one member interface on LE03a and the other on LE03b. Last in the output we can see Port-Channel32 where the Ethernet32 member interface on LE03b has not established correctly due to a lack of LACP messages from the server.

BGP EVPN next-hop IP-address

Despite the above configuration specifying Loopback0 as the update-source in the EVPN peer group BGP configuration, the LE03 and LE04 switches will set their Loopback1 IP-address as the BGP next-hop for any EVPN route they advertise. Let's look at an example:

SP01#show bgp evpn route-type mac-ip detail

BGP routing table entry for mac-ip 5001.0000.0041
 Route Distinguisher: 65000:10
 Paths: 2 available
  Local (Received from a RR-client)
    10.0.0.4 from 10.0.0.42 (10.0.0.42)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal, best
      Extended Community:
        Route-Target-AS:65000:10
        TunnelEncap:tunnelTypeVxlan
      VNI: 10
      ESI: 0000:0000:0000:0000:0000
  Local (Received from a RR-client)
    10.0.0.4 from 10.0.0.41 (10.0.0.41)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal
      Extended Community:
        Route-Target-AS:65000:10
        TunnelEncap:tunnelTypeVxlan
      VNI: 10
      ESI: 0000:0000:0000:0000:0000

BGP routing table entry for mac-ip 5001.0000.0031
 Route Distinguisher: 65000:10
 Paths: 2 available
  Local (Received from a RR-client)
    10.0.0.3 from 10.0.0.31 (10.0.0.31)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal, best
      Extended Community:
        Route-Target-AS:65000:10
        TunnelEncap:tunnelTypeVxlan
      VNI: 10
      ESI: 0000:0000:0000:0000:0000
  Local (Received from a RR-client)
    10.0.0.3 from 10.0.0.32 (10.0.0.32)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal
      Extended Community:
        Route-Target-AS:65000:10
        TunnelEncap:tunnelTypeVxlan
      VNI: 10
      ESI: 0000:0000:0000:0000:0000

As we can see in the output above, LE03a (10.0.0.31) and LE03b (10.0.0.32) both advertise the 5001.0000.0031 MAC-address with the next-hop set to Loopback1 IP-address 10.0.0.3. This is what enables the Anycast functionality, avoiding packet duplication and optimizing BUM traffic flooding. The BGP EVPN next-hop is decided by the vxlan source-interface command in the interface Vxlan1 configuration mode.

MLAG Conclusion

We have now examined the configuration of an MLAG switch-pair. We have seen what is needed to avoid packet duplication, why the peer-link must be a switched trunk interface and the limitations of BGP EVPN and Anycast IP-addressing.

Pros:
  • Active-active forwarding using MLAG Port-Channels
  • Reduced VXLAN-flooding thanks to Anycast
  • Well suited for datacenter networks where ToR switches are deployed in pairs
Cons:
  • Peer-link is necessary for MAC-learning
  • BGP EVPN routes from MLAG peer are discarded

Now that we have examined the MLAG solution, let's examine multihoming using capabilities built into EVPN.


EVPN ESI

EVPN was originally invented as an L2VPN protocol for Service-Provider (SP) networks, as a replacement for VPLS. VPLS has a drawback similar to standard (flood-and-learn) VXLAN: the data plane is also the control plane, so MAC-learning only happens when frames are flooded between PEs. Another drawback of VPLS is that it can't do active-active multihoming.

So EVPN was invented to fix the shortcomings of VPLS. Using BGP as control plane, MAC-addresses are advertised without the need for flooding frames across the topology. Active-active multihoming is implemented using ESI and EVPN route-types 1 and 4: Ethernet Auto-Discovery routes and Ethernet Segment routes, respectively. The former is used for multihoming MAC aliasing, the latter for Designated Forwarder (DF) elections. We will go into greater detail on both below.

One benefit that EVPN ESI has over MLAG is that a downstream device can connect to any combination of switches for multihoming. Whereas MLAG forces the downstream device to connect to switches in the same MLAG pair, with EVPN ESI a server can connect to any switch.

Lab Topology

Going back to our lab topology diagram and focusing on the right side this time, we have three standalone switches (LE05, LE06 and LE07) that provide EVPN Multihoming using the ESI method. Each switch connects to one singlehomed server: SRV51, SRV61 and SRV71, respectively. There are also four multihomed servers, for example SRV561 and SRV562, which are connected to LE05 and LE06.

Because all multihoming communication is sent via BGP, there is no need for a peer-link. This is what makes ESI more flexible than MLAG.

EVPN ESI and Packet Duplication

This EVPN ESI awesomeness does have a drawback compared to MLAG, and that is how packet duplication is avoided. Where MLAG solved this problem using an Anycast IP-address on both switches in the MLAG-pair, EVPN ESI must use a split horizon-based approach which can be quite complex. So buckle up, this is about to get nutty!

When LE05 and LE06 realize that they are both connected to SRV561, they independently run an algorithm to determine the Designated Forwarder (DF). Whoever becomes DF is responsible for forwarding Broadcast, Multicast and unknown Unicast (BUM) traffic to the multihomed server. The other switch or switches are not allowed to forward these frames, thus avoiding packet duplication. The same process occurs for the SRV562 ethernet segment and this time LE06 may be elected as DF. This helps share the BUM-traffic load between the switches.

For simplicity's sake we will assume that LE05 is DF for SRV561 and SRV562, and LE06 is DF for SRV671 and SRV672. Because SRV51, SRV61 and SRV71 are all singlehomed there is no need for a DF election on these ethernet segments.
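
To make the "independently run an algorithm" step a bit more concrete, here is a minimal Python sketch of the default procedure described in RFC 7432 (often called service carving), which picks a DF per VLAN on each segment. The function name and the VLAN numbers are only for illustration, and whether a given switch uses this default or another algorithm is an assumption; later in this article the lab uses a preference value instead.

```python
import ipaddress

def default_df(vtep_ips, vlan_id):
    """Return the DF for one VLAN on one ethernet segment (RFC 7432 default)."""
    # Every switch sorts the same candidate list numerically, lowest IP first.
    ordered = sorted(vtep_ips, key=lambda ip: int(ipaddress.ip_address(ip)))
    # The VLAN ID modulo the number of candidates selects the DF's position.
    return ordered[vlan_id % len(ordered)]

# VTEPs attached to the SRV561 ethernet segment, in arbitrary order.
segment_561 = ["10.0.0.6", "10.0.0.5"]
for vlan in (10, 11, 20):
    print(vlan, default_df(segment_561, vlan))
```

Because every switch sorts the same list and applies the same arithmetic, they all reach the same answer without talking to each other directly.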

Broadcast from LE03a

Let's say SRV31 behind LE03a/b sends an ARP broadcast frame. LE03a receives it and performs VXLAN flooding, sending a copy to each remote VTEP: LE04, LE05, LE06 and LE07.

Dotted line means ARP was not flooded out on this port

  • LE05 floods the ARP to SRV51 (singlehomed) and to SRV561/562 because LE05 is the DF.

  • LE06 floods the ARP to SRV61 (singlehomed) and to SRV671/672 because LE06 is the DF.

  • LE07 floods the ARP to SRV71 (singlehomed).

While this solution works very well, it can't scale as high as the MLAG solution. A limitation of VXLAN scalability is Ingress Replication: a switch can only generate a finite number of copies while VXLAN flooding before it reaches a hardware limit. While researching I found a document saying that a Cisco Nexus 9000 switch is limited to 64 peers per VNI with regard to Ingress Replication. In MLAG terms, 64 peers equal 128 switches thanks to the Anycast VTEP, because each MLAG pair counts as a single peer. In EVPN ESI terms, 64 peers equal 64 switches. This suggests that MLAG has twice the scalability of EVPN ESI.
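
The arithmetic above can be written down as a tiny Python sketch. The VTEP names come from the lab topology and the limit of 64 is the figure quoted above; the function names are my own, and treating an MLAG pair as exactly one peer slot is the assumption the comparison rests on.

```python
# VTEP IP -> physical switches behind it (lab topology from this article).
VTEPS = {
    "LE03": ["LE03a", "LE03b"],  # MLAG pair sharing one anycast VTEP
    "LE04": ["LE04a", "LE04b"],  # MLAG pair sharing one anycast VTEP
    "LE05": ["LE05"],
    "LE06": ["LE06"],
    "LE07": ["LE07"],
}

def flood_list(local_vtep):
    """Remote VTEPs that each receive one replicated copy of a BUM frame."""
    return sorted(v for v in VTEPS if v != local_vtep)

def max_switches(peer_limit, switches_per_vtep):
    """Physical switches reachable when every remote VTEP IP uses one peer slot."""
    return peer_limit * switches_per_vtep

print(flood_list("LE03"))
print(max_switches(64, 2), max_switches(64, 1))
```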

More on avoiding Packet Duplication

We're not done with Packet Duplication yet. I told you it was about to get nutty and we're getting closer. What happens when a device connected to an EVPN ESI leaf sends out an ARP broadcast frame?

Each multihomed server has a unique color to show that it has multiple connections. For example, only SRV561 is blue.

In this example SRV561 sends out an ARP broadcast frame. It happened to be sent to LE06 even though it also connects to LE05. To avoid packet duplication, two split-horizon rules must be used. The first rule goes as follows:

  • When a VTEP receives a BUM packet from another VTEP, the source IP address of the outer (VXLAN) IP header is examined to stop the packet from being flooded out on ethernet segments that are shared with the source VTEP.

This means that when LE05 receives its VXLAN-flooded copy of the ARP broadcast frame, it will see that the VXLAN packet came from LE06 (10.0.0.6). Based on this, LE05 must not forward this frame to SRV561 or SRV562 as these devices are also connected to LE06. Note that this overrides the default DF behavior. Even though LE05 is the DF for SRV561 and SRV562, because the frame came from LE06 it cannot be forwarded as doing so could cause a network loop. LE05 therefore only floods the frame to SRV51.

LE07 will only forward the frame to singlehomed SRV71. SRV671/672 also connect to LE06, so LE07 must not flood this BUM packet to them. LE07 would not do so anyway, as it is not the DF, but the above rule still takes precedence.

The second rule says this:

  • The ingress VTEP should always perform replication to all directly connected devices, regardless of DF status, for any BUM traffic received on any downstream port. This is called Local Bias in RFC 8365.

Because the first rule forced LE05 to override its DF behavior and not forward the BUM packet to SRV561 or SRV562, this rule forces LE06 to override its non-DF behavior and forward the frame to SRV562. LE06 also forwards the frame to SRV61, SRV671 and SRV672 as they too are directly connected. The packet is not forwarded to SRV561 as that would mean sending the frame out on the same interface it was received on.

The ARP packet travel path across the network.

These split-horizon rules effectively stop packet duplication by having the ingress VTEP perform local flooding. The egress VTEP only floods the frame out on ethernet segments that are not shared with the ingress VTEP.
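
The two rules, together with the DF assumption made earlier, can be checked with a small Python model. This is only a sketch: the attachment map and the DF table are copied from the lab description in this article, while the function names are my own and a real switch obviously does not forward frames using Python dictionaries.

```python
from collections import Counter

# Which servers hang off which ESI leaf (right side of the lab topology).
ATTACHED = {
    "LE05": ["SRV51", "SRV561", "SRV562"],
    "LE06": ["SRV61", "SRV561", "SRV562", "SRV671", "SRV672"],
    "LE07": ["SRV71", "SRV671", "SRV672"],
}
# DF per multihomed segment, as assumed earlier in this article.
DF = {"SRV561": "LE05", "SRV562": "LE05", "SRV671": "LE06", "SRV672": "LE06"}

def is_multihomed(server):
    return sum(server in servers for servers in ATTACHED.values()) > 1

def egress_flood(egress, ingress):
    """Servers an egress VTEP floods to for a BUM frame received over VXLAN."""
    shared_with_ingress = set(ATTACHED.get(ingress, []))
    out = []
    for server in ATTACHED[egress]:
        if server in shared_with_ingress:
            continue  # rule 1: segment is shared with the source VTEP
        if is_multihomed(server) and DF[server] != egress:
            continue  # ordinary DF filtering
        out.append(server)
    return out

def ingress_flood(ingress, source):
    """Rule 2 (local bias): flood to every local segment except the source."""
    return [server for server in ATTACHED[ingress] if server != source]

def flood(ingress, source=None):
    """Return, per leaf, the servers that receive a copy of the BUM frame."""
    return {
        leaf: ingress_flood(leaf, source) if leaf == ingress
        else egress_flood(leaf, ingress)
        for leaf in ATTACHED
    }

# SRV561 sends an ARP broadcast that happens to land on LE06.
delivered = flood("LE06", source="SRV561")
for leaf, servers in delivered.items():
    print(leaf, servers)

# Every other server should appear exactly once.
copies = Counter(s for servers in delivered.values() for s in servers)
print(copies)
```

Calling flood("LE03") models the earlier example where the broadcast arrives from a VTEP that shares no ethernet segments with LE05, LE06 or LE07, so only the DF filtering applies.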

Note: these rules only apply to VXLAN EVPN. MPLS EVPN utilizes labels to influence the split-horizon behavior.

EVPN ESI Configuration

Let's review the configuration necessary to build multihoming with ESI. I will again start by displaying the full configuration of LE05, LE06 and LE07 (in that order), then go into more detail further below.

service routing protocols model multi-agent
!
link tracking group EVPN-ESI-MH
   recovery delay 60
!
spanning-tree mode mstp
spanning-tree mst 0 priority 4096
!
vlan 10
   name VLAN10
!
vlan 20
   name VLAN20
!
interface Port-Channel51
   switchport mode trunk
!
interface Port-Channel561
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0005:0006:0561
      route-target import 00:05:00:06:05:61
   lacp system-id 5001.0005.0006
   link tracking group EVPN-ESI-MH downstream
!
interface Port-Channel562
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0005:0006:0562
      route-target import 00:05:00:06:05:62
   lacp system-id 5001.0005.0006
   link tracking group EVPN-ESI-MH downstream
!
interface Ethernet1
   no switchport
   ip address 10.1.5.5/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
   link tracking group EVPN-ESI-MH upstream
!
interface Ethernet2
   no switchport
   ip address 10.2.5.5/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
   link tracking group EVPN-ESI-MH upstream
!
interface Ethernet51
   description "SRV51"
   channel-group 51 mode active
!
interface Ethernet56/1
   description "SRV561"
   channel-group 561 mode active
!
interface Ethernet56/2
   description "SRV562"
   channel-group 562 mode active
!
interface Loopback0
   ip address 10.0.0.5/32
!
interface Vxlan1
   vxlan source-interface Loopback0
   vxlan udp-port 4789
   vxlan vlan 10 vni 10
   vxlan vlan 20 vni 20
!
ip routing
!
router bgp 65000
   neighbor EVPN peer group
   neighbor EVPN remote-as 65000
   neighbor EVPN update-source Loopback0
   neighbor EVPN send-community
   neighbor 10.0.0.1 peer group EVPN
   neighbor 10.0.0.2 peer group EVPN
   !
   vlan 10
      rd 10.0.0.5:10
      route-target both 65000:10
      redistribute learned
   !
   vlan 20
      rd 10.0.0.5:20
      route-target both 65000:20
      redistribute learned
   !
   address-family evpn
      neighbor EVPN activate
   !
   address-family ipv4
      no neighbor EVPN activate
!
router ospf 1
   redistribute connected

service routing protocols model multi-agent
!
link tracking group EVPN-ESI-MH
   recovery delay 60
!
spanning-tree mode mstp
spanning-tree mst 0 priority 4096
!
vlan 10
   name VLAN10
!
vlan 20
   name VLAN20
!
interface Port-Channel61
   switchport mode trunk
!
interface Port-Channel67
   lacp system-id 5001.0000.0067
!
interface Port-Channel561
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0005:0006:0561
      route-target import 00:05:00:06:05:61
   lacp system-id 5001.0005.0006
   link tracking group EVPN-ESI-MH downstream
!
interface Port-Channel562
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0005:0006:0562
      route-target import 00:05:00:06:05:62
   lacp system-id 5001.0005.0006
   link tracking group EVPN-ESI-MH downstream
!
interface Port-Channel671
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0006:0007:0671
      route-target import 00:06:00:07:06:71
   lacp system-id 5001.0006.0007
   link tracking group EVPN-ESI-MH downstream
!
interface Port-Channel672
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0006:0007:0672
      route-target import 00:06:00:07:06:72
   lacp system-id 5001.0006.0007
   link tracking group EVPN-ESI-MH downstream
!
interface Ethernet1
   no switchport
   ip address 10.1.6.6/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
   link tracking group EVPN-ESI-MH upstream
!
interface Ethernet2
   no switchport
   ip address 10.2.6.6/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
   link tracking group EVPN-ESI-MH upstream
!
interface Ethernet56/1
   description "SRV561"
   channel-group 561 mode active
!
interface Ethernet56/2
   description "SRV562"
   channel-group 562 mode active
!
interface Ethernet67/1
   description "SRV671"
   channel-group 671 mode active
!
interface Ethernet67/2
   description "SRV672"
   channel-group 672 mode active
!
interface Ethernet61
   description "SRV61"
   channel-group 61 mode active
!
interface Loopback0
   ip address 10.0.0.6/32
!
interface Vxlan1
   vxlan source-interface Loopback0
   vxlan udp-port 4789
   vxlan vlan 10 vni 10
   vxlan vlan 20 vni 20
!
ip routing
!
router bgp 65000
   neighbor EVPN peer group
   neighbor EVPN remote-as 65000
   neighbor EVPN update-source Loopback0
   neighbor EVPN send-community
   neighbor 10.0.0.1 peer group EVPN
   neighbor 10.0.0.2 peer group EVPN
   !
   vlan 10
      rd 10.0.0.6:10
      route-target both 65000:10
      redistribute learned
   !
   vlan 20
      rd 10.0.0.6:20
      route-target both 65000:20
      redistribute learned
   !
   address-family evpn
      neighbor EVPN activate
   !
   address-family ipv4
      no neighbor EVPN activate
!
router ospf 1
   redistribute connected

service routing protocols model multi-agent
!
link tracking group EVPN-ESI-MH
   recovery delay 60
!
logging console informational
logging synchronous level informational
!
spanning-tree mode mstp
spanning-tree mst 0 priority 4096
!
vlan 10
   name VLAN10
!
vlan 20
   name VLAN20
!
interface Port-Channel71
   switchport mode trunk
!
interface Port-Channel671
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0006:0007:0671
      route-target import 00:06:00:07:06:71
   lacp system-id 5001.0006.0007
   link tracking group EVPN-ESI-MH downstream
!
interface Port-Channel672
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0006:0007:0672
      route-target import 00:06:00:07:06:72
   lacp system-id 5001.0006.0007
   link tracking group EVPN-ESI-MH downstream
!
interface Ethernet1
   no switchport
   ip address 10.1.7.7/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
   link tracking group EVPN-ESI-MH upstream
!
interface Ethernet2
   no switchport
   ip address 10.2.7.7/28
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
   link tracking group EVPN-ESI-MH upstream
!
interface Ethernet67/1
   description "SRV671"
   channel-group 671 mode active
!
interface Ethernet67/2
   description "SRV672"
   channel-group 672 mode active
!
interface Ethernet71
   description "SRV71"
   channel-group 71 mode active
!
interface Loopback0
   ip address 10.0.0.7/32
!
interface Vxlan1
   vxlan source-interface Loopback0
   vxlan udp-port 4789
   vxlan vlan 10 vni 10
   vxlan vlan 20 vni 20
!
ip routing
!
router bgp 65000
   neighbor EVPN peer group
   neighbor EVPN remote-as 65000
   neighbor EVPN update-source Loopback0
   neighbor EVPN send-community
   neighbor 10.0.0.1 peer group EVPN
   neighbor 10.0.0.2 peer group EVPN
   !
   vlan 10
      rd 10.0.0.7:10
      route-target both 65000:10
      redistribute learned
   !
   vlan 20
      rd 10.0.0.7:20
      route-target both 65000:20
      redistribute learned
   !
   address-family evpn
      neighbor EVPN activate
   !
   address-family ipv4
      no neighbor EVPN activate
!
router ospf 1
   redistribute connected

With the full configuration shown above, let's take a deeper look at the configuration lines that enable the EVPN multihoming functionality:

interface Port-Channel51
   switchport mode trunk
!
interface Port-Channel561
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0005:0006:0561
      route-target import 00:05:00:06:05:61
   lacp system-id 5001.0005.0006
   link tracking group EVPN-ESI-MH downstream
!
interface Port-Channel562
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0000:0000:0005:0006:0562
      route-target import 00:05:00:06:05:62
   lacp system-id 5001.0005.0006

Starting with Port-Channel51, this one is very simple as SRV51 is singlehomed to LE05. No special configuration is required. Moving on to Port-Channel561, we see a couple of new commands:

  • We use the identifier 0000:0000:0005:0006:0561 command to uniquely identify this Ethernet Segment. This is the ESI. By configuring the same value on LE05 and LE06, they understand that they connect to the same ethernet segment, SRV561.
    As long as the ESI value starts with 00, the rest of the identifier can contain any combination of hexadecimal characters. I elected to use a 0000:0000:<lower-leaf-ID>:<higher-leaf-ID>:<port-channel-ID> format.

  • The route-target import 00:05:00:06:05:61 command is used to create an inbound route-filter so that only switches with the same ESI configured import the Ethernet Segment route. I use the <lower-leaf-ID>:<higher-leaf-ID>:<port-channel-ID> format.

  • The lacp system-id 5001.0005.0006 command is used to make sure LE05 and LE06 send the same LACP system ID to SRV561 when negotiating the port-channel (LAG). SRV561 will not bring both interfaces up if it thinks they connect to different switches. This solves the same problem that MLAG did but at the interface level. I use the 5001.<lower-leaf-ID>.<higher-leaf-ID> syntax.

The Port-Channel562 configuration follows the same syntax and procedure. This ensures that each multihomed ethernet segment is uniquely identified.

Note: Even singlehomed ethernet segments have an ESI value; they use the default all-zeroes ESI.
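
If you want to generate these three values consistently, the numbering scheme described above can be sketched in a few lines of Python. Note that this only encodes my own naming convention from this article, not anything mandated by EVPN, and the function name esi_values is made up for the example. It assumes leaf IDs and port-channel IDs of at most four decimal digits.

```python
def esi_values(leaf_a, leaf_b, port_channel):
    """Build the ESI, ES-import route-target and LACP system-id strings."""
    low, high = sorted((leaf_a, leaf_b))
    # Twelve digits: lower leaf ID, higher leaf ID, port-channel ID.
    digits = f"{low:04d}{high:04d}{port_channel:04d}"
    esi = "0000:0000:" + ":".join(digits[i:i + 4] for i in range(0, 12, 4))
    import_rt = ":".join(digits[i:i + 2] for i in range(0, 12, 2))
    lacp_system_id = f"5001.{low:04d}.{high:04d}"
    return esi, import_rt, lacp_system_id

for leaves, po in (((5, 6), 561), ((5, 6), 562), ((6, 7), 671)):
    print(po, esi_values(*leaves, po))
```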

Route-Type 1: Auto-Discovery (AD)

This route-type is used for ESI multihoming. Its purpose is signaling the link-state of the local Port-Channel interface. One could argue that this is not necessary as the switch could just withdraw any MAC-IP route once an interface goes down. However, we will discover why this is a good thing below.

Note: This route is not advertised if the ESI value is all-zeroes (singlehomed).

LE05#show bgp evpn route-type auto-discovery esi 0000:0000:0005:0006:0561
"Routes originated by LE05:"
     Network                Next Hop              Metric  LocPref Weight  Path
 * > RD: 10.0.0.5:1  auto-discovery   0000:0000:0005:0006:0561
                            -                     -       -       0       i
 * > RD: 10.0.0.5:10 auto-discovery 0 0000:0000:0005:0006:0561
                            -                     -       -       0       i
 * > RD: 10.0.0.5:20 auto-discovery 0 0000:0000:0005:0006:0561
                            -                     -       -       0       i
"Routes originated by LE06:"
 * > RD: 10.0.0.6:1  auto-discovery   0000:0000:0005:0006:0561
                            10.0.0.6              -       100     0       i Or-ID: 10.0.0.6
 * > RD: 10.0.0.6:10 auto-discovery 0 0000:0000:0005:0006:0561
                            10.0.0.6              -       100     0       i Or-ID: 10.0.0.6
 * > RD: 10.0.0.6:20 auto-discovery 0 0000:0000:0005:0006:0561
                            10.0.0.6              -       100     0       i Or-ID: 10.0.0.6

Focusing on specific lines from the output above:

  • The 10.0.0.6:1 auto-discovery 0000 route is advertised when the interface is physically up and forwarding. If Port-Channel561 goes down on LE06, this route will be withdrawn.

  • The 10.0.0.6:10 auto-discovery 0 0000 route is withdrawn when VLAN 10/VNI 10 is no longer available on that interface. If I run the switchport trunk allowed vlan remove 10 command on interface Port-Channel561 on LE06, this route will be withdrawn.

So EVPN is used to signal both the link-state of a physical interface and the state of individual VLANs on that interface. These routes on their own do not accomplish much, but if we keep digging we find references to these ESIs in our MAC-IP routes:

LE07#show bgp evpn vni 10
     Network                Next Hop              Metric  LocPref Weight  Path
 * > RD: 10.0.0.5:10 auto-discovery 0 0000:0000:0005:0006:0561
                            10.0.0.5              -       100     0       i Or-ID: 10.0.0.5
 * > RD: 10.0.0.6:10 auto-discovery 0 0000:0000:0005:0006:0561
                            10.0.0.6              -       100     0       i Or-ID: 10.0.0.6
 * > RD: 10.0.0.5:10 mac-ip 5001.0000.0561
                            10.0.0.5              -       100     0       i Or-ID: 10.0.0.5

LE07#show bgp evpn route-type mac-ip vni 10 detail
BGP routing table entry for mac-ip 5001.0000.0561, Route Distinguisher: 10.0.0.5:10
 Paths: 2 available
  Local
    10.0.0.5 from 10.0.0.1 (10.0.0.1)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal, best
      Originator: 10.0.0.5, Cluster list: 10.0.0.1
      Extended Community:
        Route-Target-AS:65000:10
        TunnelEncap:tunnelTypeVxlan
      VNI: 10
      ESI: 0000:0000:0005:0006:0561

LE07#show vxlan address-table
VLAN  Mac Address     Type      Prt  VTEP
----  -----------     ----      ---  ----
  10  5001.0000.0561  EVPN      Vx1  10.0.0.5
                                     10.0.0.6

In this example LE07 has only learned the MAC-address 5001.0000.0561 from LE05, shown above. Despite this, its VXLAN address table specifies both LE05 and LE06 as valid nexthops for traffic to that MAC-address. This is because both LE05 and LE06 advertise an Auto-Discovery route for ESI 0000:0000:0005:0006:0561. As soon as LE06 withdraws the AD-route for 0000:0000:0005:0006:0561 in VNI 10, LE07 updates its VXLAN-table to only use LE05 as the nexthop.

LE07 is able to do this thanks to what Arista calls a MAC Aliasing mechanism. It covers the case where only one of the switches on an ethernet segment has learned a MAC-address, which according to an Arista document can be quite a common occurrence. One such example is SRV561 deciding to only forward traffic via its LE05-interface.

Additionally, if SRV561 is only sending its traffic to LE05 then LE06 never gets a chance to locally learn the MAC-address. Thanks to MAC Aliasing, LE06 is still able to install the SRV561 MAC-address entry based on the information received from LE05. If SRV561 starts sending traffic on its LE06-interface, LE06 learns the MAC-address locally and starts advertising the MAC-address to its EVPN neighbors.
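
The way LE07 arrives at two nexthops from a single MAC-IP route can be sketched in Python. The route contents mirror the LE07 output above; the data structures and the next_hops function are hypothetical simplifications (a real implementation also takes the per-ES AD-route into account).

```python
ESI_561 = "0000:0000:0005:0006:0561"
ZERO_ESI = "0000:0000:0000:0000:0000"  # singlehomed segments

# What LE07 has received for VNI 10: one MAC-IP route, two AD-routes.
mac_ip_routes = [
    {"mac": "5001.0000.0561", "vni": 10, "vtep": "10.0.0.5", "esi": ESI_561},
]
ad_per_evi_routes = [
    {"vni": 10, "vtep": "10.0.0.5", "esi": ESI_561},
    {"vni": 10, "vtep": "10.0.0.6", "esi": ESI_561},
]

def next_hops(mac, vni, mac_routes, ad_routes):
    """Resolve a MAC to every VTEP that claims the same ethernet segment."""
    hops = set()
    for route in mac_routes:
        if route["mac"] != mac or route["vni"] != vni:
            continue
        if route["esi"] == ZERO_ESI:
            hops.add(route["vtep"])  # singlehomed: only the advertising VTEP
        else:
            # Aliasing: any VTEP with an AD-route for this ESI and VNI qualifies.
            hops.update(ad["vtep"] for ad in ad_routes
                        if ad["esi"] == route["esi"] and ad["vni"] == vni)
    return sorted(hops)

print(next_hops("5001.0000.0561", 10, mac_ip_routes, ad_per_evi_routes))

# LE06 withdraws its AD-route for VNI 10: only LE05 remains.
remaining = [ad for ad in ad_per_evi_routes if ad["vtep"] != "10.0.0.6"]
print(next_hops("5001.0000.0561", 10, mac_ip_routes, remaining))
```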

AD-route for Mass Withdrawal

Another strength of the AD-route is its mass withdrawal feature to improve convergence time. Let's assume 1000 MAC-addresses live on the SRV561 ethernet segment. Let's then imagine that LE06 loses connectivity to SRV561. LE06 must now tell the network that it should no longer be sent any traffic destined for these 1000 MAC-addresses. Withdrawing 1000 MAC-IP routes will take a significant amount of time as LE06 has to generate and send them, the RRs have to reflect them and all other switches must process them to remove each MAC-address from their VXLAN address tables. This can be very slow and resource-intensive, negatively affecting the network convergence time.

Instead, LE06 sends a single AD-route withdrawal first, announcing that it lost connectivity to the SRV561 ethernet segment. This one route is quickly reflected and processed on all switches, allowing them to efficiently remove LE06 as nexthop for all MAC-addresses mapped to the 0000:0000:0005:0006:0561 ESI. The network convergence time is now only a fraction of what it would otherwise be. LE06 can now generate its 1000 MAC-IP route withdrawals at a leisurely pace to be reflected and processed without impacting the network convergence time.
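
The reason a single withdrawal is enough is the indirection through the ESI: MAC-addresses point at the ESI, and the ESI points at the VTEPs. Here is a rough Python sketch of that idea. The 1000 MAC-addresses are generated placeholders and the table layout is an assumption for illustration, not how any particular switch stores its forwarding state.

```python
ESI_561 = "0000:0000:0005:0006:0561"

# 1000 made-up MAC-addresses, all living on the same ethernet segment.
mac_to_esi = {f"5001.0000.{i:04x}": ESI_561 for i in range(1000)}
# VTEPs currently advertising an AD-route for the segment.
esi_to_vteps = {ESI_561: {"10.0.0.5", "10.0.0.6"}}

def vteps_for(mac):
    """Nexthops for a MAC are looked up through its ESI."""
    return sorted(esi_to_vteps[mac_to_esi[mac]])

def withdraw_ad_route(esi, vtep):
    """One AD-route withdrawal changes a single table entry."""
    esi_to_vteps[esi].discard(vtep)

print(vteps_for("5001.0000.0000"))

# LE06 loses its link to SRV561 and withdraws one AD-route.
withdraw_ad_route(ESI_561, "10.0.0.6")
still_via_le06 = [mac for mac in mac_to_esi if "10.0.0.6" in vteps_for(mac)]
print(len(mac_to_esi), len(still_via_le06))
```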

The importance of a unique RD

When you configure EVPN multihoming with ESI, you must ensure that every switch advertises its routes with a unique Route-Distinguisher. This is necessary to ensure that all type-1 Auto-Discovery routes are received correctly. If you use the same RD (65000:10 for example) on all switches then routes coming from other switches will appear identical to one that the switch originates itself. Because of this, BGP will prefer its own locally originated route and discard the others. These auto-discovery routes must be received for multihoming to function correctly, so you must use the Router-ID:VNI Route-Distinguisher format shown below:

LE05#show run sec bgp
router bgp 65000
   !
   vlan 10
      rd 10.0.0.5:10 <-- "Very important"
      route-target both 65000:10
      redistribute learned

LE06#show run sec bgp
router bgp 65000
   !
   vlan 10
      rd 10.0.0.6:10 <-- "Very important"
      route-target both 65000:10
      redistribute learned
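
To see why a shared RD breaks things, remember that BGP keeps one best path per prefix, and for an Auto-Discovery route the RD is part of that prefix. The Python sketch below is a deliberately crude model: the reflect function simply keeps one route per key and is not a real best-path algorithm, and the shared RD 65000:10 is the hypothetical bad example from the text above.

```python
ESI_561 = "0000:0000:0005:0006:0561"

def ad_route(rd, vtep):
    return {"rd": rd, "esi": ESI_561, "tag": 0, "next_hop": vtep}

def reflect(routes):
    """Keep one path per (RD, ESI, tag), like one best path per BGP prefix."""
    best = {}
    for route in routes:
        key = (route["rd"], route["esi"], route["tag"])
        # Which path wins is irrelevant here; only one survives per key.
        best.setdefault(key, route)
    return list(best.values())

shared_rd = [ad_route("65000:10", "10.0.0.5"), ad_route("65000:10", "10.0.0.6")]
unique_rd = [ad_route("10.0.0.5:10", "10.0.0.5"), ad_route("10.0.0.6:10", "10.0.0.6")]

print(len(reflect(shared_rd)), [r["next_hop"] for r in reflect(shared_rd)])
print(len(reflect(unique_rd)), [r["next_hop"] for r in reflect(unique_rd)])
```

With a shared RD only one of the two AD-routes survives, so a remote switch never learns that the other switch is attached to the segment.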

Route-Type 4: Ethernet Segment (ES)

This route is used for Designated Forwarder elections. For every multihomed Ethernet segment, a DF must be elected.

LE05#sh run int po561
interface Port-Channel561
   switchport mode trunk
   switchport
   !
   evpn ethernet-segment
      identifier 0000:0000:0005:0006:0561
      designated-forwarder election algorithm preference 50
      route-target import 00:05:00:06:05:61
   lacp system-id 5001.0005.0006

LE05#show bgp evpn route-type ethernet-segment esi 0000:0000:0005:0006:0561 detail
BGP routing table information for VRF default
Router identifier 10.0.0.5, local AS number 65000
BGP routing table entry for ethernet-segment 0000:0000:0005:0006:0561 10.0.0.5
 Route Distinguisher: 10.0.0.5:1
 Paths: 1 available
  Local
    - from - (0.0.0.0)
      Origin IGP, metric -, localpref -, weight 0, valid, local, best
      Extended Community:
        TunnelEncap:tunnelTypeVxlan
        EvpnEsImportRt:00:05:00:06:05:61
        DF Election: Preference 50
BGP routing table entry for ethernet-segment 0000:0000:0005:0006:0561 10.0.0.6
 Route Distinguisher: 10.0.0.6:1
 Paths: 1 available
  Local
    10.0.0.6 from 10.0.0.1 (10.0.0.1)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal, best
      Originator: 10.0.0.6, Cluster list: 10.0.0.1
      Extended Community:
        TunnelEncap:tunnelTypeVxlan
        EvpnEsImportRt:00:05:00:06:05:61
        DF Election: Preference 100

LE05#show bgp evpn inst
EVPN instance: VLAN 10
  Local ethernet segment:
    ESI: 0000:0000:0005:0006:0561
      Interface: Port-Channel561
      Mode: all-active
      State: up
      ES-Import RT: 00:05:00:06:05:61
      DF election algorithm: preference
      Designated forwarder: 10.0.0.6
      Non-Designated forwarder: 10.0.0.5
EVPN instance: VLAN 20
    ESI: 0000:0000:0005:0006:0561
      Interface: Port-Channel561
      Mode: all-active
      State: up
      ES-Import RT: 00:05:00:06:05:61
      DF election algorithm: preference
      Designated forwarder: 10.0.0.6
      Non-Designated forwarder: 10.0.0.5

LE06#show bgp evpn route-type ethernet-segment esi 0000:0000:0005:0006:0561 detail
BGP routing table information for VRF default
Router identifier 10.0.0.6, local AS number 65000
BGP routing table entry for ethernet-segment 0000:0000:0005:0006:0561 10.0.0.5
 Route Distinguisher: 10.0.0.5:1
 Paths: 1 available
  Local
    10.0.0.5 from 10.0.0.1 (10.0.0.1)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal, best
      Originator: 10.0.0.5, Cluster list: 10.0.0.1
      Extended Community:
        TunnelEncap:tunnelTypeVxlan
        EvpnEsImportRt:00:05:00:06:05:61
        DF Election: Preference 50
BGP routing table entry for ethernet-segment 0000:0000:0005:0006:0561 10.0.0.6
 Route Distinguisher: 10.0.0.6:1
 Paths: 1 available
  Local
    - from - (0.0.0.0)
      Origin IGP, metric -, localpref -, weight 0, valid, local, best
      Extended Community:
        TunnelEncap:tunnelTypeVxlan
        EvpnEsImportRt:00:05:00:06:05:61
        DF Election: Preference 100

LE06#show bgp evpn inst
EVPN instance: VLAN 10
  Local ethernet segment:
    ESI: 0000:0000:0005:0006:0561
      Interface: Port-Channel561
      Mode: all-active
      State: up
      ES-Import RT: 00:05:00:06:05:61
      DF election algorithm: preference
      Designated forwarder: 10.0.0.6
      Non-Designated forwarder: 10.0.0.5
EVPN instance: VLAN 20
  Local ethernet segment:
    ESI: 0000:0000:0005:0006:0561
      Interface: Port-Channel561
      Mode: all-active
      State: up
      ES-Import RT: 00:05:00:06:05:61
      DF election algorithm: preference
      Designated forwarder: 10.0.0.6
      Non-Designated forwarder: 10.0.0.5

In the output above we can see that a preference value was used to make sure that LE06 becomes the DF for the SRV561 ethernet segment. All data is carried by BGP as extended community values, ensuring that all switches independently come to the same conclusion about who should be DF. Because LE06 has the higher preference value (100 vs 50), it is elected as DF.

The show bgp evpn instance command shows the DF status for each multihomed ethernet segment. To save space I only include output for the SRV561 segment.
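
The preference-based election can be sketched in Python as follows. The preference values are taken from the Ethernet Segment routes shown above; the function name is made up, and the tie-break on lowest IP address is an assumption on my part rather than something verified in this lab.

```python
import ipaddress

def preference_df(candidates):
    """candidates maps VTEP IP -> DF preference advertised in its ES route."""
    # Highest preference wins. Assumption: the lowest IP address breaks ties.
    return min(
        candidates,
        key=lambda ip: (-candidates[ip], int(ipaddress.ip_address(ip))),
    )

# Values carried in the route-type 4 extended communities for SRV561.
es_routes_561 = {"10.0.0.5": 50, "10.0.0.6": 100}
print(preference_df(es_routes_561))
```

Both LE05 and LE06 hold the same two routes, so each of them runs this comparison on identical input and picks the same winner.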

Link Tracking

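With MLAG we had a nice feature called reload-delay which allowed the operator to set up timers to stagger the process of bringing interfaces online after the switch had finished booting. This avoids traffic blackholing by stopping the switch from receiving traffic before it is ready to forward it. With EVPN ESI we are not using MLAG, so to get the same functionality we use the Arista link tracking feature. The relevant lines, taken from the full LE05 configuration above, look like this:

link tracking group EVPN-ESI-MH
   recovery delay 60
!
interface Port-Channel561
   link tracking group EVPN-ESI-MH downstream
!
interface Ethernet1
   link tracking group EVPN-ESI-MH upstream
!
interface Ethernet2
   link tracking group EVPN-ESI-MH upstream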

A link-tracking group EVPN-ESI-MH was created, tracking the spine uplinks. If both Ethernet1 and Ethernet2 interfaces go down, the link-tracking group puts Port-Channel561 into down state, signaling to SRV561 that LE05 cannot forward any traffic. After Ethernet1 or Ethernet2 comes back up, a 60 second timer starts. Once the timer ends, Port-Channel561 is put into up state, signaling to SRV561 that LE05 is ready to forward traffic again.
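
The behavior described above can be modeled as a small state machine. This Python sketch follows only the description in this article (all uplinks down forces the downstream ports down, the first uplink to return starts the recovery timer); the class and method names are invented and the exact timer semantics in EOS may differ.

```python
class LinkTrackingGroup:
    def __init__(self, upstream, downstream, recovery_delay=60):
        self.upstream = {name: True for name in upstream}  # True = link up
        self.downstream = list(downstream)
        self.recovery_delay = recovery_delay
        self.forced_down = False
        self.recover_at = None  # time when downstream ports may return

    def set_upstream(self, name, is_up, now):
        self.upstream[name] = is_up
        if not any(self.upstream.values()):
            # Every uplink is gone: take the server-facing ports down.
            self.forced_down = True
            self.recover_at = None
        elif self.forced_down and self.recover_at is None:
            # First uplink is back: start the recovery delay.
            self.recover_at = now + self.recovery_delay

    def downstream_state(self, now):
        if self.forced_down and self.recover_at is not None and now >= self.recover_at:
            self.forced_down = False
            self.recover_at = None
        state = "down" if self.forced_down else "up"
        return {port: state for port in self.downstream}

group = LinkTrackingGroup(["Ethernet1", "Ethernet2"], ["Port-Channel561"])
group.set_upstream("Ethernet1", False, now=0)
group.set_upstream("Ethernet2", False, now=5)
print(group.downstream_state(now=6))
group.set_upstream("Ethernet1", True, now=10)
print(group.downstream_state(now=69))
print(group.downstream_state(now=70))
```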

STP Super Root

With MLAG, the two switches in the pair could synchronize their STP Bridge ID to ensure that downstream switches would see them as one switch. To achieve the same functionality without MLAG, Arista developed an STP super root feature, enabling a switch to send STP BPDUs with Bridge ID 0000.0000.0001 and priority set to 0. This allows L2 switches to be multihomed to LE05/6/7 just like with MLAG. The magic command is spanning-tree root super.

If your particular vendor does not support a similar command, the recommendation is to filter BPDUs on the EVPN ESI port-channel and leave it up to the downstream switch to ensure that the topology is loop-free.

EVPN ESI Conclusion

This is a solid alternative (and perhaps the only one) when MLAG cannot be used. It avoids packet duplication using complex split-horizon rules. As EVPN carries all information, there is no need for a peer-link.

Pros:

  • Based on open standards and RFCs, nothing is vendor proprietary
  • No peer-link so multihoming is very flexible
  • No peer-link so switches may be much further apart
  • Transparency, all information is carried by BGP

Cons:

  • Packet duplication split-horizon rules are complex
  • Multihoming uses multiple EVPN routes, requiring a larger BGP RIB than MLAG
  • Less scalable than MLAG in terms of Ingress Replication limitations

Conclusion

We have now compared MLAG and ESI for EVPN Multihoming. Both solutions must solve the packet duplication problem and they use vastly different methods of doing so.

If you're deploying a data center network, I would recommend the MLAG option. It is simple, solid and less complex than EVPN ESI. Its drawback is that it is proprietary to each vendor. You have to trust that their implementation is stable.

If you're deploying a service provider network you may not be able to deploy MLAG, and in that case EVPN ESI is your only alternative. With EVPN ESI you have a more complex but transparent setup where everything is advertised with BGP. This technically lowers your dependence on a single vendor as everything is using open standards and RFCs. As for real-world vendor inter-operability, only time and testing will tell. Two vendors may implement features or interpret RFCs differently.

I really enjoyed writing this article (although it took me quite a while to put together) and I learned a ton about EVPN ESI Multihoming along the way. I hope you enjoyed reading this article. Thanks for visiting, I hope to see you again soon!

Copyright 2021-2023, Emil Eliasson.
All Rights Reserved.