Internet-Draft Fast Recovery for EVPN DF-Election July 2024
Brissette, et al. Expires 9 January 2025 [Page]
Workgroup:
BESS Working Group
Internet-Draft:
draft-ietf-bess-evpn-fast-df-recovery-09
Updates:
8584 (if approved)
Published:
Intended Status:
Standards Track
Expires:
Authors:
P. Brissette, Ed.
Cisco
A. Sajassi
Cisco
LA. Burdet
Cisco
J. Drake
Independent
J. Rabadan
Nokia

Fast Recovery for EVPN Designated Forwarder Election

Abstract

The Ethernet Virtual Private Network (EVPN) solution provides Designated Forwarder (DF) election procedures for multihomed Ethernet Segments. These procedures have been enhanced further by applying Highest Random Weight (HRW) algorithm for Designated Forwarder election in order to avoid unnecessary DF status changes upon a failure. This document improves these procedures by providing a fast Designated Forwarder election upon recovery of the failed link or node associated with the multihomed Ethernet Segment. This document updates Section 2.1 of [RFC8584] by optionally introducing delays between some of the events therein.

The solution is independent of the number of EVPN Instances (EVIs) associated with that Ethernet Segment and it is performed via a simple signaling between the recovered node and each of the other nodes in the multihoming group.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 9 January 2025.

Table of Contents

1. Introduction

The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is becoming pervasive in data center (DC) applications for Network Virtualization Overlay (NVO) and DC interconnect (DCI) services, and in service provider (SP) applications for next generation virtual private LAN services.

[RFC7432] describes Designated Frowarder (DF) election procedures for multihomed Ethernet Segments. These procedures are enhanced further in [RFC8584] by applying the Highest Random Weight (HRW) algorithm for DF election in order to avoid unnecessary DF status changes upon a link or node failure associated with the multihomed Ethernet Segment. This document makes further improvements to the DF election procedures in [RFC8584] by providing an option for a fast DF election upon recovery of the failed link or node associated with the multihomed Ethernet Segment. This DF election is achieved independent of the number of EVPN Instances (EVIs) associated with that Ethernet Segment and it is performed via straightforward signaling between the recovered node and each of the other nodes in the multihomed group.
This document updates the DF Election Finite State Machine (FSM) described in Section 2.1 of [RFC8584], by optionally introducing delays between some events, as further detailed in Section 2.2. The solution is based on a simple one-way signaling mechanism.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

1.2. Terminology

PE:
Provider Edge device.
Designated Forwarder (DF):
A PE that is currently forwarding (encapsulating/decapsulating) traffic for a given VLAN in and out of a site.
EVI:
An EVPN instance spanning the Provider Edge (PE) devices participating in that EVPN.

1.3. Challenges with Existing Mechanism

In EVPN technology, multiple Provider Edge (PE) devices have the ability to encapsulate and decapsulate data belonging to the same VLAN. Under certain conditions, this may cause Layer2 duplicates and potential loops if there is a momentary overlap in forwarding roles between two or more PE devices, consequently leading to broadcast storms.

EVPN [RFC7432] currently specifies timer-based synchronization among PE devices within a redundancy group. This approach can lead to duplications and potential loops due to multiple Designated Forwarders (DFs) if the timer interval is too short, or to packet drops if the timer interval is too long.

Split-horizon filtering, as described in Section 8.3 of [RFC7432], can prevent loops but does not address duplicates. However, if there are overlapping Designated Forwarders (DFs) of two different sites simultaneously for the same VLAN, the site identifier will differ when the packet re-enters the Ethernet Segment. Consequently, the split-horizon check will fail, resulting in Layer 2 loops.

The updated Designated Forwarder (DF) procedures outlined in [RFC8584] use the well-known Highest Random Weight (HRW) algorithm to prevent the reshuffling of VLANs among PE devices within the redundancy group during failure or recovery events. This approach minimizes the impact on VLANs not assigned to the failed or recovered ports and eliminates the occurrence of loops or duplicates during such events.

However, upon PE insertion or a port being newly added to a multihomed Ethernet Segment, HRW also cannot help as a transfer of DF role to the new port must occur while the old DF is still active.

                                  +---------+
               +-------------+    |         |
               |             |    |         |
             / |    PE1      |----|         |   +-------------+
            /  |             |    |  MPLS/  |   |             |---CE3
           /   +-------------+    |  VxLAN/ |   |     PE3     |
      CE1 -                       |  Cloud  |   |             |
           \   +-------------+    |         |---|             |
            \  |             |    |         |   +-------------+
             \ |     PE2     |----|         |
               |             |    |         |
               +-------------+    |         |
                                  +---------+

Figure 1: CE1 multihomed to PE1 and PE2.

In Figure 1, when PE2 is inserted in the Ethernet Segment or its CE1-facing interface recovered, PE1 will transfer the DF role of some VLANs to PE2 to achieve load balancing. However, because there is no handshake mechanism between PE1 and PE2, overlapping of DF roles for a given VLAN is possible which leads to duplication of traffic as well as Layer 2 loops.

Current EVPN specifications [RFC7432] and [RFC8584] rely on a timer-based approach for transferring the DF role to the newly inserted device. This can cause the following issues:

  • Loops/Duplicates if the timer value is too short
  • Prolonged Traffic Blackholing if the timer value is too long

1.4. Design Principles for a Solution

The clock-synchronization solution for fast DF recovery presented in this document follows several design principles and presents multiples advantages, namely:

  • Complex handshake signaling mechanisms and state machines are avoided in favor of a simple uni-directional signaling approach.
  • The fast DF recovery solution maintains backwards-compatibility (see Section 4) by ensuring that PEs any unrecognized new BGP Extended Community.
  • Existing DF Election algorithms remain supported.
  • The fast DF recovery solution is independent of any BGP delays in propagation of Ethernet Segment routes (Route Type 4)
  • The fast DF recovery solution is agnostic of the actual time synchronization mechanism used, and normalizes to NTP for EVPN signalling only.

2. DF Election Synchronization Solution

The fast DF recovery solution relies on the concept of common clock alignment between partner PEs participating in a common Ethernet Segment i.e. PE1 and PE2 in Figure 1. The main idea is to have all peering PEs of that Ethernet Segment perform DF election, and apply the result at the same pre-announced time.

The DF Election procedure, as described in [RFC7432] and as optionally signalled in [RFC8584], is applied. All PEs attached to a given Ethernet Segment are clock-synchronized using a networking protocol for clock synchronization (e.g., NTP, PTP). When a new PE is inserted in an Ethernet Segment or a failed PE device of the Ethernet Segment recovers, that PE communicates to peering partners the current time plus the value of the timer for partner discovery from step 2 in Section 8.5 of [RFC7432]. This constitutes an "end time" or "absolute time" as seen from the local PE. That absolute time is called the "Service Carving Time" (SCT).

A new BGP Extended Community, the Service Carving Timestamp is advertised along with the Ethernet Segment route (RT-4) to communicate the Service Carving Time to other partners.

Upon receipt of the new BGP Extended Community, partner PEs can determine the service carving time of the newly insterted PE. To eliminate any potential for duplicate traffic or loops, the concept of skew is introduced: a small time offset to ensure a controlled and orderly transition when multiple Provider Edge (PE) devices are involved. The receiving partner PEs add a skew (default = -10ms) to the Service Carving Time to enforce this mechanism. The previously inserted PE(s) must perform service carving first, followed shortly by the newly insterted PE, after the specified skew delay.

To summarize, all peering PEs perform service carving almost simultaneously at the time announced by the newly added/recovered PE. The newly inserted PE initiates the SCT, and triggers service carving immediately on its local timer expiry. The previously inserted PE(s) receiving Ethernet Segment route (RT-4) with a SCT BGP extended community, perform service carving shortly before Service Carving Time.

2.1. BGP Encoding

A new BGP extended community is defined to communicate the Service Carving Timestamp for each Ethernet Segment.

A new transitive extended community where the Type field is 0x06, and the Sub-Type is 0x0F is advertised along with the Ethernet Segment route. The expected Service Carving Time is encoded as an 8-octet value as follows:

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~  Timestamp Seconds            | Timestamp Fractional Seconds  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2: Service Carving Time

The timestamp exchanged uses the NTP prime epoch of January 1, 1900 [RFC5905] and the 64-bit NTP Timestamp Format. The NTP Era value is not exchanged and Era 0 is assumed as of the writing of this document. A DF Election operation occurring exactly at the Era transition boundary some time in 2036 is outside of the scope of this document.
The 64-bit NTP Timestamp Format consists of a 32-bit part for Seconds and a 32-bit part for Fraction, which are encoded in the Service Carving Time as follows:

  • Timestamp Seconds: 32-bit NTP seconds are encoded in this field.
  • Timestamp Fractional Seconds: the high order 16 bits of the NTP 'Fraction' field are encoded in this field.

When rebuilding a 64-bit NTP Timestamp Format using the values from a received SCT BGP extended community, the lower order 16 bits of the Fractional field are set to 0. The use of a 16-bit fractional seconds yields adequate precision of 15 microseconds (2^-16 s).

This document introduces a new flag called "T" (for Time Synchronization) to the bitmap field of the DF Election Extended Community defined in [RFC8584].

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type = 0x06   | Sub-Type(0x06)| RSV |  DF Alg | |A| |T|       ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~     Bitmap    |            Reserved = 0                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 3: DF Election Extended Community
  • Bit 3: Time Synchronization (corresponds to Bit 27 of the DF Election Extended Community). When set to 1, it indicates the desire to use Time Synchronization capability with the rest of the PEs in the Ethernet Segment.

This capability is utilized in conjunction with the agreed-upon DF Election Type. For instance, if all the PE devices in the Ethernet Segment indicate possessing Time Synchronization capability and request the DF Election Type to be Highest Random Weight (HRW), then the HRW algorithm is edused in conjunction with this capability. A PE which does not support the procedures set out in this document, or receives a route from another PE in which th capability is not set MUST NOT delay Designated Forwarder election as this could lead to duplicate traffic in some instances (overlapping Designated Forwarders).

2.2. Updates to RFC8584

This document introduces an additional delay to the events and transitions defined for the default DF election algorithm FSM in Section 2.1 of [RFC8584] without changing the FSM state or event definitions themselves.

Upon receiving a RECV_ES message, the peering PE's Finite State Machine (FSM) transitions from the DF_DONE (indicating the DF election process was complete) state to the DF_CALC (indicating that a new DF calculation is needed) state . Due to the Service Carving Time (SCT) included in the Ethernet-Segment update, the completion of the DF_CALC state and the subsequent transition back to the DF_DONE state are delayed. This delay ensures proper synchronization and prevents conflicts. Consequently, the accompanying forwarding updates to the Designated Forwarder (DF) and Non-Designated Forwarder (NDF) states are also deferred.

The corresponding actions when transitions are performed or states are entered/exited are modified as follows:

  1. DF_CALC on CALCULATED: Mark the election result for the VLAN or VLAN Bundle.

    9.1
    If an SCT timestamp is present during the RECV_ES event of Action 11, wait until the time indicated by the SCT before proceeding to step 9.2.
    9.2
    Assume the role of DF or NDF for the local PE concerning the VLAN or VLAN Bundle, and transition to the DF_DONE state.

This revised approach ensures proper timing and synchronization in the DF election process, avoiding conflicts and ensuring accurate forwarding updates.

3. Synchronization Scenarios

Consider Figure 1 as an example, where initially PE2 has failed and PE1 has taken over. This scenario illustrates the problem with the DF-Election mechanism described in Section 8.5 of [RFC7432], specifically in the context of the timer value configured for all PEs on the Ethernet Segment.

Procedure based on Section 8.5 of [RFC7432] with the default 3 second timer in step 2:

  1. Initial state: PE1 is in a steady-state and PE2 is recovering
  2. Recovery: PE2 recovers at an absolute time of t=99.
  3. Advertisement: PE2 advertises RT-4, sent at t=100, to partner PE1.
  4. Timer Start: PE2 starts a 3 second timer to allow the reception of RT-4 from other PE nodes.
  5. Immediate carving: PE1 performs service carving immediately upon RT-4 reception, i.e. t=100 plus some BGP propagation delay.
  6. Delayed Carving: PE2 performs service carving at time t=103

[RFC7432] favors traffic drops over duplicate traffic. With the above procedure, traffic drops will occur as part of each PE recovery sequence since PE1 transitions some VLANs to Non-Designated Forwarder (NDF) immediately upon RT-4 reception.
The timer value (default = 3 seconds) directly affects the duration of the packet drops. A shorter (or zero) timer may result in duplicate traffic or traffic loops.

Procedure based on the Service Carving Time (SCT) approach:

  1. Initial state: PE1 is in a steady state, and PE2 is recovering
  2. Recovery: PE2 recovers at an absolute time of t=99.
  3. Advertisement: PE2 advertises RT-4, sent at t=100, with a target SCT value of t=103 to partner PE1.
  4. Timer Start: PE2 starts a 3 second timer to allow the reception of RT-4 from other PE nodes.
  5. Service Carving Timer: PE1 starts the service carving timer, with the remaining time until t=103
  6. Simultaneous Carving: Both PE1 and PE2 carve at an absolute time of t=103

To maintain the preference for minimal loss over duplicate traffic, PE1 should carve slightly before PE2 (with skew). The recovering PE2 performs both DF to NDF and NDF to DF transitions per VLAN at the timer's expiry. The original PE1, which received the SCT, applies the following:

This split-behavior ensures a smooth DF role transition with minimal loss.

Using the SCT approach, the negative effect of the timer to allow the reception of RT-4 from other PE nodes is mitigated. Furthermore, the BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to PE1) becomes a non-issue. The SCT approach shortens the 3-second timer window to the order of milliseconds.

3.1. Concurrent Recoveries

In the eventuality 2 or more PEs in a peering Ethernet Segment group are recovering concurrently or roughly the same time, each will advertise a Service Carving Timestamp. This SCT value would correspond to what each recovering PE considers the "end time" for DF Election. A similar situation arises in sequentially recovering PEs, when a second PE recovers approximately at the time of the first PE's advertised SCT expiry, and with its own new SCT-2 outside of the initial SCT window.

In the case of multiple concurrent DF elections, each initiated by one of the recovering PEs, the SCTs must be ordered chronologically. All PEs shall execute only a single DF Election at the service carving time corresponding to the largest (latest) received timestamp value. This DF Election will involve all active PEs in a unified DF Election update.

Example:

  1. Initial State: PE1 is in a steady state, with services elected at PE1.
  2. Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4 with a target SCT value of t=103 to its partners (PE1)
  3. Timer Initiation by PE2: PE2 starts a 3 second timer to allow the reception of RT-4 from other PE nodes.
  4. Timer Initiation by PE1: PE1 starts the service carving timer, with the remaining time until t=103.
  5. Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4 with a target SCT value of t=105 to its partners (PE1, PE2).
  6. Timer Initiation by PE3: PE3 starts a 3 second timer to allow the reception of RT-4 from other PE nodes
  7. Timer Update by PE2: PE2 cancels the running timer and starts the service carving timer with the remaining time until t=105.
  8. Timer Update by PE1: PE1 updates its service carving timer, with the remaining time until t=105.
  9. Service Carving: PE1, PE2, and PE3 perform service carving at the absolute time of t=105.

In the eventuality a PE in a Ethernet Segment group recovers during the discovery window specified in Section 8.5 of [RFC7432], and does not support or advertise the T-bit, then all PEs in the current peering sequence SHALL immediately revert to the default [RFC7432] behavior.

4. Backwards Compatibility

For the DF election procedures to achieve global convergence and unanimity within a redundancy group, it is essential that all participating PEs agree on the DF election algorithm to be employed. However, it is possible that some PEs may continue to use the existing modulo-based DF election algorithm from [RFC7432] and not utilize the new Service Carving Time (SCT) BGP extended community. PEs that operate using the baseline DF election mechanism will simply discard the new SCT BGP extended community as unrecognized. [RFC7432] and do not rely on the new SCT BGP extended community.

A PE can indicate its willingness to support clock-synchronized carving by signaling the new 'T' DF Election Capability and including the new SCT BGP extended community along with the Ethernet Segment Route (Type-4). If one or more PEs attached to the Ethernet Segment do not signal T=1, then all PEs in the Ethernet Segment SHALL revert to the timer-based approach as specified in [RFC7432]. This reversion is particularly crucial in preventing VLAN shuffling when more than two PEs are involved.

5. Security Considerations

The mechanisms in this document use EVPN control plane as defined in [RFC7432]. Security considerations described in [RFC7432] are equally applicable.

For the new SCT Extended Community, attack vectors may be setting the value to zero, to a value in the past or to large times in the future. The procedures in this document address implicitly what occurs with a carving time in the past, as this would be a naturally occurring event with a large BGP propagation delay: the receiving PE SHALL treat the DF Election at the peer as having occurred already, and proceed without starting any timer to futher delay service carving. For timestamp values in the future, a rogue PE may be advertising a value inconsistent with its local behavior. This is no different than a rogue PE setting all its DF Election results inconstently to its peers using (or ignoring adherence to) the procedures from [RFC7432], and the result would similarly be duplicate or dropped traffic. It is left to implementations to decide what consists an "unreasonably large" SCT value.

This document uses MPLS and IP-based tunnel technologies to support data plane transport. Security considerations described in [RFC7432] and in [RFC8365] are equally applicable.

6. IANA Considerations

IANA maintains the "EVPN Extended Community Sub-Types" registry set up by [RFC7153]. IANA is requested to confirm the First Come First Served assignment as follows:

   Sub-Type Value   Name                        Reference       Date
   --------------   -------------------------   -------------   ----
         0x0F       Service Carving Timestamp   This document   TBD

IANA should replace the field TBD with the date of publicaton of this document as an RFC.

IANA maintains the "DF Election Capabilities" registry set up by [RFC8584]. IANA is requested to make the following assignment from this registry:

    Bit         Name                         Reference        Date
    ----        ----------------             -------------    ----
    3           Time Synchronization         This document    TBD

IANA should replace the field TBD with the date of publicaton of this document as an RFC.

7. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, , <https://www.rfc-editor.org/info/rfc2119>.
[RFC5905]
Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, "Network Time Protocol Version 4: Protocol and Algorithms Specification", RFC 5905, DOI 10.17487/RFC5905, , <https://www.rfc-editor.org/info/rfc5905>.
[RFC7153]
Rosen, E. and Y. Rekhter, "IANA Registries for BGP Extended Communities", RFC 7153, DOI 10.17487/RFC7153, , <https://www.rfc-editor.org/info/rfc7153>.
[RFC7432]
Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, , <https://www.rfc-editor.org/info/rfc7432>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, , <https://www.rfc-editor.org/info/rfc8174>.
[RFC8365]
Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., Uttaro, J., and W. Henderickx, "A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, DOI 10.17487/RFC8365, , <https://www.rfc-editor.org/info/rfc8365>.
[RFC8584]
Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake, J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet VPN Designated Forwarder Election Extensibility", RFC 8584, DOI 10.17487/RFC8584, , <https://www.rfc-editor.org/info/rfc8584>.

Appendix A. Contributors

In addition to the authors listed on the front page, the following co-authors have also contributed substantially to this document:

Gaurav Badoni
Cisco

Email: gbadoni@cisco.com

Dhananjaya Rao
Cisco

Email: dhrao@cisco.com

Appendix B. Acknowledgements

Authors would like to acknowledge helpful comments and contributions of Satya Mohanty and Bharath Vasudevan. Also thank you to Anoop Ghanwani and Gunter van de Velde for their thorough review with valuable comments and corrections.

Authors' Addresses

Patrice Brissette (editor)
Cisco
Ali Sajassi
Cisco
Luc Andre Burdet
Cisco
John Drake
Independent
Jorge Rabadan
Nokia