Internet-Draft Fast Congestion Notification (FCN) in Wi July 2026
He, et al. Expires 4 January 2027
Workgroup:
RTGWG Working Group
Internet-Draft:
draft-he-rtgwg-wan-fcn-00
Published:
Intended Status:
Standards Track
Expires:
Authors:
X. He
China Telecom
K. Ruan
China Telecom
X. Min
ZTE Corp.
L. Deng
China Telecom

Fast Congestion Notification (FCN) in Wide Area Network (WAN) Interconnecting RoCEv2 Networks

Abstract

Wide Area Network (WAN), when interconnecting RoCEv2 networks, needs to meet the performance requirements of "high throughput, low latency, and minimal packet loss". This document describes a solution to Fast Congestion Notification (FCN) in WAN interconnecting RoCEv2 networks, especially applicable to tunnel encapsulation.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 4 January 2027.

1. Introduction

Remote Direct Memory Access (RDMA) is a method of accessing memory on a remote system without interrupting the processing of the Central Processing Unit (CPU) on that system. RDMA enables lower latency and higher throughput on the network and lower CPU utilization for the servers and storage systems. Currently, RoCEv2 (RDMA over Converged Ethernet version 2) [IBTA-Spec] is widely deployed in lossless networks in intelligent computing centers, providing packet-loss-free data transmission services for high-performance computing (HPC) and AI model training and inference scenarios.

With the rapid growth in demand for computing and storage resources from large AI models and distributed storage, intelligent computing centers are interconnected through wide area networks (WANs) to enable multi-DC collaboration, which compensates for the limited computing and storage resources of a single DC and improves resource utilization. The interconnection of Artificial Intelligence Data Centers (AIDCs) through WANs is becoming a new network structure gradually accepted by the industry, providing wide area lossless transmission for emerging application scenarios.

A WAN interconnecting RoCEv2 networks is required to deliver "high throughput, low latency, and near-zero packet loss". [I-D.ietf-rtgwg-net-notif-ps] describes the existing problems and the need for fast network notification solutions. RoCEv2 networks often implement a proactive congestion control mechanism based on Explicit Congestion Notification (ECN) [RFC3168]. The ECN-marked packets are routed to the destination (receiver). The receiver then alerts the source (sender) by sending Congestion Notification Packets (CNPs). After receiving a CNP, the sender immediately reduces its sending rate to mitigate congestion. This mechanism introduces a Round-Trip Time (RTT) delay, so the sender can be slow to take action. [I-D.xiao-rtgwg-rocev2-fast-cnp] defines a RoCEv2 Fast Congestion Notification Packet (Fast CNP), which can be sent by a congested network node to the traffic sender directly. [I-D.xiao-rtgwg-proxy-congestion-notification] introduces a proxy network node between the congested node and the traffic sender: the congested node sends the congestion notification to the proxy node, which translates it and resends the translated congestion notification to the traffic sender.

This document describes a solution to Fast Congestion Notification (FCN) in WAN interconnecting RoCEv2 networks, especially applicable to tunnel encapsulation.

2. Conventions

2.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2.2. Terminology

Abbreviations used in this document:

AI: Artificial Intelligence

AIDC: Artificial Intelligence Data Center

CNP: Congestion Notification Packet

DC: Data Center

ECN: Explicit Congestion Notification

FCN: Fast Congestion Notification

P: Provider

PE: Provider Edge

QP: Queue Pair

RDMA: Remote Direct Memory Access

RoCEv2: RDMA over Converged Ethernet version 2

SR-MPLS: Segment Routing over Multiprotocol Label Switching

SRH: Segment Routing Header

SRv6: Segment Routing over IPv6

VXLAN: Virtual Extensible Local Area Network

WAN: Wide Area Network

3. RoCEv2 Data Packet and CNP formats

RoCEv2 packets use a well-known UDP Destination Port number 4791 that unambiguously distinguishes them in a stateless manner. RoCEv2 data packet format is shown in Figure 1.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                        Ethernet Header                        ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                          IPv6 Header                          ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                           UDP Header                          ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~            InfiniBand Base Transport Header (12 Bytes)        ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                            Payload                            ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Invariant CRC                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              FCS                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 1: RoCEv2 Data Packet Format

Within the InfiniBand Base Transport Header, there is a 24-bit field called Destination Queue Pair (Destination QP), indicating the Work Queue Pair Number at the destination. A QP consists of a Send Work Queue and a Receive Work Queue. Send and receive queues are always created as a pair when the connection is established and remain that way throughout their lifetime. A Queue Pair is identified by its Queue Pair Number.

The Source QP, which indicates the Work Queue Pair at the source, is not contained in the InfiniBand Base Transport Header. This is because both the sender and the receiver know the binding relationship between the Source QP and the Destination QP.
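As an illustration of the header layout described above, the following minimal Python sketch extracts the Destination QP from a 12-byte Base Transport Header. The byte offsets follow our reading of the BTH layout in [IBTA-Spec] and should be checked against that specification; the names BTH and parse_bth are hypothetical and are not defined by this document.

```python
import struct
from typing import NamedTuple

ROCEV2_UDP_PORT = 4791  # well-known UDP destination port for RoCEv2
BTH_LEN = 12            # Base Transport Header length in bytes


class BTH(NamedTuple):
    """Subset of BTH fields relevant to congestion notification (hypothetical type)."""
    opcode: int
    pkey: int
    dest_qp: int
    psn: int


def parse_bth(bth: bytes) -> BTH:
    """Decode a 12-byte BTH, assuming the layout sketched in the comments below."""
    if len(bth) < BTH_LEN:
        raise ValueError("BTH must be at least 12 bytes")
    # Assumed layout: byte 0 OpCode; byte 1 SE/M/PadCnt/TVer; bytes 2-3 P_Key;
    # byte 4 reserved; bytes 5-7 Destination QP; byte 8 AckReq + reserved;
    # bytes 9-11 Packet Sequence Number.
    opcode, _flags, pkey, word1, word2 = struct.unpack("!BBHII", bth[:BTH_LEN])
    return BTH(opcode=opcode, pkey=pkey,
               dest_qp=word1 & 0xFFFFFF,  # low 24 bits of the second word
               psn=word2 & 0xFFFFFF)      # low 24 bits of the third word
```

Note that only the Destination QP can be recovered this way; the Source QP has to come from state held outside the packet, as explained above.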

RoCEv2 Congestion Notification Packet (CNP) format is shown in Figure 2.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                        Ethernet Header                        ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                          IPv4/6 Header                        ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                           UDP Header                          ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~      InfiniBand Base Transport Header (12 bytes)              ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                    Reserved (16 bytes)                        ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Invariant CRC                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              FCS                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 2: RoCEv2 Congestion Notification Packet Format

The RoCEv2 CNP is generated by the receiver after receiving a RoCEv2 data packet with the ECN bits set. The Destination QP of the RoCEv2 CNP is set to the Work Queue Pair Number at the sender, corresponding to the Source QP of the sender.

After the sender receives the RoCEv2 CNP, it reduces the rate at which it sends RoCEv2 data packets to the receiver. The congestion control algorithm used by the sender is outside the scope of this document.
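The CNP body in Figure 2 (BTH followed by 16 reserved bytes) can be sketched as follows. This is a non-normative illustration: the opcode value 0x81, the zero PSN, and the default P_Key are assumptions based on our understanding of [IBTA-Spec] and must be verified there; the function name build_cnp_body is hypothetical. The Ethernet, IP, and UDP headers, the Invariant CRC, and the FCS are omitted.

```python
import struct

CNP_OPCODE = 0x81      # assumed CNP opcode value; verify against [IBTA-Spec]
CNP_RESERVED_LEN = 16  # reserved bytes following the BTH (Figure 2)


def build_cnp_body(sender_qp: int, pkey: int = 0xFFFF) -> bytes:
    """Return BTH + 16 reserved bytes for a CNP addressed to the sender's QP."""
    if not 0 <= sender_qp < (1 << 24):
        raise ValueError("QP number must fit in 24 bits")
    # The Destination QP of the CNP is the sender's own (source) QP number.
    # The upper byte of the 32-bit word holding it is reserved and left at 0.
    bth = struct.pack("!BBHII", CNP_OPCODE, 0, pkey, sender_qp, 0)
    return bth + bytes(CNP_RESERVED_LEN)
```

The resulting body is 28 bytes long (12-byte BTH plus 16 reserved bytes).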

4. FCN in WANs

Typically, two AIDCs based on RoCEv2 networks are interconnected through a WAN, where the DC gateway in each AIDC is directly connected to the respective PE in the WAN. VPN tunnels (e.g., SR-MPLS, SRv6, VXLAN) are established between the ingress PE and the egress PE to carry massive RoCEv2 traffic between DCs, as shown in Figure 3.

             +------------------------------------------+
             |       |     FCN       |                  |
             |       |<--------------|                  |
 +--------+  |   +-------+     +-------+     +-------+  |   +--------+
 |DC1     |==|==>|  PE1  |====>|P1...Pn|====>|  PE2  |==|==>|DC2     |
 |Gateway |  |   |       |     |       |     |       |  |   |Gateway |
 +----^---+  |   +-------+     +-------+     +-------+  |   +--------+
      |      |      |            WAN                    |       |
      |      +------|-----------------------------------+       |
 +----------+       |                                     +-----v----+
 |  AIDC 1  |       |                                     |  AIDC 2  |
 |          |       | CNP                                 |          |
 +----^-----+       |                                     +----------+
      |             |                                           |
 +--------+         v                                       +---v----+
 | Sender |------------                                     |Receiver|
 +--------+                                                 +--------+
Figure 3: AIDCs Interconnected Through WANs

A Fast Congestion Notification (FCN) is generated by a congested node in the WAN rather than by the receiver. When a network node in the WAN encounters congestion, it is difficult for that node to send a congestion notification message to the sender directly, because the WAN and the AIDC belong to different routing domains. Instead, the congested node first sends Fast CNPs to the ingress PE. The ingress PE translates each received Fast CNP into the standard CNP format understood by the sender and then sends the translated CNP to the sender.

4.1. Technical requirements for WAN

While a session is being established between the sender and the receiver, the ingress PE needs to learn and maintain the mapping between the (source IP address, destination IP address) pair and the Work Queue Pair, including the source QP and the destination QP.
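One possible shape for the state kept by the ingress PE is sketched below in Python. This is only an illustration under stated assumptions: the names QpBinding and FlowTable are hypothetical, the table is keyed by (source IP, destination IP, destination QP) because a data packet carries only the Destination QP, and how the QP numbers are actually learned during session establishment is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class QpBinding:
    """Binding between an IP address pair and a Queue Pair (hypothetical type)."""
    src_ip: str
    dst_ip: str
    src_qp: int
    dst_qp: int


class FlowTable:
    """In-memory mapping maintained by the ingress PE (illustrative only)."""

    def __init__(self) -> None:
        self._bindings: Dict[Tuple[str, str, int], QpBinding] = {}

    def learn(self, src_ip: str, dst_ip: str, src_qp: int, dst_qp: int) -> None:
        # Record the binding observed while the session is being set up.
        key = (src_ip, dst_ip, dst_qp)
        self._bindings[key] = QpBinding(src_ip, dst_ip, src_qp, dst_qp)

    def lookup(self, src_ip: str, dst_ip: str, dst_qp: int) -> Optional[QpBinding]:
        # Recover the sender's source QP, which data packets do not carry.
        return self._bindings.get((src_ip, dst_ip, dst_qp))
```

With such a table, the ingress PE can recover the sender's source QP, which is the value placed in the Destination QP field of a CNP sent back to the sender.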

It is assumed that the sender supports congestion management and sends all RoCEv2 packets with the ECN field in the IP header set to "01" or "10".
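For reference, the ECN codepoints of [RFC3168] occupy the two least significant bits of the IPv4 TOS octet or the IPv6 Traffic Class octet. The following minimal Python sketch classifies them; the function names are hypothetical and the code only illustrates the codepoint values, not a requirement of this document.

```python
# ECN codepoints from RFC 3168 (two least significant bits of TOS / Traffic Class).
NOT_ECT = 0b00  # not ECN-capable transport
ECT_1 = 0b01    # ECN-capable transport, ECT(1)
ECT_0 = 0b10    # ECN-capable transport, ECT(0)
CE = 0b11       # congestion experienced


def ecn_field(traffic_class: int) -> int:
    """Return the 2-bit ECN field of a TOS or Traffic Class octet."""
    return traffic_class & 0b11


def is_ecn_capable(traffic_class: int) -> bool:
    """True when the sender marked the packet as ECT(0) or ECT(1)."""
    return ecn_field(traffic_class) in (ECT_0, ECT_1)


def is_congestion_experienced(traffic_class: int) -> bool:
    """True when a node on the path has set the CE codepoint."""
    return ecn_field(traffic_class) == CE
```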

When the ingress PE receives RoCEv2 packets originated by the sender from the AIDC, the technical requirements for the ingress PE are as follows.

When a network node in the WAN (except the ingress PE) encounters congestion, the technical requirements for the congested node are as follows.

When the ingress PE receives a Fast CNP, the technical requirements for the ingress PE are as follows.

On receiving the standard CNP, the sender identifies the source QP of the RoCEv2 flow causing congestion from the Destination QP field (i.e., the source QP of the sender) of the InfiniBand Base Transport Header in the standard CNP. It then applies congestion control, slowing down packet injection for that source QP.

4.2. Fast CNP Format

The congestion notification message sent from the congested node to the ingress PE is a UDP message formatted as shown in Figure 4.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        UDP Source Port        |  UDP Destination Port = TBD1  |
   +-------------------------------+-------------------------------+
   |           UDP Length          |          UDP Checksum         |
   +-------------------------------+-------------------------------+
   |         Flow Label (20 bits)          | CL  |       Rvd       |
   +---------------------------------------+-----+-----------------+
Figure 4: Fast CNP Format

The UDP header as specified in [RFC768] includes the UDP source port, UDP destination port, UDP length, and UDP checksum. A well-known UDP destination port (TBD1) needs to be allocated for this Fast CNP. The UDP Length is the length in octets of this datagram, including the UDP header and the data. The data field is 4 bytes, containing a 20-bit Flow Label field, a 3-bit CL (Congestion Level) field, and a 9-bit Rvd (Reserved) field. Thus the UDP Length of this Fast CNP is 12.

The 20-bit Flow Label field indicates the RoCEv2 packet flow causing congestion. The flow label value within the outer IPv6 header of the encapsulated RoCEv2 packet causing congestion is populated into this 20-bit Flow Label field. Optionally, this flow label value is also copied into the Flow Label field of the IPv6 header of this Fast CNP.

The 3-bit Congestion Level field indicates the congestion level. Value 1 represents the lowest congestion level and value 7 represents the highest congestion level. This field is optional; when not used, it can be set to 0.

The 9-bit Reserved field is for future use. It MUST be set to "0" on transmission and ignored on receipt.
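The 4-byte data field described above can be encoded and decoded as in the following minimal Python sketch. The function names are hypothetical; the bit positions follow Figure 4 (Flow Label in the 20 most significant bits, then CL, then Rvd), in network byte order.

```python
import struct
from typing import Tuple

UDP_HEADER_LEN = 8
FAST_CNP_DATA_LEN = 4
FAST_CNP_UDP_LENGTH = UDP_HEADER_LEN + FAST_CNP_DATA_LEN  # 12, as stated above


def pack_fast_cnp_data(flow_label: int, congestion_level: int = 0) -> bytes:
    """Encode the Fast CNP data field; the 9 reserved bits are sent as zero."""
    if not 0 <= flow_label < (1 << 20):
        raise ValueError("Flow Label must fit in 20 bits")
    if not 0 <= congestion_level < (1 << 3):
        raise ValueError("Congestion Level must fit in 3 bits")
    word = (flow_label << 12) | (congestion_level << 9)
    return struct.pack("!I", word)


def unpack_fast_cnp_data(data: bytes) -> Tuple[int, int]:
    """Decode (flow_label, congestion_level); reserved bits are ignored."""
    if len(data) != FAST_CNP_DATA_LEN:
        raise ValueError("Fast CNP data field must be 4 bytes")
    (word,) = struct.unpack("!I", data)
    return word >> 12, (word >> 9) & 0x7
```

For example, a Flow Label of 0xABCDE with Congestion Level 5 is encoded as the bytes AB CD EA 00.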

5. IANA Considerations

This document requests that a well-known UDP port number (TBD1) from the System Ports range of the "Service Name and Transport Protocol Port Number" registry [RFC6335] be assigned to Fast Congestion Notification. Specifically, IANA is requested to assign a UDP port as shown below, for which the Assignee and Contact are the IESG and the IETF Chair, respectively.

   +==============+========+===========+===============+===============+
   | Service Name | Port   | Transport | Description   | Reference     |
   |              | Number | Protocol  |               |               |
   +==============+========+===========+===============+===============+
   |    Fast      |        |           |   Receiver    |               |
   | Congestion   | TBD1   |   UDP     |   Port for    | This document |
   | Notification |        |           |   Fast CNP    |               |
   +--------------+--------+-----------+---------------+---------------+

6. Security Considerations

The Fast CNP MUST be applied in a specific controlled domain. A limited administrative domain provides the network administrator with the means to select, monitor, and control the access to the network, making it a trusted domain.

To avoid potential Denial-of-Service (DoS) attacks, it is RECOMMENDED that implementations apply rate-limiting policies when generating Fast CNPs.
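One common way to implement such a rate limit is a token bucket. The following Python sketch is only an example of that general technique; this document does not mandate a particular algorithm, and the class name and parameters (rate_pps, burst) are hypothetical.

```python
class TokenBucket:
    """Token-bucket limiter for Fast CNP generation (illustrative only)."""

    def __init__(self, rate_pps: float, burst: float) -> None:
        self.rate = rate_pps   # tokens (Fast CNPs) added per second
        self.burst = burst     # maximum number of stored tokens
        self.tokens = burst    # start with a full bucket
        self.last = None       # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a Fast CNP may be generated at time `now`."""
        if self.last is not None:
            # Refill in proportion to elapsed time, capped at the burst size.
            elapsed = now - self.last
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A congested node would call allow() with a monotonic timestamp (for example, from time.monotonic()) before generating each Fast CNP and suppress the notification when it returns False.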

A deployment MUST support the configuration option to enable or disable the Fast CNP feature defined in this document. By default, the Fast CNP feature MUST be disabled.

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8126]
Cotton, M., Leiba, B., and T. Narten, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 8126, DOI 10.17487/RFC8126, <https://www.rfc-editor.org/info/rfc8126>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.

7.2. Informative References

[I-D.ietf-rtgwg-net-notif-ps]
Dong, J., McBride, M., Clad, F., Zhang, Z. J., Zhu, Y., Xu, X., Zhuang, R., Pang, R., Lu, H., Liu, Y., Contreras, L. M., Mehmet, D., and R. Rahman, "Fast Network Notifications Problem Statement", Work in Progress, Internet-Draft, draft-ietf-rtgwg-net-notif-ps-02, <https://datatracker.ietf.org/doc/html/draft-ietf-rtgwg-net-notif-ps-02>.
[I-D.xiao-rtgwg-proxy-congestion-notification]
Min, X., Zhang, K., and Z. Hu, "Fast Congestion Notification Packet (CNP) with Proxy", Work in Progress, Internet-Draft, draft-xiao-rtgwg-proxy-congestion-notification-03, <https://datatracker.ietf.org/doc/html/draft-xiao-rtgwg-proxy-congestion-notification-03>.
[I-D.xiao-rtgwg-rocev2-fast-cnp]
Min, X., lihesong, Zhang, K., Cheng, W., Yang, J., and X. hexiaoming, "Fast Congestion Notification Packet (CNP) in RoCEv2 Networks", Work in Progress, Internet-Draft, draft-xiao-rtgwg-rocev2-fast-cnp-05, <https://datatracker.ietf.org/doc/html/draft-xiao-rtgwg-rocev2-fast-cnp-05>.
[IBTA-Spec]
InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1, Release 1.4", <https://www.infinibandta.org/ibta-specification/>.
[RFC3168]
Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, <https://www.rfc-editor.org/info/rfc3168>.

Authors' Addresses

Xiaoming He
China Telecom
Ke Ruan
China Telecom
Xiao Min
ZTE Corp.
Lijie Deng
China Telecom