| Internet-Draft | Fast Congestion Notification (FCN) in Wi | July 2026 |
| He, et al. | Expires 4 January 2027 | [Page] |
A Wide Area Network (WAN) that interconnects RoCEv2 networks needs to meet the performance requirements of "high throughput, low latency, and minimal packet loss". This document describes a Fast Congestion Notification (FCN) solution for WANs interconnecting RoCEv2 networks, which is especially applicable to tunnel encapsulation.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 4 January 2027.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Remote Direct Memory Access (RDMA) is a method of accessing memory on a remote system without interrupting the processing of the Central Processing Unit (CPU) on that system. RDMA enables lower latency and higher throughput on the network and lower CPU utilization for servers and storage systems. Currently, RoCEv2 (RDMA over Converged Ethernet version 2) [IBTA-Spec] is widely deployed in lossless networks in intelligent computing centers, providing packet-loss-free data transmission for high-performance computing (HPC) and AI model training and inference scenarios.¶
With the rapid growth in demand for computing and storage resources for large AI models and distributed storage, intelligent computing centers are interconnected through wide area networks (WANs). This enables multi-DC collaboration, which compensates for the limited computing and storage resources of a single DC and improves resource utilization. The interconnection of Artificial Intelligence Data Centers (AIDCs) through WANs is becoming a network structure gradually accepted by the industry, providing wide-area lossless transmission for emerging application scenarios.¶
A WAN that interconnects RoCEv2 networks is required to deliver "high throughput, low latency, and near-zero packet loss". [I-D.ietf-rtgwg-net-notif-ps] describes the existing problems and the need for fast network notification solutions. RoCEv2 networks often implement a proactive congestion control mechanism based on Explicit Congestion Notification (ECN) [RFC3168]. ECN-marked packets are forwarded to the destination (receiver), which then alerts the source (sender) by sending Congestion Notification Packets (CNPs). After receiving a CNP, the sender immediately reduces its sending rate to mitigate congestion. This mechanism introduces a Round-Trip Time (RTT) of delay, so the sender can be slow to take action. [I-D.xiao-rtgwg-rocev2-fast-cnp] defines a RoCEv2 Fast Congestion Notification Packet (Fast CNP), which a congested network node can send directly to the traffic sender. [I-D.xiao-rtgwg-proxy-congestion-notification] introduces a proxy network node between the congested node and the traffic sender: the congested node sends the congestion notification to the proxy node, which translates it and resends the translated congestion notification to the traffic sender.¶
This document describes a Fast Congestion Notification (FCN) solution for WANs interconnecting RoCEv2 networks, which is especially applicable to tunnel encapsulation.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Abbreviations used in this document:¶
AI: Artificial Intelligence¶
AIDC: Artificial Intelligence Data Center¶
CNP: Congestion Notification Packet¶
DC: Data Center¶
ECN: Explicit Congestion Notification¶
FCN: Fast Congestion Notification¶
P: Provider¶
PE: Provider Edge¶
QP: Queue Pair¶
RDMA: Remote Direct Memory Access¶
RoCEv2: RDMA over Converged Ethernet version 2¶
SR-MPLS: Segment Routing over Multiprotocol Label Switching¶
SRH: Segment Routing Header¶
SRv6: Segment Routing over IPv6¶
VXLAN: Virtual Extensible Local Area Network¶
WAN: Wide Area Network¶
RoCEv2 packets use the well-known UDP destination port number 4791, which allows them to be identified unambiguously and statelessly. The RoCEv2 data packet format is shown in Figure 1.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                        Ethernet Header                        ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          IPv6 Header                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          UDP Header                           ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~          InfiniBand Base Transport Header (12 Bytes)          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                            Payload                            ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Invariant CRC                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Within the InfiniBand Base Transport Header, there is a 24-bit field called Destination Queue Pair (Destination QP), indicating the Work Queue Pair Number at the destination. A QP consists of a Send Work Queue and a Receive Work Queue. Send and receive queues are always created as a pair when the connection is established and remain paired throughout their lifetime. A Queue Pair is identified by its Queue Pair Number.¶
The Source QP, which indicates the Work Queue Pair at the source, is not carried in the InfiniBand Base Transport Header. This is because both the sender and the receiver know the binding relationship between the Source QP and the Destination QP.¶
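As an illustration of the header parsing described above, the following minimal Python sketch decodes the Destination QP from a 12-byte Base Transport Header. The field layout used here (opcode, flags, partition key, a reserved octet followed by the 24-bit Destination QP, then an octet of flags followed by the 24-bit packet sequence number) reflects a common reading of [IBTA-Spec] and should be checked against that specification; the function name `parse_bth` is hypothetical.

```python
import struct

ROCEV2_UDP_PORT = 4791  # well-known RoCEv2 UDP destination port
BTH_LEN = 12            # Base Transport Header length in bytes


def parse_bth(bth: bytes) -> dict:
    """Decode selected fields of a 12-byte Base Transport Header.

    Layout assumed here (verify against the IBTA specification):
    opcode(8) flags(8) pkey(16) | resv(8) dest_qp(24) | ack/resv(8) psn(24)
    """
    if len(bth) < BTH_LEN:
        raise ValueError("BTH is 12 bytes long")
    opcode, _flags, pkey, word1, word2 = struct.unpack("!BBHII", bth[:BTH_LEN])
    return {
        "opcode": opcode,
        "partition_key": pkey,
        "dest_qp": word1 & 0xFFFFFF,  # low 24 bits of the second word
        "psn": word2 & 0xFFFFFF,      # low 24 bits of the third word
    }
```

Note that the Source QP cannot be recovered this way, which is why a node that needs it has to learn it by other means.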
The RoCEv2 Congestion Notification Packet (CNP) format is shown in Figure 2.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                        Ethernet Header                        ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                         IPv4/6 Header                         ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          UDP Header                           ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~          InfiniBand Base Transport Header (12 bytes)          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                      Reserved (16 bytes)                      ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Invariant CRC                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The RoCEv2 CNP is generated by the receiver after it receives a RoCEv2 data packet with ECN bits set. The Destination QP of the RoCEv2 CNP is set to the Work Queue Pair Number at the sender, that is, the Source QP of the sender.¶
After the sender receives the RoCEv2 CNP, it reduces the rate at which it sends RoCEv2 data packets to the receiver. The congestion control algorithm used by the sender is outside the scope of this document.¶
Typically, two AIDCs based on RoCEv2 networks are interconnected through a WAN, where the DC gateway in each AIDC is directly connected to the respective PE in the WAN. VPN tunnels (e.g., SR-MPLS, SRv6, VXLAN) are established between the ingress PE and the egress PE to carry massive RoCEv2 traffic between DCs, as shown in Figure 3.¶
+------------------------------------------+
| | FCN | |
| |<--------------| |
+--------+ | +-------+ +-------+ +-------+ | +--------+
|DC1 |==|==>| PE1 |====>|P1...Pn|====>| PE2 |==|==>|DC2 |
|Gateway | | | | | | | | | |Gateway |
+----^---+ | +-------+ +-------+ +-------+ | +--------+
| | | WAN | |
| +------|-----------------------------------+ |
+----------+ | +-----v----+
| AIDC 1 | | | AIDC 2 |
| | | CNP | |
+----^-----+ | +----------+
| | |
+--------+ v +---v----+
| Sender |------------ |Receiver|
+--------+ +--------+
Fast Congestion Notification (FCN) is generated by a congested node in the WAN rather than by the receiver. When a network node in the WAN encounters congestion, it is difficult for that node to send a congestion notification message directly to the sender, because the WAN and the AIDC belong to different routing domains. Instead, the congested node first sends a Fast Congestion Notification Packet (Fast CNP) to the ingress PE. The ingress PE translates the received Fast CNP into the standard CNP format understood by the sender and then sends the translated CNP to the sender.¶
While the session between the sender and the receiver is being established, the ingress PE needs to learn and maintain the relationship between the source and destination IP address pair and the Work Queue Pair, including the source QP and the destination QP.¶
It is assumed that the sender supports congestion management and sends all RoCEv2 packets with the ECN field in the IP header set to "01" or "10".¶
When the ingress PE receives RoCEv2 packets originated by the sender from the AIDC, the technical requirements for the ingress PE are as follows.¶
The ingress PE is REQUIRED to parse the RoCEv2 packet header, including the InfiniBand Base Transport Header. It extracts the source IP address, the destination IP address, and the Destination QP from the RoCEv2 packet header to obtain the packet flow information, and randomly assigns a locally unique Flow Label value to this RoCEv2 packet flow. At the same time, it is REQUIRED to dynamically maintain the mapping table between the RoCEv2 packet flow and the corresponding flow label: {source IP address, destination IP address, source QP, destination QP; Flow Label}. It should also maintain a timeout timer to detect when the flow terminates. If the timer expires and no RoCEv2 packet from this flow has been received, the ingress PE clears the mapping table entry for this flow and releases the corresponding memory.¶
The ingress PE is REQUIRED to encapsulate the RoCEv2 packet with an outer IPv6 header, with the source address being the IP address of the ingress PE and the destination address determined by the tunnel encapsulation method. For instance, for a VXLAN tunnel, the destination address is the IP address of the egress PE; for an SRv6 tunnel, the destination address is the first SID of the SRH. At the same time, the assigned flow label value MUST be populated into the Flow Label field of the outer IPv6 header, while the flow label value in the original RoCEv2 packet header (if it has an IPv6 header) remains unchanged. In addition, the ECN field within the outer IPv6 header MUST be set to the same value as the ECN field within the IP header of the original RoCEv2 packet. The ingress PE then sends the tunnel-encapsulated RoCEv2 packets through the WAN.¶
When a network node in the WAN (except the ingress PE) encounters congestion, the technical requirements for the congested node are as follows.¶
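The two ingress PE requirements above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a definitive implementation: the names `FlowTable` and `build_outer_ipv6`, the 30-second idle timeout, and the choice to skip flow label value 0 are all assumptions that do not come from this document. The source QP in the flow tuple is assumed to have been learned during session establishment.

```python
import ipaddress
import secrets
import struct
import time

MAX_LABEL = 0xFFFFF  # the IPv6 Flow Label is a 20-bit field


class FlowTable:
    """Maps (src IP, dst IP, src QP, dst QP) to a locally unique flow label."""

    def __init__(self, idle_timeout=30.0, clock=time.monotonic):
        self.idle_timeout = idle_timeout  # hypothetical default, in seconds
        self.clock = clock
        self.label_of = {}   # flow tuple -> flow label
        self.flow_of = {}    # flow label -> flow tuple
        self.last_seen = {}  # flow label -> time of the last packet

    def label_for(self, flow):
        """Return the flow's label, assigning a random unused one if new."""
        label = self.label_of.get(flow)
        if label is None:
            if len(self.flow_of) >= MAX_LABEL:
                raise RuntimeError("flow label space exhausted")
            # Assumption: value 0 is skipped so it can mean "unlabeled".
            label = secrets.randbelow(MAX_LABEL) + 1
            while label in self.flow_of:
                label = secrets.randbelow(MAX_LABEL) + 1
            self.label_of[flow] = label
            self.flow_of[label] = flow
        self.last_seen[label] = self.clock()  # restart the idle timer
        return label

    def expire(self):
        """Remove flows idle for at least idle_timeout; return the count."""
        now = self.clock()
        stale = [label for label, seen in self.last_seen.items()
                 if now - seen >= self.idle_timeout]
        for label in stale:
            del self.label_of[self.flow_of.pop(label)]
            del self.last_seen[label]
        return len(stale)


def build_outer_ipv6(src, dst, flow_label, inner_ecn, payload_len,
                     next_header, hop_limit=64):
    """Build a 40-byte outer IPv6 header carrying the label and inner ECN."""
    if not 0 <= flow_label <= MAX_LABEL:
        raise ValueError("flow label must fit in 20 bits")
    if not 0 <= inner_ecn <= 0b11:
        raise ValueError("ECN is a 2-bit field")
    traffic_class = inner_ecn  # DSCP left at 0 in this sketch
    word0 = (6 << 28) | (traffic_class << 20) | flow_label
    return (struct.pack("!IHBB", word0, payload_len, next_header, hop_limit)
            + ipaddress.IPv6Address(src).packed
            + ipaddress.IPv6Address(dst).packed)
```

A caller would invoke `label_for()` for every RoCEv2 packet received from the AIDC, pass the result to `build_outer_ipv6()` together with the ECN value copied from the inner header, and run `expire()` periodically.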
The congested node determines whether the sender supports congestion management, based on the ECN field of the outer IPv6 header. If congestion management is supported, then it is REQUIRED to extract the Flow Label field of the outer IPv6 header of the encapsulated RoCEv2 packet causing congestion, and generate a Fast Congestion Notification Packet (Fast CNP), with the source address being the IP address of the congested node and the destination address being the IP address of the ingress PE (i.e., the source address of the encapsulated packet). Then it sends this Fast CNP to the ingress PE. Generally, the frequency of sending Fast CNP depends on the degree of congestion, and the more severe the congestion, the more frequently Fast CNP is sent. How often Fast CNP is sent is outside the scope of this document.¶
The above Fast CNP MUST carry the flow label information of the RoCEv2 flow that caused congestion, as well as the optional congestion level. The Fast CNP format is defined in Section 4.2.¶
If the ingress PE itself encounters congestion, it directly sends a standard CNP to the sender.¶
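The congested-node checks described above can be illustrated with the following Python sketch, which inspects the outer IPv6 header of the packet that caused congestion. The function name `fast_cnp_target` is hypothetical. The sketch treats every ECN value other than "00" as a sign that the sender supports congestion management; handling "11" this way is an assumption, since this document names only "01" and "10".

```python
import ipaddress
import struct

ECN_NOT_ECT = 0b00  # sender does not support ECN


def fast_cnp_target(outer_ipv6: bytes):
    """Return (ingress PE address, flow label) for a Fast CNP, or None.

    None means the ECN field is "00", so no Fast CNP is generated.
    """
    if len(outer_ipv6) < 40:
        raise ValueError("an IPv6 header is 40 bytes long")
    word0 = struct.unpack("!I", outer_ipv6[:4])[0]
    if word0 >> 28 != 6:
        raise ValueError("not an IPv6 header")
    ecn = (word0 >> 20) & 0b11  # low two bits of the Traffic Class
    if ecn == ECN_NOT_ECT:
        return None
    flow_label = word0 & 0xFFFFF
    # The outer source address (bytes 8..23) is the ingress PE.
    ingress_pe = ipaddress.IPv6Address(outer_ipv6[8:24])
    return ingress_pe, flow_label
```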
When the ingress PE receives a Fast CNP, the technical requirements for the ingress PE are as follows.¶
The ingress PE is REQUIRED to extract the Flow Label, as well as the optional congestion level information, from the Fast CNP. Using the dynamically maintained mapping table between RoCEv2 packet flows and flow labels, it obtains the corresponding RoCEv2 packet flow information and generates the standard CNP for the RoCEv2 network. The source address of the CNP is the IP address of the ingress PE, and the destination address is the source IP address of the original RoCEv2 packet, obtained from the mapping table entry for the flow label. At the same time, the source QP in the mapping table is populated into the Destination QP field of the InfiniBand Base Transport Header in the standard CNP. The ingress PE then sends the standard CNP to the traffic sender.¶
On receiving the standard CNP, the sender can determine the source QP of the RoCEv2 flow causing congestion, based on the destination QP (i.e., the source QP of the sender) field of the InfiniBand Base Transport Header in the standard CNP, and then implement congestion control, slowing down the packet injection for the source QP of the RoCEv2 flow causing congestion.¶
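The translation step at the ingress PE can be sketched as below. This is a hedged illustration: the CNP opcode value 0x81, the zeroed flag and PSN fields, and the default partition key 0xFFFF are assumptions based on a common reading of [IBTA-Spec] and need to be verified there. The Invariant CRC is not computed, and the names `build_cnp_body` and `translate_fast_cnp` are hypothetical.

```python
import struct

ROCEV2_UDP_PORT = 4791
CNP_OPCODE = 0x81  # assumed CNP opcode; verify against the IBTA specification


def build_cnp_body(sender_src_qp: int, pkey: int = 0xFFFF) -> bytes:
    """Return the BTH plus 16 reserved bytes of a CNP (ICRC not included)."""
    if not 0 <= sender_src_qp <= 0xFFFFFF:
        raise ValueError("QP numbers are 24 bits wide")
    # The sender's source QP goes into the Destination QP field.
    bth = struct.pack("!BBHII", CNP_OPCODE, 0, pkey, sender_src_qp, 0)
    return bth + bytes(16)


def translate_fast_cnp(flow_label, mapping, ingress_pe_ip):
    """Map a Fast CNP flow label to the fields of a standard CNP.

    mapping: dict of flow label -> (src_ip, dst_ip, src_qp, dst_qp)
    Returns None when the label is unknown, e.g. after the flow timed out.
    """
    flow = mapping.get(flow_label)
    if flow is None:
        return None
    src_ip, _dst_ip, src_qp, _dst_qp = flow
    return {
        "ip_src": ingress_pe_ip,  # the CNP is sourced by the ingress PE
        "ip_dst": src_ip,         # and addressed to the original sender
        "udp_dst_port": ROCEV2_UDP_PORT,
        "body": build_cnp_body(src_qp),
    }
```

Returning `None` for an unknown label reflects one possible design choice: a Fast CNP that refers to an expired mapping entry is silently dropped.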
The congestion notification message sent from the congested node to the ingress PE is a UDP message formatted as shown in Figure 4.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        UDP Source Port        |  UDP Destination Port = TBD1  |
+-------------------------------+-------------------------------+
|          UDP Length           |         UDP Checksum          |
+-------------------------------+-------------------------------+
|          Flow Label (20 bit)          |  CL |       Rvd       |
+---------------------------------------------------------------+
The UDP header, as specified in [RFC768], includes the UDP source port, UDP destination port, UDP length, and UDP checksum. A well-known UDP destination port (TBD1) needs to be allocated for this Fast CNP. UDP Length is the length in octets of this datagram, including the UDP header and the data. The data field is 4 bytes, containing a 20-bit Flow Label, a 3-bit CL (Congestion Level) field, and a 9-bit Rvd (Reserved) field. Thus, the UDP Length of this Fast CNP is 12 (8 bytes of UDP header plus 4 bytes of data).¶
The 20-bit Flow Label field indicates the RoCEv2 packet flow causing congestion. The flow label value within the outer IPv6 header of the encapsulated RoCEv2 packet causing congestion is populated into this 20-bit Flow Label field. Optionally, this flow label value is also copied into the Flow Label field of the IPv6 header of this Fast CNP.¶
The 3-bit Congestion Level field indicates the congestion level. Value 1 represents the lowest congestion level and value 7 represents the highest congestion level. This field is optional; when not used, it can be set to 0.¶
The 9-bit Reserved field is for future use. It MUST be set to "0" on transmission and ignored on receipt.¶
IANA is requested to assign a well-known UDP port number (TBD1) from the System Ports range of the "Service Name and Transport Protocol Port Number" registry [RFC6335] to Fast Congestion Notification. Specifically, IANA is requested to assign a UDP port as shown below, for which the Assignee and Contact are the IESG and the IETF Chair, respectively.¶
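The field layout described above can be made concrete with a short Python sketch that packs and unpacks the 4-byte data field. The function names are hypothetical, and the bit positions follow the left-to-right order of Figure 4, with the Flow Label in the most significant 20 bits. The UDP destination port is not shown because TBD1 has not been assigned.

```python
import struct

UDP_HEADER_LEN = 8
FAST_CNP_DATA_LEN = 4
FAST_CNP_UDP_LENGTH = UDP_HEADER_LEN + FAST_CNP_DATA_LEN  # value of UDP Length


def pack_fast_cnp_data(flow_label: int, congestion_level: int = 0) -> bytes:
    """Encode Flow Label (20 bits), CL (3 bits), and Rvd (9 bits, zero)."""
    if not 0 <= flow_label <= 0xFFFFF:
        raise ValueError("flow label must fit in 20 bits")
    if not 0 <= congestion_level <= 7:
        raise ValueError("congestion level must fit in 3 bits")
    word = (flow_label << 12) | (congestion_level << 9)  # Rvd stays 0
    return struct.pack("!I", word)


def unpack_fast_cnp_data(data: bytes):
    """Decode (flow_label, congestion_level); reserved bits are ignored."""
    if len(data) != FAST_CNP_DATA_LEN:
        raise ValueError("Fast CNP data field is 4 bytes long")
    (word,) = struct.unpack("!I", data)
    return word >> 12, (word >> 9) & 0x7
```

For example, flow label 0xABCDE with congestion level 5 is encoded as the four bytes `ab cd ea 00`.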
+==============+========+===========+===============+===============+
| Service Name | Port   | Transport | Description   | Reference     |
|              | Number | Protocol  |               |               |
+==============+========+===========+===============+===============+
| Fast         |        |           | Receiver      |               |
| Congestion   | TBD1   | UDP       | Port for      | This document |
| Notification |        |           | Fast CNP      |               |
+--------------+--------+-----------+---------------+---------------+
The Fast CNP MUST be applied in a specific controlled domain. A limited administrative domain provides the network administrator with the means to select, monitor, and control the access to the network, making it a trusted domain.¶
To avoid potential Denial-of-Service (DoS) attacks, it is RECOMMENDED that implementations apply rate-limiting policies when generating Fast CNPs.¶
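This document does not prescribe a particular rate-limiting policy. As one possible illustration, the Python sketch below uses a token bucket to cap how many Fast CNPs a node generates; the class name `TokenBucket` and the `rate` and `burst` parameters are assumptions chosen for this example.

```python
import time


class TokenBucket:
    """Allows at most `burst` Fast CNPs at once and `rate` per second."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.burst = burst        # bucket capacity
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if a Fast CNP may be generated now."""
        now = self.clock()
        # Refill according to the elapsed time, capped at the capacity.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A node could keep one bucket per ingress PE or per flow label and skip generating a Fast CNP whenever `allow()` returns `False`.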
A deployment MUST support the configuration option to enable or disable the Fast CNP feature defined in this document. By default, the Fast CNP feature MUST be disabled.¶