| Internet-Draft | Long-haul CNP | February 2026 |
| Tian, et al. | Expires 31 August 2026 | [Page] |
This document specifies a multi-level congestion response framework and an associated Long-haul Congestion Notification Packet (Long-haul CNP) for Data Center Interconnect (DCI) wide-area network scenarios. The framework defines a graduated congestion response mechanism: lightweight ECN marking for incipient congestion and device-originated Long-haul CNP for severe or rapidly worsening congestion. Long-haul CNP packets carry explicit control instructions (e.g., rate reduction percentage, pause duration) and are sent directly by congestion-aware intermediate nodes to the traffic source via unicast, reducing feedback latency compared to receiver-mediated congestion notification. The document also specifies a multi-device collaborative suppression mechanism and BDP-adaptive dynamic threshold calculation for long-haul links. Two packet encapsulation formats are defined: an ICMPv6 extension and a RoCEv2 backward-compatible extension.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 31 August 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
RDMA over Converged Ethernet v2 (RoCEv2) is widely deployed in data center networks for high-performance computing and AI training workloads. Within a single data center, congestion control mechanisms such as DCQCN [DCQCN] and ECN-based schemes provide effective flow control. However, when RoCEv2 traffic traverses Data Center Interconnect (DCI) wide-area networks, the existing congestion notification path ("switch marks ECN, receiver generates CNP, CNP returns to sender") introduces feedback latency proportional to the WAN round-trip time, which can reach tens of milliseconds.¶
Recent work on Fast CNP [I-D.xiao-rtgwg-rocev2-fast-cnp] has addressed the fundamental latency issue by enabling switches to generate CNP packets directly to the sender. This document builds upon that foundation by specifying three complementary mechanisms that are particularly relevant for long-haul DCI scenarios:¶
Additionally, this document defines an extended packet format (Long-haul CNP) that carries explicit control instructions with quantified congestion metrics, enabling the source to perform precise rate adjustments rather than relying on generic rate reduction heuristics.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The mechanisms specified in this document are primarily designed for Data Center Interconnect (DCI) scenarios where RoCEv2 traffic traverses wide-area network paths with non-trivial propagation delay (typically RTT greater than 1 ms). In such environments, the receiver-mediated CNP feedback path introduces significant latency, and the BDP-adaptive threshold mechanism provides meaningful dynamic range.¶
For intra-data-center deployments where RTT is sub-millisecond and paths traverse few hops, the standard ECN/CNP or Fast CNP mechanisms are typically sufficient, and the additional complexity of the multi-level framework may not be warranted.¶
The multi-device collaborative suppression mechanism is most beneficial when data flows traverse two or more congestion-aware intermediate nodes, which is common in multi-hop DCI topologies.¶
Regarding IP version applicability: the ICMPv6 packet format defined in Section 6.1 is applicable only to IPv6 network deployments. For DCI environments that operate over IPv4, implementations MUST use the RoCEv2 backward-compatible extension format defined in Section 6.2. A future document may define an ICMPv4-based format if there is sufficient demand for ICMP-based Long-haul CNP in IPv4-only deployments.¶
The following diagram illustrates a typical DCI topology where this mechanism operates:¶
+------+     +------+                    +------+     +------+
|Source|---->| Node |======WAN Path=====>| Node |---->| Dest |
| NIC  |     |  N1  |                    |  N2  |     | NIC  |
+------+     +------+                    +------+     +------+
   ^            |                           |
   |            | Long-haul CNP             |
   +------------+ (unicast to source)       |
   |                                        |
   +----------------------------------------+
     Long-haul CNP (if N1 control insufficient)
The mechanism operates as follows:¶
A congestion-aware intermediate node MUST parse the Base Transport Header (BTH) of traversing RoCEv2 data packets and extract the following flow identification information: Source IP Address, Destination IP Address, Source QP Number, and Destination QP Number.¶
The node SHOULD maintain a flow table with one entry per unique four-tuple (Source IP, Destination IP, Source QP, Destination QP). Flow table entries MUST be updated upon each matching packet observation. Entries MAY be subject to an aging timer; when no matching packets are observed within the configured aging period, the entry SHOULD be removed.¶
The flow table is used to construct Long-haul CNP packets with the correct addressing information when congestion is detected.¶
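As an illustration only, the flow table behaviour described above could be sketched as follows. The class and method names (`FlowTable`, `observe`, `expire`) and the per-entry packet counter are hypothetical and are not defined by this document; BTH parsing is assumed to have already produced the four-tuple.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# (source IP, destination IP, source QP, destination QP)
FlowKey = Tuple[str, str, int, int]


@dataclass
class FlowEntry:
    last_seen: float  # time of the most recent matching packet
    packets: int = 0  # packets observed for this flow (illustrative only)


class FlowTable:
    """Hypothetical flow table with one entry per unique four-tuple."""

    def __init__(self, aging_period_s: float) -> None:
        self.aging_period_s = aging_period_s
        self.entries: Dict[FlowKey, FlowEntry] = {}

    def observe(self, key: FlowKey, now: float) -> FlowEntry:
        """Create or refresh the entry for a packet seen at time `now`."""
        entry = self.entries.setdefault(key, FlowEntry(last_seen=now))
        entry.last_seen = now
        entry.packets += 1
        return entry

    def expire(self, now: float) -> int:
        """Remove entries idle for longer than the aging period."""
        stale = [k for k, e in self.entries.items()
                 if now - e.last_seen > self.aging_period_s]
        for k in stale:
            del self.entries[k]
        return len(stale)
```

A node would call `observe()` for every traversing RoCEv2 data packet and `expire()` from a periodic timer. Whether aging is enabled at all is a deployment choice, since the aging timer is optional.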
Congestion-aware intermediate nodes MUST continuously monitor the following metrics on each egress port:¶
The node SHOULD also maintain an estimate of the link round-trip time (RTT_est) for each egress port, which MAY be obtained through control plane configuration, active probing, or receiver reporting.¶
To account for the wide variation in link characteristics across DCI paths, the upper queue depth threshold K_max MUST be dynamically calculated based on the Bandwidth-Delay Product (BDP) of the link:¶
K_max = max(K_base, alpha * R_port * RTT_est / 8)¶
The minimum threshold K_min, used for first-level ECN marking, MUST be configured to a value less than K_max. A RECOMMENDED default is K_min = K_max / 2.¶
Implementations SHOULD recalculate K_max periodically or upon RTT_est changes, to adapt to evolving link conditions.¶
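The threshold calculation can be illustrated with a short sketch. It assumes that R_port is expressed in bits per second and RTT_est in seconds, so that the division by 8 yields a threshold in bytes, and that alpha is a dimensionless scaling factor. These unit choices and the function names are assumptions made for illustration.

```python
def k_max_bytes(k_base_bytes: float, alpha: float,
                r_port_bps: float, rtt_est_s: float) -> float:
    """K_max = max(K_base, alpha * R_port * RTT_est / 8), in bytes."""
    bdp_bytes = r_port_bps * rtt_est_s / 8  # bandwidth-delay product
    return max(k_base_bytes, alpha * bdp_bytes)


def k_min_bytes(k_max: float) -> float:
    """Recommended default: K_min = K_max / 2."""
    return k_max / 2
```

For example, with a 100 Gbit/s port, a 10 ms RTT, alpha = 0.5 and K_base = 1 MB, the BDP is about 125 MB and K_max is about 62.5 MB. With a 10 microsecond RTT on the same port, the BDP term falls below K_base, so K_max stays at K_base.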
This document defines two levels of congestion response:¶
When a second-level trigger condition is met, the congestion-aware intermediate node MUST perform the following procedure:¶
The node MUST rate-limit Long-haul CNP generation to prevent excessive control traffic. The RECOMMENDED minimum interval between consecutive Long-haul CNP packets for the same flow is one estimated RTT (RTT_est).¶
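One possible way to enforce the per-flow minimum interval is sketched below. The `CnpRateLimiter` class is hypothetical, and the sketch assumes the recommended interval of one RTT_est is used unchanged.

```python
from typing import Dict, Hashable


class CnpRateLimiter:
    """Hypothetical per-flow limiter for Long-haul CNP generation."""

    def __init__(self, rtt_est_s: float) -> None:
        # Recommended minimum spacing: one estimated RTT per flow.
        self.min_interval_s = rtt_est_s
        self._last_sent: Dict[Hashable, float] = {}

    def allow(self, flow_key: Hashable, now: float) -> bool:
        """Return True if a Long-haul CNP may be sent for this flow now."""
        last = self._last_sent.get(flow_key)
        if last is not None and now - last < self.min_interval_s:
            return False  # too soon after the previous CNP for this flow
        self._last_sent[flow_key] = now
        return True
```

The limiter is keyed per flow, so a suppressed notification for one flow does not delay notifications for other flows sharing the same congested port.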
The Action Flags field in the Long-haul CNP packet encodes the specific control action requested of the source. The following guidelines apply:¶
Upon receiving a Long-haul CNP packet, the source MUST:¶
The source SHOULD maintain a per-QP timer. If no new Long-haul CNP packet is received for the same QP within a configurable recovery interval (RECOMMENDED: 2 * RTT_est), the source SHOULD gradually increase its sending rate using an additive-increase algorithm until its normal rate is restored or a new Long-haul CNP is received.¶
Upon receiving a Resume action, the source SHOULD increase its sending rate by the percentage indicated in the Parameter field. The source MAY combine the timer-based recovery mechanism with explicit Resume actions: when a Resume is received, the source applies the indicated rate increase immediately rather than waiting for the recovery timer.¶
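The interaction between rate reduction, timer-based recovery, and explicit Resume can be sketched as below. This is a minimal illustration under several assumptions that this document does not fix: percentages are applied to the current rate, the additive step `ai_step_bps` is a hypothetical configuration value, and a Resume does not restart the recovery timer.

```python
from dataclasses import dataclass


@dataclass
class QpRateState:
    """Hypothetical per-QP sender state."""
    normal_rate_bps: float  # rate used when no congestion is signalled
    rate_bps: float         # current sending rate
    rtt_est_s: float        # estimated RTT for the path
    ai_step_bps: float      # additive-increase step (assumed parameter)
    last_cnp_time: float = float("-inf")

    def on_rate_reduction(self, percent: float, now: float) -> None:
        """Apply a rate reduction carried in a Long-haul CNP."""
        self.rate_bps *= 1 - percent / 100
        self.last_cnp_time = now

    def on_resume(self, percent: float) -> None:
        """Apply a Resume immediately, capped at the normal rate."""
        self.rate_bps = min(self.normal_rate_bps,
                            self.rate_bps * (1 + percent / 100))

    def on_timer(self, now: float) -> None:
        """Additive increase once 2 * RTT_est passes without a new CNP."""
        if now - self.last_cnp_time >= 2 * self.rtt_est_s:
            self.rate_bps = min(self.normal_rate_bps,
                                self.rate_bps + self.ai_step_bps)
```

In this sketch `on_timer()` is expected to be called periodically. Each call after the recovery interval adds one step, and the rate never exceeds the normal rate.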
If a source receives a Long-haul CNP but does not support the Long-haul CNP format, it MUST silently discard the packet (ICMPv6 format) or process it as a standard CNP (RoCEv2 format). This ensures backward compatibility with sources that only support standard CNP. See Section 8 for details.¶
When a data flow traverses multiple congestion-aware intermediate nodes, uncoordinated Long-haul CNP generation can result in duplicate or conflicting control instructions reaching the source. This section specifies coordination rules to mitigate this issue.¶
A downstream congestion-aware node that detects congestion for a given flow SHOULD check whether a Long-haul CNP for the same flow has recently been generated by an upstream node. This can be inferred by observing a reduction in the flow's arrival rate within a configurable observation window (RECOMMENDED: 1 * RTT_est). If such a reduction is observed:¶
Congestion-aware intermediate nodes MAY dynamically adjust the threshold parameters (K_min, K_max, V_ecn, V_growth) and rate-limiting intervals based on observed traffic characteristics such as average queue occupancy, traffic burstiness patterns, and link utilization history. The specific algorithms for such adjustment are implementation-dependent and outside the scope of this document.¶
This document defines two Long-haul CNP packet formats. Implementations MUST support at least one format and SHOULD indicate the supported format(s) through out-of-band configuration or capability exchange.¶
This format encapsulates the Long-haul CNP as a new ICMPv6 informational message type [RFC4443]. Because the Long-haul CNP is a new ICMPv6 message type with a fully defined fixed-length body (no variable-length "original datagram" field), the length ambiguity problem addressed by [RFC4884] does not apply. However, the Optional Extension Objects defined in this section adopt the extension structure format from [RFC4884] (Extension Header with Version and Checksum, and Extension Objects with Class-Num, C-Type, Length) for consistency with IETF ICMP extension conventions and to enable reuse of existing ICMP extension parsing implementations.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |           Checksum            |
+---------------+---------------+-------------------------------+
|  Cong. Level  | Action Flags  |           Parameter           |
+---------------+---------------+-------------------------------+
|                       Source QP Number                        |
+---------------------------------------------------------------+
|  Metric Type  |       Congestion Metric Value (24 bits)       |
+---------------+-----------------------------------------------+
|            Extension Header (optional, see below)             |
+---------------------------------------------------------------+
|           Extension Object(s) (optional, variable)            |
~                                                               ~
+---------------------------------------------------------------+
The fixed-length body of the Long-haul CNP ICMPv6 message is 12 octets (3 x 32-bit words), comprising the fields from Congestion Level through Congestion Metric Value. This fixed length is known to all implementations, so the boundary between the fixed body and any extension structure is unambiguous.¶
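For illustration, the 4-octet ICMPv6 header and the 12-octet fixed body can be serialized with Python's `struct` module as sketched below. The Type value is a placeholder because TBD1 has not been assigned; 200 is used here only because [RFC4443] reserves it for private experimentation. The checksum is left as a caller-supplied value, since a real sender computes it over the IPv6 pseudo-header as required by [RFC4443]. The function names are hypothetical.

```python
import struct

# Placeholder only: the real Type is TBD1, pending IANA assignment.
LH_CNP_TYPE_PLACEHOLDER = 200
LH_CNP_CODE_FLOW_LEVEL = 0  # Code 0: Flow-level Long-haul CNP

# Type, Code, Checksum | Cong. Level, Action Flags, Parameter |
# Source QP Number | Metric Type (8 bits) + Metric Value (24 bits)
_LAYOUT = struct.Struct("!BBHBBHII")


def pack_lh_cnp(level: int, flags: int, parameter: int, src_qp: int,
                metric_type: int, metric_value: int,
                checksum: int = 0) -> bytes:
    """Build the ICMPv6 header plus fixed body (16 octets in total)."""
    if not 0 <= metric_type <= 0xFF:
        raise ValueError("Metric Type must fit in 8 bits")
    if not 0 <= metric_value <= 0xFFFFFF:
        raise ValueError("Congestion Metric Value must fit in 24 bits")
    metric_word = (metric_type << 24) | metric_value
    return _LAYOUT.pack(LH_CNP_TYPE_PLACEHOLDER, LH_CNP_CODE_FLOW_LEVEL,
                        checksum, level, flags, parameter, src_qp,
                        metric_word)


def unpack_lh_cnp(message: bytes) -> dict:
    """Decode the header and fixed body; trailing extensions are ignored."""
    if len(message) < _LAYOUT.size:
        raise ValueError("message shorter than header plus fixed body")
    (msg_type, code, checksum, level, flags, parameter, src_qp,
     metric_word) = _LAYOUT.unpack_from(message)
    return {
        "type": msg_type, "code": code, "checksum": checksum,
        "level": level, "flags": flags, "parameter": parameter,
        "src_qp": src_qp,
        "metric_type": metric_word >> 24,
        "metric_value": metric_word & 0xFFFFFF,
    }
```

Network byte order (`!`) is used throughout, and the 8-bit Metric Type and 24-bit Congestion Metric Value share one 32-bit word, matching the layout in the figure.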
Zero or more extension objects MAY follow the fixed-length body. When extension objects are present, they MUST be preceded by an Extension Header as defined in Section 7 of [RFC4884], formatted as follows:¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Version (4)  |   Reserved    |           Checksum            |
+---------------+---------------+-------------------------------+
Each Extension Object following the Extension Header uses the object header format defined in Section 8 of [RFC4884]:¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |   Class-Num   |    C-Type     |
+-------------------------------+---------------+---------------+
|                   Object Payload (variable)                   |
~                                                               ~
+---------------------------------------------------------------+
When no extension objects are present, the Extension Header MUST be omitted entirely. Receivers determine whether extension objects are present by checking whether the ICMPv6 message length exceeds 16 octets, that is, the standard 4-octet ICMPv6 header plus the 12-octet fixed body.¶
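A receiver-side sketch of the length check and the walk over extension objects follows. It assumes, as in [RFC4884], that the object Length field counts the 4-octet object header as well as the payload. Validation of the Extension Header version and checksum is deliberately omitted, and the function name is hypothetical.

```python
import struct

ICMPV6_HEADER_LEN = 4  # Type, Code, Checksum
FIXED_BODY_LEN = 12    # Congestion Level through Congestion Metric Value
EXT_HEADER_LEN = 4     # Version, Reserved, Checksum
OBJ_HEADER_LEN = 4     # Length, Class-Num, C-Type


def parse_extensions(message: bytes) -> list:
    """Return (class_num, c_type, payload) tuples; empty if no extensions."""
    base = ICMPV6_HEADER_LEN + FIXED_BODY_LEN
    if len(message) < base:
        raise ValueError("truncated Long-haul CNP message")
    if len(message) == base:
        return []  # no Extension Header, hence no extension objects
    ext = message[base:]
    if len(ext) < EXT_HEADER_LEN:
        raise ValueError("truncated Extension Header")
    offset = EXT_HEADER_LEN  # header fields are not validated in this sketch
    objects = []
    while offset < len(ext):
        if len(ext) - offset < OBJ_HEADER_LEN:
            raise ValueError("truncated object header")
        length, class_num, c_type = struct.unpack_from("!HBB", ext, offset)
        if length < OBJ_HEADER_LEN or offset + length > len(ext):
            raise ValueError("invalid object length")
        payload = ext[offset + OBJ_HEADER_LEN:offset + length]
        objects.append((class_num, c_type, payload))
        offset += length
    return objects
```

Rejecting a Length smaller than the object header prevents the loop from stalling on a malformed object, and the bounds check keeps a corrupted Length from reading past the end of the message.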
This format achieves backward compatibility by reusing the standard CNP BTH Opcode (0x81) and extending the packet through a reserved bit in the BTH. This approach avoids the need for IETF to request a new Opcode from the InfiniBand Trade Association (IBTA), while ensuring that legacy endpoints that do not support Long-haul CNP will still process the packet as a standard CNP and apply their default rate reduction behavior.¶
In the standard RoCEv2 BTH, the 6-bit field immediately following the BECN (Backward Explicit Congestion Notification) bit is reserved and MUST be set to zero per the current RoCEv2 specification. This document proposes to the IBTA the definition of the most significant bit of this 6-bit reserved field as the Extension Present (E) bit:¶
The E bit definition is a proposal for IBTA consideration. This bit resides within a reserved field that is under IBTA governance, and formal allocation requires IBTA approval. Prior to such approval, implementations MUST NOT deploy this format in environments where non-participating endpoints or intermediate nodes may be present, as legacy devices that validate the reserved field as zero may reject packets with E=1. See Section 10.5 for further coordination details.¶
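As a sketch of the proposed bit position: the fifth octet of the BTH carries one flag bit, the BECN bit, and the 6-bit reserved field, so the most significant reserved bit corresponds to mask 0x20 when bits are numbered most significant first. The helper names are hypothetical, and the E bit itself remains a proposal subject to IBTA approval.

```python
BTH_LEN = 12          # Base Transport Header length in octets (96 bits)
BTH_FLAGS_OFFSET = 4  # first octet of the second 32-bit BTH word
E_BIT_MASK = 0x20     # most significant bit of the 6-bit reserved field


def extension_present(bth: bytes) -> bool:
    """Return True if the proposed Extension Present (E) bit is set."""
    if len(bth) < BTH_LEN:
        raise ValueError("truncated BTH")
    return bool(bth[BTH_FLAGS_OFFSET] & E_BIT_MASK)


def set_extension_present(bth: bytes) -> bytes:
    """Return a copy of the BTH with the proposed E bit set."""
    if len(bth) < BTH_LEN:
        raise ValueError("truncated BTH")
    out = bytearray(bth)
    out[BTH_FLAGS_OFFSET] |= E_BIT_MASK
    return bytes(out)
```

Setting the bit with a bitwise OR leaves the two preceding flag bits and the remaining five reserved bits unchanged.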
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     MAC Header (112 bits)                     |
+---------------------------------------------------------------+
|                IPv4/IPv6 Header (160/320 bits)                |
+---------------------------------------------------------------+
|              UDP Header (64 bits, DstPort=4791)               |
+---------------------------------------------------------------+

BTH (Base Transport Header, 96 bits):
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| OpCode (0x81) |S|M|Pad| TVer  |         Partition Key         |
+---------------+-+-+---+-------+-------------------------------+
|F|B|E| RSV(5b) |               DestQP (24 bits)                |
+-+-+-+-+-+-+-+-+-----------------------------------------------+
|A|  RSV (7b)   |                 PSN (24 bits)                 |
+-+-------------+-----------------------------------------------+

Long-haul CNP Extension Fields (present when E=1):
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Cong. Level  | Action Flags  |           Parameter           |
+---------------+---------------+-------------------------------+
|                  Source QP Number (32 bits)                   |
+---------------------------------------------------------------+
|  Metric Type  |       Congestion Metric Value (24 bits)       |
+---------------+-----------------------------------------------+
|             Optional Extension Objects (variable)             |
~                                                               ~
+---------------------------------------------------------------+
|                        ICRC (32 bits)                         |
+---------------------------------------------------------------+
|                         FCS (32 bits)                         |
+---------------------------------------------------------------+
When generating a Long-haul CNP in RoCEv2 format, the congestion-aware intermediate node MUST set the BTH fields as follows:¶
The extension fields immediately follow the BTH when E=1. Their encoding is consistent with the ICMPv6 format:¶
In the RoCEv2 specification, the ICRC is computed over the packet contents from the IP header through the byte immediately preceding the ICRC field itself, with the variant IP, UDP, and BTH fields replaced by defined values. When a Long-haul CNP is constructed with E=1, the extension fields and any Optional Extension Objects are placed between the BTH and the ICRC field. Therefore, the ICRC computation naturally covers the extension data.¶
A legacy RNIC that receives a Long-haul CNP will compute the ICRC over the same byte range (everything up to the byte preceding the ICRC field). Because the ICRC is always located at a fixed offset from the end of the Ethernet frame (immediately before the FCS), the legacy RNIC will include the extension fields in its ICRC computation even though it does not parse them. Consequently, the ICRC verification will succeed on legacy endpoints, and no ICRC mismatch will occur due to the presence of extension fields.¶
Consider a DCI path: Source (DC-A) -> N1 -> N2 -> Dest (DC-B), where N1 and N2 are congestion-aware intermediate nodes, and the WAN RTT is 10 ms.¶
+--------+    +----+    +----+    +--------+
| Source |--->| N1 |--->| N2 |--->|  Dest  |
|  DC-A  |    |    |    |    |    |  DC-B  |
+--------+    +----+    +----+    +--------+
 10.0.0.1                          10.0.0.4
 QP=100                            QP=200
The Long-haul CNP mechanism is designed for incremental deployment:¶
Because the Long-haul CNP in RoCEv2 format uses the standard CNP Opcode (0x81) with all mandatory BTH fields set to their standard CNP values, a legacy RNIC that does not recognize the E bit will process the packet as a standard CNP. The legacy RNIC will ignore the extension fields (which appear after the expected CNP boundary) and apply its default rate reduction behavior. This provides a graceful degradation path: precise control for supporting endpoints, and standard CNP rate reduction for legacy endpoints.¶
However, the Long-haul CNP frame is larger than a standard CNP frame due to the extension fields (at least 12 additional octets for the base extension, plus any Optional Extension Objects). Legacy RNIC implementations that perform strict frame length validation against the expected standard CNP size may reject the Long-haul CNP packet. Deployments SHOULD verify that legacy endpoints in the network tolerate CNP frames with additional trailing data beyond the standard BTH before enabling the RoCEv2 Long-haul CNP format. In environments where legacy endpoints are known to perform strict length validation, deployments SHOULD either use the ICMPv6 format instead or upgrade all endpoints to support the Long-haul CNP extension.¶
Long-haul CNP packets carry control instructions that directly affect the source's sending behavior. The following security considerations apply:¶
This document requests IANA to allocate a new value from the "ICMPv6 'type' Numbers" registry for the Long-haul CNP message type. The value SHOULD be allocated from the informational message range (128-255).¶
| Type | Name | Reference |
|---|---|---|
| TBD1 | Long-haul Congestion Notification | [This Document] |
This document requests IANA to create a sub-registry titled "Long-haul Congestion Notification Code Values" under the "ICMPv6 'type' Numbers" registry, for Code values associated with the ICMPv6 Type allocated in Section 10.1. The initial contents of this sub-registry are:¶
| Code | Name | Reference |
|---|---|---|
| 0 | Flow-level Long-haul CNP | [This Document] |
| 1-253 | Unassigned | |
| 254-255 | Experimental | [This Document] |
New Code values in the range 1-253 are to be assigned via Specification Required policy [RFC8126].¶
This document requests IANA to allocate a new Class-Num value from the "ICMP Extension Object Classes and Class Sub-types" registry established by [RFC4884].¶
| Class-Num | Class Name | Reference |
|---|---|---|
| TBD2 | Long-haul CNP Extension | [This Document] |
Within this Class-Num, the following C-Type values are defined:¶
| C-Type | Name | Reference |
|---|---|---|
| 0 | Reserved | [This Document] |
| 1 | Timestamp | [This Document] |
| 2 | Device Identifier | [This Document] |
| 3 | Path Identifier | [This Document] |
| 4-253 | Unassigned | |
| 254-255 | Experimental | [This Document] |
New C-Type values in the range 4-253 are to be assigned via Specification Required policy [RFC8126].¶
This document requests IANA to create a new registry titled "Long-haul CNP Congestion Metric Type Values". The initial contents of this registry are:¶
| Value | Name | Unit | Reference |
|---|---|---|---|
| 0 | Unspecified | N/A | [This Document] |
| 1 | Queue Depth | Kilobytes | [This Document] |
| 2 | Queue Growth Rate | Kilobytes/ms | [This Document] |
| 3 | ECN Marking Rate | Percentage (0-100) | [This Document] |
| 4 | RTT-based Metric | Microseconds | [This Document] |
| 5-253 | Unassigned | | |
| 254-255 | Experimental | | [This Document] |
New values in the range 5-253 are to be assigned via Specification Required policy [RFC8126].¶
The RoCEv2 Long-haul CNP format defined in this document proposes the use of the most significant bit of the 6-bit reserved field following the BECN bit in the BTH as an Extension Present (E) bit. The RoCEv2 BTH format is defined by the InfiniBand Trade Association (IBTA), and the reserved field is under IBTA governance.¶
This document respectfully requests that the IBTA consider allocating this bit as the "Long-haul Extension Present" indicator for CNP packets (Opcode 0x81). The E bit definition specified in this document is a proposal intended to facilitate IBTA review; it does not constitute a unilateral allocation by the IETF of IBTA-governed protocol space.¶
Until IBTA formally approves this allocation, implementations of the RoCEv2 format defined in this document are considered experimental and MUST only be deployed in controlled environments where all endpoints and intermediate nodes are known to support this extension. Specifically, implementations MUST NOT send Long-haul CNP packets in RoCEv2 format to endpoints that have not been explicitly configured or negotiated to accept them.¶