| Internet-Draft | Rate Advice for RoCEv2 | July 2026 |
| Zhao & Zhou | Expires 3 January 2027 | [Page] |
This document describes the applicable scenarios of Standard Communication with Network Elements (SCONE) in RoCEv2 networks. SCONE defines a mechanism that enables network elements on the forwarding path to deliver throughput guidance to endpoints; this document applies that mechanism to RoCEv2 endpoints and specifies how Rate Advice is carried in RoCEv2 packets. Rate Advice is generated by network nodes (e.g., switches) and is either a rate limit defined by network policy or a quantitative rate adjustment recommendation derived from network status information.¶
This document specifies the packet format for Rate Advice and the calculation method for the advised rate.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 3 January 2027.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Remote Direct Memory Access (RDMA) enables data to be read from and written to remote memory without involving the CPU, which reduces latency, improves throughput and lowers the computing overhead of devices. It is widely deployed in high-performance computing (HPC) and artificial intelligence (AI) scenarios. InfiniBand [INFINIBAND], a lossless network designed for HPC and AI, supports RDMA natively. The RDMA over Converged Ethernet (RoCE) standard brings RDMA capabilities to Ethernet. Among its versions, RoCEv2 carries the InfiniBand transport layer over UDP/IP and has become the most widely adopted version in the industry.¶
Data Center Quantized Congestion Notification (DCQCN) is the default congestion control algorithm for RoCEv2 networks [DCQCN]. When a DCQCN flow starts transmitting, it sends data at the physical line rate by default. It reduces the transmission rate upon receiving Congestion Notification Packets (CNPs) and restores the rate quickly when the network is free of congestion, forming an end-to-end rate adjustment mechanism. However, because Explicit Congestion Notification (ECN) is a binary signal, it can only indicate the presence or absence of congestion, not its severity, which limits the timeliness and effectiveness of rate adjustment on the sender side.¶
Based on the concept of SCONE [SCONE-PROTOCOL], providing advised transmission rates for RoCEv2 endpoints is of great practical significance. This document defines a Rate Advice mechanism adapted to the RoCEv2 protocol, to deliver quantitative rate recommendations.¶
The following terms are used in this document:¶
Inside an intelligent computing data center, RoCEv2 flows are generally exchanged within a single data center, and the network is mainly built with a two-tier Leaf-Spine or three-tier Clos architecture. Network nodes such as Spine and Leaf switches can monitor egress queue depth, queuing delay and packet loss rate. When an uplink becomes congested, the switch generates Rate Advice Messages based on real-time monitoring results and sends them directly to the senders, as a supplement to DCQCN.¶
In this deployment scenario, SCONE-enabled network nodes reside on the data plane of the data center network and coexist with the existing DCQCN protocol stack. Senders can receive both CNP congestion notifications from DCQCN and Rate Advice from SCONE. The recommended implementation policy is that senders give priority to responding to emergency congestion indications carried in CNP, and conduct gradual rate adjustment with reference to the advised rate in SCONE Rate Advice.¶
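The recommended policy above can be sketched as a single rate-update function. This is only an illustration under stated assumptions, not part of the specification: the function name `next_rate`, the multiplicative cut of 0.5 on a CNP, and the step fraction of 0.25 toward the advised rate are hypothetical values chosen for readability, and the actual DCQCN rate-reduction computation is more involved.

```python
from typing import Optional

CNP_CUT = 0.5       # assumed multiplicative decrease applied on a CNP
ADVICE_STEP = 0.25  # assumed fraction of the gap closed per Rate Advice


def next_rate(current: float, line_rate: float, cnp_received: bool,
              advised_rate: Optional[float] = None) -> float:
    """Return the new sending rate in Mbps for one update interval."""
    if cnp_received:
        # CNP takes priority: cut sharply and ignore advice for this interval.
        return current * CNP_CUT
    if advised_rate is not None:
        # No CNP: move part of the way toward the advised rate,
        # never targeting more than the physical line rate.
        target = min(advised_rate, line_rate)
        return current + ADVICE_STEP * (target - current)
    # No signal in this interval: keep the current rate.
    return current
```

Handling the CNP first models the "priority" in the text; stepping only part of the way toward the advised rate models the "gradual" adjustment.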
To interconnect distributed intelligent computing clusters, multiple intelligent computing data centers are connected via WAN links (e.g., dedicated lines, SRv6 tunnels) to form a cross-domain logical intelligent computing pool. As large language models continue to grow, a single data center is constrained by power supply, heat dissipation and physical space, which makes collaborative training across data centers an increasingly common requirement. As a recent example, Google DeepMind used Decoupled DiLoCo [DiLoCo] to train a 12-billion-parameter model across four regions in the United States, achieving more than 20x faster training than traditional synchronous methods.¶
In this deployment model, data center gateways or WAN edge routers act as aggregation nodes for RoCEv2 traffic. The Round-Trip Time (RTT) of WAN links is much longer than that inside data centers, which can reach tens of milliseconds. The long RTT leads to delayed rate reduction. In extreme cases, when the RTT is 20 ms and the packet loss rate is 0.1%, the throughput may drop to nearly zero [URDMA].¶
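To make the effect of a long RTT concrete, the sketch below computes how much data a sender emits during one round trip, which is roughly the amount already in flight before any end-to-end feedback can take effect. The 20 ms RTT is the figure cited above; the 100 Gbps line rate and the helper name `inflight_bytes` are assumptions for illustration.

```python
def inflight_bytes(rate_gbps: float, rtt_ms: float) -> float:
    """Bytes sent during one RTT at a constant rate (bandwidth-delay product)."""
    bits = rate_gbps * 1e9 * (rtt_ms / 1e3)
    return bits / 8


# Assumed example: a 100 Gbps sender on a path with a 20 ms RTT.
wan_mb = inflight_bytes(100, 20) / 1e6
```

Under these assumptions about 250 MB is sent before the first feedback can arrive, which illustrates why a signal generated closer to the sender shortens the reaction time.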
Deploying SCONE-enabled network nodes on central gateways or WAN edge routers to generate path-based Rate Advice provides valuable reference for the transmission rate of endpoints. Two deployment modes are available:¶
Near-source Gateway Generation: The source gateway sends Rate Advice to senders, reflecting the congestion status of the WAN ingress path.¶
Remote Gateway Generation: The WAN egress gateway sends backpressure signals to the source gateway, and then the source gateway forwards Rate Advice to senders.¶
The near-source gateway mode has a shorter control loop and responds faster, and is therefore the preferred mode.¶
This section describes the overall architecture of the Rate Advice framework.¶
The Rate Advice mechanism introduces a dedicated direct signaling path from network nodes (switches) to senders, as shown in Figure 1.¶
+----------+         +-----------+         +----------+
|          <--------->  Network  <--------->          |
|  Sender  | RoCEv2  |  Element  | RoCEv2  | Receiver |
|          |  Data   |           |  Data   |          |
+-----^----+         +-----+-----+         +----------+
      |                    |
      |  Rate Advice Msg   |
      +--------------------+
The sender indicates its support for the Rate Advice capability in the RoCEv2 packet header. Network nodes parse this indication and enable the Rate Advice function only for capable senders. Network devices on the data path calculate the advised rate for each flow according to pre-defined rate limits of network policies or real-time egress queue status. The network device encapsulates the Rate Advice into a SCONE-RoCEv2 message and transmits it to the sender. The sender parses the advised rate from the SCONE-RoCEv2 message and adjusts its transmission rate accordingly.¶
The advised rate can be obtained in two ways: from the rate policies configured on network elements, which give an advised rate or rate upper limit for each flow, or from a calculation based on the egress queue depth, queuing delay and packet loss rate of the network element.¶
Network nodes directly obtain the advised rate or rate upper limit for each flow according to pre-configured rate policies set by administrators, such as per-flow rate limiting and priority bandwidth allocation. This method applies to scenarios where operators have explicit Service Level Agreement (SLA) and traffic engineering policies.¶
Network nodes calculate the advised rate based on the real-time status of local egress queues, including queue depth, queuing delay and packet loss rate. This method is applicable to Rate Advice scenarios that require real-time response to network congestion.¶
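The two sources of the advised rate can be combined as in the sketch below. The text above does not give a formula, so everything here is an assumption made for illustration: the function and parameter names are hypothetical, the per-flow fair share is simply scaled down in proportion to how far the egress queue exceeds a target depth, and queuing delay and packet loss rate are left out for brevity.

```python
from typing import Optional

MAX_RATE_FIELD = 0xFFFFFFFF  # largest value that fits the 32-bit Rate field


def advised_rate_mbps(link_capacity_mbps: float, active_flows: int,
                      queue_depth_bytes: int, target_depth_bytes: int,
                      policy_limit_mbps: Optional[float] = None) -> int:
    """Hypothetical per-flow advised rate (Mbps) for one egress port."""
    # Start from an equal share of the egress link.
    rate = link_capacity_mbps / max(active_flows, 1)
    # Status-based part: shrink the share when the queue is above target.
    if queue_depth_bytes > target_depth_bytes > 0:
        rate *= target_depth_bytes / queue_depth_bytes
    # Policy-based part: an administrator-configured limit caps the result.
    if policy_limit_mbps is not None:
        rate = min(rate, policy_limit_mbps)
    # Clamp to what the 32-bit Rate field (in Mbps) can carry.
    return max(1, min(int(rate), MAX_RATE_FIELD))
```

For instance, with a 100 Gbps link shared by four flows and a queue at twice its target depth, this sketch halves the 25000 Mbps fair share.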
This section defines the format of the SCONE-RoCEv2 Rate Advice Message.¶
The Base Transport Header (BTH) is the transport layer header of InfiniBand, which is adopted by both RoCEv1 and RoCEv2. A standard RoCEv2 packet carries a UDP header, with the structure as follows:¶
[ETH + IP + UDP(dport 4791) + IB(BTH + ExtHDR + PAYLOAD + ICRC)]¶
The Rate Advice Message reuses the long BTH header format of RoCEv2 and is identified by a new OpCode value (RATE_ADVICE). The structure of the message is defined below:¶
[ETH + IP + UDP(dport 4791) +
IB(BTH(OpCode = RATE_ADVICE) + Rate Advice Packet + ICRC)]¶
As a RoCEv2 control message, the Rate Advice Message is carried in the UDP payload with the UDP destination port set to 4791. The Rate Advice Packet that follows the BTH has the following layout:¶
Rate Advice Packet {
  Rate (32),
  Version (32),
  Destination QP ID (32),
  Source QP ID (32)
}¶
A new OpCode value (e.g., 0x1D) shall be assigned to the BTH OpCode field to identify this packet as a Rate Advice Message. Other fields in BTH (such as P_Key, FECN, etc.) shall be set in compliance with standard RoCEv2 specifications.¶
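As an illustration of the BTH that precedes the Rate Advice Packet, the sketch below packs the 12-byte InfiniBand BTH with the example OpCode from the text. The helper name `build_bth` is hypothetical, 0x1D is only the example value given above and not an assigned code point, and the SE, M, PadCnt, TVer, FECN, BECN and A bits are all set to zero here as a simplifying assumption.

```python
import struct

RATE_ADVICE_OPCODE = 0x1D  # example value from the text, not an assigned code point


def build_bth(opcode: int, pkey: int, dest_qp: int, psn: int = 0) -> bytes:
    """Pack a 12-byte BTH in network byte order with all flag bits zero."""
    if not (0 <= dest_qp < (1 << 24) and 0 <= psn < (1 << 24)):
        raise ValueError("DestQP and PSN are 24-bit fields")
    flags = 0  # SE(1) | M(1) | PadCnt(2) | TVer(4)
    # Layout: OpCode(8) | flags(8) | P_Key(16) |
    #         reserved incl. FECN/BECN(8) + DestQP(24) | A(1) + reserved(7) + PSN(24)
    return struct.pack("!BBHII", opcode, flags, pkey, dest_qp, psn)


bth = build_bth(RATE_ADVICE_OPCODE, pkey=0xFFFF, dest_qp=0x12)
```

The ICRC that closes the packet is not computed in this sketch.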
Rate (32 bits): The advised transmission rate, measured in Mbps. This value can be calculated by network nodes based on congestion metrics (queue depth, queuing delay, packet loss rate), or directly derived from rate policies configured by administrators.¶
Version (32 bits): Version number. The initial version defined in this specification is 0x00000001. The version number will be incremented for future backward-incompatible modifications.¶
Destination QP ID (32 bits): Destination Queue Pair Identifier. This field should be filled with the Queue Pair (QP) number corresponding to the sender. When generating a Rate Advice Message, the network node extracts this field from the BTH of the original RoCEv2 data packet and copies the value. Upon reception, the sender uses this field to associate the Rate Advice with the corresponding flow.¶
Source QP ID (32 bits): Source Queue Pair Identifier. This field is generally set to the QP number used by the receiver or the network node.¶
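The four fields above can be serialized and parsed as in the following sketch. The names `RateAdvice`, `encode` and `decode` are hypothetical, and network byte order is an assumption, since the field descriptions do not state the byte order.

```python
import struct
from typing import NamedTuple

# Rate, Version, Destination QP ID, Source QP ID: four unsigned 32-bit fields.
_FMT = "!IIII"  # network byte order is assumed
VERSION_1 = 0x00000001


class RateAdvice(NamedTuple):
    rate_mbps: int
    version: int
    dest_qp: int
    src_qp: int


def encode(advice: RateAdvice) -> bytes:
    """Serialize the 16-byte Rate Advice Packet."""
    return struct.pack(_FMT, *advice)


def decode(payload: bytes) -> RateAdvice:
    """Parse a Rate Advice Packet and reject unknown versions."""
    if len(payload) < struct.calcsize(_FMT):
        raise ValueError("truncated Rate Advice Packet")
    advice = RateAdvice(*struct.unpack_from(_FMT, payload))
    if advice.version != VERSION_1:
        raise ValueError(f"unsupported version {advice.version:#x}")
    return advice
```

A sender would call `decode` on the bytes following the BTH and use `dest_qp` to look up the flow whose rate should be adjusted. Note that the BTH itself carries a 24-bit destination QP, so a QP number copied from it occupies only the low 24 bits of the 32-bit field.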
(TBD)¶
This document does not require any IANA actions.¶
The following people have substantially contributed to this document:¶