<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rfc [
<!ENTITY nbsp "&#160;">
<!ENTITY zwsp "&#8203;">
<!ENTITY nbhy "&#8209;">
<!ENTITY wj "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<?rfc compact="yes"?>
<?rfc text-list-symbols="o*+-"?>
<?rfc subcompact="no"?>
<?rfc sortrefs="no"?>
<?rfc symrefs="yes"?>
<?rfc strict="yes"?>
<?rfc toc="yes"?>
<rfc category="info" consensus="true"
     docName="draft-zz-scone-rate-advice-rocev2-00" ipr="trust200902"
     obsoletes="" sortRefs="false" submissionType="IETF" symRefs="true"
     tocInclude="true" updates="" version="3" xml:lang="en"
     xmlns:xi="http://www.w3.org/2001/XInclude">
  <front>
    <title abbrev="Rate Advice for RoCEv2">SCONE-Based Rate Advice for RoCEv2
    Networks</title>

    <seriesInfo name="Internet-Draft"
                value="draft-zz-scone-rate-advice-rocev2-00"/>

    <author fullname="Guangyu Zhao" initials="G." surname="Zhao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>zhaoguangyu@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Cheng Zhou" initials="C." surname="Zhou">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>zhouchengyjy@chinamobile.com</email>
      </address>
    </author>

    <date day="2" month="July" year="2026"/>

    <workgroup>scone</workgroup>

    <abstract>
      <t>This document describes scenarios in which the Standard
      Communication with Network Elements (SCONE) approach applies to RoCEv2
      networks. SCONE defines a mechanism by which network elements on the
      forwarding path deliver throughput advice to endpoints. This document
      applies that concept to RoCEv2 endpoints and specifies how Rate Advice
      is carried in RoCEv2 packets. Rate Advice is generated by network nodes
      (e.g., switches) and is either a rate limit defined by network policy
      or a quantitative rate adjustment recommendation derived from network
      status information.</t>

      <t>This document specifies the packet format for Rate Advice and
      describes how the advised rate is obtained.</t>
    </abstract>

    <note removeInRFC="false">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in RFC 2119.</t>
    </note>
  </front>

  <middle>
    <section anchor="introduction" numbered="true" toc="default">
      <name>Introduction</name>

      <t>Remote Direct Memory Access (RDMA) enables data to be read from and
      written to remote memory without involving the CPU, which reduces
      latency, improves throughput, and lowers the computing overhead of
      devices. It is widely deployed in high-performance computing (HPC) and
      artificial intelligence (AI) scenarios. InfiniBand <xref
      target="INFINIBAND"/>, a lossless network fabric used for HPC and AI,
      supports RDMA natively. The RDMA over Converged Ethernet (RoCE) standard
      brings RDMA capabilities to Ethernet. Among its versions, RoCEv2 carries
      the InfiniBand transport layer over UDP/IP and has become the most
      widely adopted version in the industry.</t>

      <t>Data Center Quantized Congestion Notification (DCQCN) <xref
      target="DCQCN"/> is the default congestion control algorithm for RoCEv2
      networks. By default, a DCQCN flow starts transmitting at the physical
      line rate. The sender reduces its transmission rate upon receiving
      Congestion Notification Packets (CNPs) and restores the rate quickly
      once the network is free of congestion, forming an end-to-end rate
      adjustment mechanism. However, Explicit Congestion Notification (ECN)
      is a binary signal: it indicates only whether congestion is present,
      not how severe it is. This limits the timeliness and effectiveness of
      rate adjustment at the sender.</t>

      <t>Following the SCONE concept <xref target="SCONE-PROTOCOL"/>, network
      elements can complement this binary signal by providing RoCEv2
      endpoints with an advised transmission rate. This document defines a
      Rate Advice mechanism adapted to the RoCEv2 protocol to deliver such
      quantitative rate recommendations.</t>
    </section>

    <section anchor="terminology" numbered="true" toc="default">
      <name>Terminology</name>

      <t>The following terms are used in this document:</t>

      <ul>
        <li>
          <t>Rate Advice: A quantitative rate adjustment recommendation
          generated by network nodes (e.g., switches) and delivered to the
          sender.</t>
        </li>

        <li>
          <t>Rate Advice Message: An extended message based on the RoCEv2
          format, which carries Rate Advice information from network nodes to
          the sender.</t>
        </li>
      </ul>
    </section>

    <section anchor="deployment-scenarios" numbered="true" toc="default">
      <name>Deployment Scenarios</name>

      <section numbered="true" toc="default">
        <name>Deployment within Intelligent Computing Data Centers</name>

        <t>In this scenario, RoCEv2 flows stay within a single intelligent
        computing data center, whose network is mainly built with a two-tier
        Leaf-Spine or three-tier Clos architecture. Network nodes such as
        Spine and Leaf switches can monitor egress queue depth, queuing
        delay, and packet loss rate. When an uplink becomes congested, the
        switch generates Rate Advice Messages based on real-time monitoring
        results and sends them directly to the senders, as a supplement to
        DCQCN.</t>

        <t>In this deployment scenario, SCONE-enabled network nodes reside on
        the data plane of the data center network and coexist with the
        existing DCQCN protocol stack. Senders can receive both CNPs from
        DCQCN and Rate Advice from SCONE. It is recommended that senders
        react first to the urgent congestion indications carried in CNPs, and
        otherwise adjust their rate gradually toward the advised rate carried
        in SCONE Rate Advice.</t>
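
        <t>The following is a minimal, non-normative sketch of how a sender
        might combine the two signals. The class name, the multiplicative cut
        applied on a CNP, and the step size used to approach the advised rate
        are all assumptions made for illustration. In particular, the CNP
        handler is a placeholder and does not reproduce the DCQCN rate
        reduction algorithm.</t>

        <sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class SenderRate:
    """Per-flow sender state. All names here are hypothetical."""
    rate_mbps: float        # current transmission rate
    line_rate_mbps: float   # physical line rate, used as an upper bound

    def on_cnp(self, cut_factor: float = 0.5) -> None:
        # Urgent signal: reduce the rate at once. The fixed factor is a
        # placeholder, not the DCQCN alpha-based reduction.
        self.rate_mbps *= cut_factor

    def on_rate_advice(self, advised_mbps: float, step: float = 0.25) -> None:
        # Non-urgent signal: move a fraction of the way toward the advice,
        # never aiming above the line rate.
        target = min(advised_mbps, self.line_rate_mbps)
        self.rate_mbps += step * (target - self.rate_mbps)
```
]]></sourcecode>

        <t>In this sketch a CNP always takes effect immediately, while Rate
        Advice only nudges the rate, which mirrors the priority ordering
        recommended above.</t>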
      </section>

      <section numbered="true" toc="default">
        <name>Deployment across Intelligent Computing Data Centers</name>

        <t>To interconnect distributed intelligent computing clusters,
        multiple intelligent computing data centers are connected via WAN
        links (e.g., dedicated lines or SRv6 tunnels) to form a cross-domain
        logical intelligent computing pool. As large language models continue
        to grow, a single data center is constrained by power supply, heat
        dissipation, and physical space, so cross-data-center collaborative
        training is increasingly in demand. As a recent example, Google
        DeepMind used Decoupled DiLoCo <xref target="DiLoCo"/> to train a
        12-billion-parameter model across four regions in the United States,
        achieving training more than 20 times faster than traditional
        synchronous methods.</t>

        <t>In this deployment model, data center gateways or WAN edge routers
        act as aggregation nodes for RoCEv2 traffic. The Round-Trip Time (RTT)
        of WAN links can reach tens of milliseconds, far longer than the RTT
        inside a data center. Because end-to-end congestion feedback must
        cross the WAN before it reaches the sender, a long RTT delays rate
        reduction. In extreme cases, with an RTT of 20 ms and a packet loss
        rate of 0.1%, throughput may drop to nearly zero <xref
        target="URDMA"/>.</t>

        <t>SCONE-enabled network nodes deployed on central gateways or WAN
        edge routers can generate path-based Rate Advice, which endpoints use
        as a reference for their transmission rate. Two deployment modes are
        available:</t>

        <ul>
          <li>
            <t>Near-source Gateway Generation: The source gateway sends Rate
            Advice to senders, reflecting the congestion status of the WAN
            ingress path.</t>
          </li>

          <li>
            <t>Remote Gateway Generation: The WAN egress gateway sends
            backpressure signals to the source gateway, and then the source
            gateway forwards Rate Advice to senders.</t>
          </li>
        </ul>

        <t>The near-source gateway mode has a shorter control loop and
        responds faster, and is therefore preferred.</t>
      </section>
    </section>

    <section anchor="framework-overview" numbered="true" toc="default">
      <name>Overview of Rate Advice Framework</name>

      <t>This section describes the overall architecture of the Rate Advice
      framework.</t>

      <t>The Rate Advice mechanism introduces a dedicated, direct signaling
      path from network nodes (switches) to senders, as shown in <xref
      target="fig-signaling-path"/>.</t>

      <figure anchor="fig-signaling-path">
        <name>Rate Advice Signaling Path</name>

        <artwork align="center" xml:space="preserve">+----------+         +-----------+         +----------+
|          &lt;---------&gt;  Network  &lt;---------&gt;          |
|  Sender  |  RoCEv2 |  Element  |  RoCEv2 | Receiver |
|          |  Data   |           |  Data   |          |
+-----^----+         +-----+-----+         +----------+
      |                    |
      |  Rate Advice Msg   |
      +--------------------+          </artwork>
      </figure>

      <t>The sender indicates its support for the Rate Advice capability in
      the RoCEv2 packet header. Network nodes parse this indication and enable
      the Rate Advice function only for capable senders. Network nodes on the
      data path calculate the advised rate for each flow from either the rate
      limits pre-defined by network policy or the real-time egress queue
      status. The network node encapsulates the Rate Advice into a
      SCONE-RoCEv2 Rate Advice Message and transmits it to the sender. The
      sender parses the advised rate from the message and adjusts its
      transmission rate accordingly.</t>
    </section>

    <section anchor="advised-rate-acquisition" numbered="true" toc="default">
      <name>Acquisition of Advised Rate</name>

      <t>The advised rate can be obtained in two ways. A network element can
      look up the advised rate or rate upper limit for each flow in its
      configured rate policies, or it can calculate the value from its egress
      queue depth, queuing delay, and packet loss rate.</t>

      <section numbered="true" toc="default">
        <name>Rate Policy Based on Network Element Configuration</name>

        <t>Network nodes directly obtain the advised rate or rate upper limit
        for each flow according to pre-configured rate policies set by
        administrators, such as per-flow rate limiting and priority bandwidth
        allocation. This method applies to scenarios where operators have
        explicit Service Level Agreement (SLA) and traffic engineering
        policies.</t>
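
        <t>As a non-normative illustration, a policy lookup of this kind
        could look like the sketch below. The flow key, both tables, and the
        fallback from a per-flow limit to a per-priority allocation are
        assumptions made for this example. This document does not define how
        policies are stored or matched.</t>

        <sourcecode type="python"><![CDATA[
```python
from typing import NamedTuple, Optional


class FlowKey(NamedTuple):
    """Hypothetical flow identifier used only in this example."""
    src_ip: str
    dst_ip: str
    dest_qp: int


# Hypothetical administrator-configured per-flow limits, in Mbps.
PER_FLOW_LIMIT_MBPS = {
    FlowKey("192.0.2.1", "192.0.2.2", 0x12): 40_000,
}

# Hypothetical per-priority bandwidth allocation, in Mbps.
PRIORITY_LIMIT_MBPS = {0: 10_000, 3: 100_000}


def policy_rate_mbps(flow: FlowKey, priority: int) -> Optional[int]:
    """Return the configured rate for a flow, or None if no policy applies."""
    if flow in PER_FLOW_LIMIT_MBPS:
        return PER_FLOW_LIMIT_MBPS[flow]
    return PRIORITY_LIMIT_MBPS.get(priority)
```
]]></sourcecode>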
      </section>

      <section numbered="true" toc="default">
        <name>Calculation Based on Egress Bandwidth Congestion Status</name>

        <t>Network nodes calculate the advised rate based on the real-time
        status of local egress queues, including queue depth, queuing delay
        and packet loss rate. This method is applicable to Rate Advice
        scenarios that require real-time response to network congestion.</t>
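
        <t>This document does not fix a particular formula. Purely as an
        illustration, the sketch below derives an advised rate from queue
        depth alone: each flow is offered an equal share of the egress link,
        and that share is scaled down in proportion to how far the queue
        exceeds a target depth. The function name, its parameters, the equal
        share, and the scaling rule are all assumptions. Queuing delay and
        packet loss rate are not modeled.</t>

        <sourcecode type="python"><![CDATA[
```python
def advised_rate_mbps(link_mbps: int, active_flows: int,
                      queue_bytes: int, target_queue_bytes: int) -> int:
    """Illustrative only: equal share of the link, reduced as the
    egress queue grows beyond a target depth."""
    if active_flows < 1:
        raise ValueError("active_flows must be >= 1")
    fair_share = link_mbps / active_flows
    if queue_bytes <= target_queue_bytes:
        # Queue at or below target: advise the full share.
        return int(fair_share)
    # Scale by target/actual, so a queue twice the target halves the advice.
    # Never advise less than 1 Mbps.
    return max(1, int(fair_share * target_queue_bytes / queue_bytes))
```
]]></sourcecode>

        <t>For example, with a 400,000 Mbps link shared by four flows, the
        sketch advises 100,000 Mbps per flow while the queue is at or below
        its target, and 50,000 Mbps once the queue is twice the target.</t>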
      </section>
    </section>

    <section anchor="rate-advice-message" numbered="true" toc="default">
      <name>Rate Advice Message</name>

      <t>This section defines the format of the SCONE-RoCEv2 Rate Advice
      Message.</t>

      <section numbered="true" toc="default">
        <name>SCONE-RoCEv2 Packet Format</name>

        <t>The Base Transport Header (BTH) is the transport layer header of
        InfiniBand, which is adopted by both RoCEv1 and RoCEv2. A standard
        RoCEv2 packet carries a UDP header, with the structure as follows:</t>

        <artwork align="center" xml:space="preserve">  [ETH + IP + UDP(dport 4791) + IB(BTH + ExtHDR + PAYLOAD + ICRC)]</artwork>

        <t>The Rate Advice Message reuses the long BTH header format of RoCEv2
        and is identified by a new OpCode value (RATE_ADVICE). The structure
        of the message is defined below:</t>

        <artwork align="center" xml:space="preserve">  [ETH + IP + UDP(dport 4791) +
     IB(BTH(OpCode = RATE_ADVICE) + Rate Advice Packet + ICRC)]</artwork>

        <t>As a RoCEv2 control message, the Rate Advice Message is
        encapsulated in the UDP payload with the UDP destination port set to
        4791. The Rate Advice Packet that follows the BTH has the following
        layout:</t>

        <artwork align="center" xml:space="preserve">   Rate Advice Packet {
       Rate (32),
       Version (32),
       Destination QP ID (32),
       Source QP ID (32),
   }       </artwork>

        <t>A new OpCode value (e.g., 0x1D) shall be assigned to the BTH OpCode
        field to identify the packet as a Rate Advice Message. Other BTH
        fields (such as P_Key and FECN) shall be set in compliance with the
        standard RoCEv2 specifications.</t>
      </section>

      <section numbered="true" toc="default">
        <name>Definition of SCONE-RoCEv2 Fields</name>

        <ul>
          <li>
            <t>Rate (32 bits): The advised transmission rate, measured in
            Mbps. This value can be calculated by network nodes based on
            congestion metrics (queue depth, queuing delay, packet loss rate),
            or directly derived from rate policies configured by
            administrators.</t>
          </li>

          <li>
            <t>Version (32 bits): Version number. The initial version defined
            in this specification is 0x00000001. The version number will be
            incremented for future backward-incompatible modifications.</t>
          </li>

          <li>
            <t>Destination QP ID (32 bits): Destination Queue Pair Identifier.
            This field should be filled with the Queue Pair (QP) number
            corresponding to the sender. When generating a Rate Advice
            Message, the network node extracts this field from the BTH of the
            original RoCEv2 data packet and copies the value. Upon reception,
            the sender uses this field to associate the Rate Advice with the
            corresponding flow.</t>
          </li>

          <li>
            <t>Source QP ID (32 bits): Source Queue Pair Identifier. This
            field is generally set to the QP number used by the receiver or
            the network node.</t>
          </li>
        </ul>
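
        <t>The field layout above can be exercised with the following
        non-normative sketch, which packs and parses the 16-byte Rate Advice
        Packet only. The BTH, ICRC, and outer headers are omitted. Network
        byte order and the function names are assumptions of this example,
        since this document does not state the byte order.</t>

        <sourcecode type="python"><![CDATA[
```python
import struct

# Four unsigned 32-bit fields in the order given by the packet format:
# Rate, Version, Destination QP ID, Source QP ID.
# "!" selects network (big-endian) byte order, which is an assumption.
RATE_ADVICE = struct.Struct("!IIII")
VERSION_1 = 0x00000001


def pack_rate_advice(rate_mbps: int, dest_qp: int, src_qp: int) -> bytes:
    """Build the Rate Advice Packet body (hypothetical helper)."""
    return RATE_ADVICE.pack(rate_mbps, VERSION_1, dest_qp, src_qp)


def unpack_rate_advice(data: bytes) -> dict:
    """Parse the Rate Advice Packet body, rejecting unknown versions."""
    rate, version, dest_qp, src_qp = RATE_ADVICE.unpack(
        data[:RATE_ADVICE.size])
    if version != VERSION_1:
        raise ValueError(f"unsupported version {version:#x}")
    return {"rate_mbps": rate, "dest_qp": dest_qp, "src_qp": src_qp}
```
]]></sourcecode>

        <t>A 32-bit Rate field expressed in Mbps can represent values up to
        4,294,967,295 Mbps.</t>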
      </section>
    </section>

    <section anchor="security-considerations" numbered="true" toc="default">
      <name>Security Considerations</name>

      <t>(TBD)</t>
    </section>

    <section anchor="iana-considerations" numbered="true" toc="default">
      <name>IANA Considerations</name>

      <t>This document does not require any IANA actions.</t>
    </section>

    <section anchor="contributors" numbered="true" toc="default">
      <name>Contributors</name>

      <t>The following people have substantially contributed to this
      document:</t>

      <author fullname="Hongwei Yang" initials="H." surname="Yang">
        <organization>China Mobile</organization>

        <address>
          <postal>
            <city>Beijing</city>

            <country>China</country>
          </postal>

          <email>yanghongwei@chinamobile.com</email>
        </address>
      </author>

      <author fullname="Zhiqiang Li" initials="Z." surname="Li">
        <organization>China Mobile</organization>

        <address>
          <postal>
            <city>Beijing</city>

            <country>China</country>
          </postal>

          <email>lizhiqiangyjy@chinamobile.com</email>
        </address>
      </author>
    </section>
  </middle>

  <back>
    <references>
      <name>Informative References</name>

      <reference anchor="SCONE-PROTOCOL">
        <front>
          <title>Standard Communication with Network Elements (SCONE)
          Protocol</title>

          <author fullname="M. Thomson" initials="M." surname="Thomson"/>

          <date month="May" year="2025"/>
        </front>

        <seriesInfo name="Internet-Draft" value="draft-ietf-scone-protocol-04"/>

        <annotation>Work in Progress</annotation>
      </reference>

      <reference anchor="DiLoCo">
        <front>
          <title>Decoupled DiLoCo for Resilient Distributed
          Pre-training</title>

          <author fullname="A. Douillard" initials="A." surname="Douillard"/>

          <author fullname="K. Rush" initials="K." surname="Rush"/>

          <author fullname="Y. Donchev" initials="Y." surname="Donchev"/>

          <date month="April" year="2026"/>
        </front>
      </reference>

      <reference anchor="INFINIBAND">
        <front>
          <title>InfiniBand Architecture Specification Volume 1, Release
          1.5</title>

          <author>
            <organization>InfiniBand Trade Association</organization>
          </author>

          <date month="June" year="2020"/>
        </front>
      </reference>

      <reference anchor="DCQCN">
        <front>
          <title>Congestion Control for Large-Scale RDMA Deployments</title>

          <author fullname="Y. Zhu" initials="Y." surname="Zhu"/>

          <date year="2015"/>
        </front>

        <seriesInfo name="ACM SIGCOMM" value="Proceedings"/>
      </reference>

      <reference anchor="URDMA">
        <front>
          <title>URDMA technologies for wide-area high-throughput
          network</title>

          <author fullname="X D Duan" initials="X D" surname="Duan"/>

          <author fullname="L Lu" initials="L" surname="Lu"/>

          <author fullname="T Sun" initials="T" surname="Sun"/>

          <author fullname="Z Q Li" initials="Z Q" surname="Li"/>

          <author fullname="H W Yang" initials="H W" surname="Yang"/>

          <author fullname="Z P Du" initials="Z P" surname="Du"/>

          <date month="June" year="2024"/>
        </front>

        <seriesInfo name="Journal" value="ZTE Technology Journal"/>

        <seriesInfo name="Volume" value="30"/>

        <seriesInfo name="Issue" value="6"/>

        <seriesInfo name="Pages" value="23-30"/>
      </reference>
    </references>
  </back>
</rfc>
