| Internet-Draft | RDMA Multicast over SRv6 | February 2026 |
| Li, et al. | Expires 1 September 2026 | [Page] |
This document specifies SRv6 (Segment Routing over IPv6) extensions for multicast delivery of RDMA (Remote Direct Memory Access) Reliable Connection (RC) traffic. It defines a new SRv6 endpoint behavior, End.MT, that performs per-receiver RDMA Base Transport Header (BTH) modifications at edge nodes of the multicast tree. It also specifies procedures for hop-by-hop aggregation of RDMA ACK, NACK, and CNP response messages along the reverse path. Together, these extensions allow RDMA RC endpoints to communicate using standard point-to-point Queue Pair (QP) semantics while the network distributes data packets over an IP multicast tree. Target deployment scenarios include multi-replica distributed storage writes, HPC collective communications, AI training parameter distribution, and large-scale inference KV cache distribution.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 1 September 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Large-scale distributed computing deployments, including data center interconnection, distributed AI training and inference, and national-scale computing networks, rely on high-throughput data transport. RDMA (Remote Direct Memory Access) provides kernel-bypass data transfer with low CPU overhead on both sending and receiving hosts. RDMA one-sided operations, where receive buffers are pre-registered at the receiver's Network Interface Card (NIC), further reduce CPU involvement at the receiving end.¶
Many distributed applications exhibit one-to-many traffic patterns, including multi-replica storage writes, HPC collective communications (broadcast, scatter), AI training parameter distribution, and KV cache distribution in inference pipelines. IP multicast delivery of such traffic can reduce total network bandwidth consumption compared to per-receiver unicast replication at the source.¶
The RDMA Reliable Connection (RC) transport mode is the most widely adopted RDMA mode because it supports the complete set of RDMA operations: Read, Write, and Atomic. However, each RC Queue Pair (QP) is a point-to-point association between exactly one sending QP and one receiving QP. RC packets carry per-connection identifiers in the Base Transport Header (BTH), specifically the Destination Queue Pair Number (QPN) and Packet Sequence Number (PSN). These per-connection fields prevent direct application of IP multicast replication to RC traffic because each receiver requires its own QPN and independently tracks PSN state.¶
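The role of the two per-connection BTH fields can be illustrated with a short parsing sketch. The 12-octet BTH layout (OpCode, flags, P_Key, then Destination QPN and PSN each in the low 24 bits of a 32-bit word) follows the InfiniBand Architecture Specification; the function name and returned keys are illustrative, not part of this specification.¶

```python
# Illustration of the per-connection BTH fields that prevent naive
# IP multicast replication of RC traffic. Layout per the InfiniBand
# Architecture Specification: BTH is 12 octets.
import struct

def parse_bth(bth: bytes) -> dict:
    opcode, flags, pkey, dqp_word, psn_word = struct.unpack("!BBHII", bth)
    return {
        "opcode": opcode,
        "dest_qpn": dqp_word & 0xFFFFFF,  # per-receiver connection identifier
        "psn": psn_word & 0xFFFFFF,       # per-connection sequence number
    }
```

Because Destination QPN and PSN are connection-local, a verbatim copy of an RC packet is meaningful only to the one receiver whose QP it names; this is the gap End.MT closes by rewriting these fields per copy.¶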
Existing application-layer approaches in distributed frameworks (MPI, NCCL, Spark) address this limitation in two ways: by opening separate RC QP connections to each receiver, which results in source bandwidth consumption proportional to the number of receivers, or by constructing application-layer relay topologies (trees or rings), which introduce per-hop host-stack traversal latency and additional memory-copy overhead at relay nodes.¶
This document specifies SRv6 extensions that bridge the gap between RDMA RC point-to-point semantics and IP multicast one-to-many delivery. Edge nodes of the multicast tree execute a new SRv6 endpoint behavior (End.MT) that rewrites per-receiver RDMA BTH fields in replicated packet copies. Intermediate and edge nodes aggregate reverse-path RDMA response messages (ACK, NACK, CNP) before they reach the source. RDMA RC endpoints are not required to implement any multicast-specific extensions.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document uses the following terms. Familiarity with SRv6 terminology from [RFC8402], [RFC8754], and [RFC8986] is assumed.¶
The extensions specified in this document apply to networks that meet all of the following conditions:¶
These extensions do not modify RDMA endpoint behavior. Hosts run unmodified RoCEv2 protocol stacks and establish standard RC QP connections. All multicast-related packet transformations occur within the network.¶
This specification defines the following network roles:¶
                  S1
                  |
                  N6 (transit)
                 /  \
  (transit)    N4    N5    (transit)
              /  \     \
  (edge)    N1    N2    N3    (edge)
           /  \   |    /  \
          R1  R2  R3  R4  R5
In Figure 1, S1 is the multicast source. R1 through R5 are multicast receivers. N1, N2, and N3 are edge nodes executing the End.MT behavior. N4, N5, and N6 are transit nodes performing IP multicast replication.¶
Prior to data transmission, a multicast group MUST be established as follows:¶
The source-to-receiver data path operates as follows:¶
When a packet arrives at an edge node whose local End.MT SID matches the IPv6 Destination Address, the edge node MUST execute the End.MT behavior (Section 6):¶
Receivers generate three types of response messages toward the source: ACK (acknowledgment of successful reception), NACK (request for retransmission), and CNP (congestion notification). These responses MUST be aggregated hop-by-hop at intermediate nodes before reaching the source, so that the source's retransmission and rate control logic operates correctly without multicast-specific modifications.¶
The source MUST receive an AckPSN value satisfying the following invariant: for every receiver Ri and every packet with PSN less than or equal to AckPSN, Ri has confirmed successful reception.¶
Each intermediate node (edge or transit) MUST maintain a record of the most recent AckPSN reported by each downstream branch. When an ACK is received from a downstream branch, the node MUST update the stored AckPSN for that branch. The node MUST forward an ACK upstream carrying an AckPSN equal to the minimum of all downstream branches' stored AckPSN values. The node adjacent to the source (N6 in Figure 1) MUST write this minimum value into the AETH AckPSN field of the ACK forwarded to the source.¶
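The minimum-tracking procedure above can be sketched as follows. The class and branch names are illustrative, and PSNs are modeled as plain integers; a real implementation must compare PSNs modulo 2^24 to handle wraparound, which is omitted here.¶

```python
# Sketch of per-node ACK aggregation: track the most recent AckPSN
# reported by each downstream branch and report the minimum upstream.

class AckAggregator:
    def __init__(self, branches):
        # No ACK yet from a branch -> nothing confirmed (-1 sentinel).
        self.ack_psn = {b: -1 for b in branches}

    def on_ack(self, branch, psn):
        """Record an ACK from a downstream branch; AckPSN only advances."""
        self.ack_psn[branch] = max(self.ack_psn[branch], psn)
        return self.upstream_ack_psn()

    def upstream_ack_psn(self):
        """AckPSN to report upstream: the minimum across all branches, so
        every receiver below this node has confirmed up to that PSN."""
        return min(self.ack_psn.values())

agg = AckAggregator(["N1", "N2", "N3"])
agg.on_ack("N1", 120)
agg.on_ack("N2", 100)
agg.on_ack("N3", 110)
# The ACK forwarded upstream carries min(120, 100, 110) = 100
```

Taking the minimum at every node preserves the invariant of the preceding paragraph end to end: the value reaching the source is the minimum over all receivers in the group.¶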
When a receiver detects missing packets, it sends a NACK containing an expected PSN (ePSN) indicating the start of the retransmission range. The source MUST receive an ePSN satisfying the following invariant: for every receiver Ri, all packets with PSN less than ePSN have been successfully received by Ri.¶
Each intermediate node MUST maintain a per-branch record of ePSN values. For branches that have sent only ACKs (no NACK), the effective ePSN SHOULD be treated as AckPSN + 1. The NACK forwarded upstream MUST carry the minimum ePSN across all downstream branches.¶
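The NACK aggregation rule, including the AckPSN + 1 convention for ACK-only branches, can be sketched as follows; class and branch names are illustrative and PSN wraparound handling is omitted.¶

```python
# Sketch of per-node NACK (ePSN) aggregation. A branch that has sent
# only ACKs is treated as having an effective ePSN of AckPSN + 1.

class NackAggregator:
    def __init__(self, branches):
        self.ack_psn = {b: -1 for b in branches}  # last AckPSN per branch
        self.epsn = {b: None for b in branches}   # last NACK ePSN per branch

    def on_ack(self, branch, psn):
        self.ack_psn[branch] = max(self.ack_psn[branch], psn)

    def on_nack(self, branch, epsn):
        self.epsn[branch] = epsn
        return self.upstream_epsn()

    def effective_epsn(self, branch):
        # ACK-only branch: everything through AckPSN was received, so
        # the first PSN not yet confirmed is AckPSN + 1.
        e = self.epsn[branch]
        return e if e is not None else self.ack_psn[branch] + 1

    def upstream_epsn(self):
        # Forward the minimum ePSN so the source retransmits from the
        # earliest PSN that any downstream receiver is missing.
        return min(self.effective_epsn(b) for b in self.epsn)

agg = NackAggregator(["N1", "N2"])
agg.on_ack("N1", 50)   # N1 has everything through PSN 50
agg.on_nack("N2", 42)  # N2 is missing packets starting at PSN 42
# The NACK forwarded upstream carries min(51, 42) = 42
```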
Each intermediate node MUST maintain a per-branch counter (CCount) that records the number of CNP messages received from each downstream branch within a configurable time window T.¶
At the expiration of each time window T, the node MUST select the branch with the highest CCount value and forward a single CNP upstream representing the most congested downstream path. All CCount values MUST be reset to zero at the start of each new time window.¶
The time window T MAY be adjusted dynamically based on observed network conditions. The node adjacent to the source MUST rewrite CNP packet headers so that the source processes the CNP as a standard RoCEv2 congestion notification.¶
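The windowed CNP selection described above can be sketched as follows. Timer wiring is implementation-specific, so window expiry is modeled as an explicit call; class and branch names are illustrative.¶

```python
# Sketch of windowed CNP aggregation: count CNPs per downstream branch
# within window T, then forward one CNP for the most congested branch.

class CnpAggregator:
    def __init__(self, branches):
        self.ccount = {b: 0 for b in branches}

    def on_cnp(self, branch):
        """Count a CNP received from a downstream branch in this window."""
        self.ccount[branch] += 1

    def on_window_expiry(self):
        """At the end of window T, select the branch with the highest
        CCount, forward a single CNP upstream on its behalf, and reset
        all counters for the next window."""
        busiest = max(self.ccount, key=self.ccount.get)
        forwarded = busiest if self.ccount[busiest] > 0 else None
        for b in self.ccount:
            self.ccount[b] = 0
        return forwarded

agg = CnpAggregator(["N1", "N2", "N3"])
for _ in range(3):
    agg.on_cnp("N2")
agg.on_cnp("N1")
# Window expiry forwards a single CNP representing branch N2
```

Forwarding at most one CNP per window both dampens congestion signaling toward the source and bounds the reverse-path CNP rate independently of the receiver count.¶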
End.MT is an SRv6 endpoint behavior instantiated at edge nodes of the RDMA multicast tree. When a node N receives a packet whose IPv6 Destination Address matches a locally instantiated End.MT SID, N performs the processing described in Section 6.2.¶
End.MT combines the following operations: SRH segment processing, End.MT TLV parsing, per-receiver packet replication, BTH Destination QPN replacement, IPv6 Destination Address replacement, and ICRC recomputation.¶
The following pseudocode follows the conventions of [RFC8986] Section 4.¶
When N receives a packet destined to S, where S is a local
End.MT SID, N does:
S01. If NH=SRH and SL > 0 {
S02.    Decrement SL
S03.    Update the IPv6 DA with SRH[SL]              ;; Ref1
S04.    Parse the End.MT TLV associated with S
S05.    Let RecvList = list of (IPv6_Addr, QPN) from TLV
S06.    For each entry (Addr_i, QPN_i) in RecvList {
S07.       Copy the packet                           ;; Ref2
S08.       In the copy, set IPv6 DA = Addr_i
S09.       In the copy, set BTH.DestQPN = QPN_i
S10.       Recompute ICRC over the modified headers
S11.       Forward the copy based on Addr_i          ;; Ref3
S12.    }
S13. } Else {
S14.    Drop the packet                              ;; Ref4
S15. }
Ref1: Standard SRH processing per RFC 8754 Section 4.3.1.1.
Ref2: Each copy is a complete replica of the packet,
      including the outer IPv6 header, the SRH, and the
      inner RDMA frame (BTH, payload, and ICRC).
Ref3: FIB lookup on Addr_i determines the outgoing
interface.
Ref4: A packet arriving with SL=0 or without SRH is not
valid for End.MT processing.
This specification uses the standard IPv6 Segment Routing Header (SRH) as defined in [RFC8754]. The SRH carries the End.MT SID in its Segment List and the End.MT TLV in its Optional TLV field.¶
The End.MT TLV is carried in the Optional TLV field of the SRH and conveys per-edge-node receiver information. Its format is as follows:¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |    Length     |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Edge Node Address (128 bits IPv6)                |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Num Receivers |                   Reserved                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Receiver 1 Address (128 bits IPv6)               |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Receiver 1 QPN (24 bits)            |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                              ...                              ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Receiver N Address (128 bits IPv6)               |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Receiver N QPN (24 bits)            |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields are defined as follows:¶
Multiple End.MT TLVs MAY be present in a single SRH, one per edge node in the multicast tree. Each End.MT TLV is associated with the End.MT SID of the corresponding edge node.¶
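An encoder for the TLV format above can be sketched as follows. The Type value is TBD2 and is therefore passed in as a parameter; the Length semantics assumed here (variable-length data in octets, excluding the Type and Length octets, per the RFC 8754 TLV convention) and the function name are assumptions for illustration.¶

```python
# Sketch of End.MT TLV encoding following the format diagram:
# 128-bit IPv6 addresses, 8-bit Num Receivers, 24-bit QPNs in the
# high bits of a 32-bit word, and zeroed Reserved fields.
import ipaddress
import struct

def encode_end_mt_tlv(tlv_type, edge_node, receivers):
    """receivers: list of (ipv6_address_str, qpn) tuples."""
    body = ipaddress.IPv6Address(edge_node).packed       # Edge Node Address
    body += struct.pack("!B3x", len(receivers))          # Num Receivers + Reserved
    for addr, qpn in receivers:
        body += ipaddress.IPv6Address(addr).packed       # Receiver i Address
        body += struct.pack("!I", qpn << 8)              # 24-bit QPN + Reserved
    # Length covers the 2 Reserved octets plus the body (assumed
    # RFC 8754 semantics: excludes the Type and Length octets).
    return struct.pack("!BB2x", tlv_type, len(body) + 2) + body
```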
Each intermediate node (both edge and transit) participating in reverse-path response aggregation MUST maintain the following per-multicast-group state:¶
The amount of state is proportional to the number of downstream branches at each node, not to the total number of receivers in the multicast group. Edge nodes additionally maintain the receiver information (addresses and QPNs) learned from the End.MT TLV or from control-plane provisioning.¶
The security considerations of [RFC8754] and [RFC8986] apply to all SRv6 aspects of this specification. The following additional considerations are specific to RDMA multicast delivery.¶
The End.MT TLV carries receiver IPv6 addresses and QPNs in the SRH. An on-path attacker able to read SRH contents can obtain receiver topology and RDMA connection identifiers. Implementations operating outside a single administrative trust domain SHOULD protect SRH integrity and confidentiality using the HMAC TLV defined in Section 7 of [RFC8754] or IPsec Encapsulating Security Payload (ESP) encapsulation.¶
Intermediate nodes maintain per-branch ACK/NACK/CNP aggregation state. An attacker injecting forged response messages could corrupt this state, causing the source to prematurely consider data as acknowledged (via inflated AckPSN) or to trigger unnecessary retransmissions (via forged NACKs). Nodes SHOULD validate that reverse-path response messages originate from addresses within the expected downstream receiver set. BCP 38 [RFC2827] ingress filtering SHOULD be applied at network boundaries.¶
An attacker injecting a high volume of forged CNP messages could force the source into continuous rate reduction, creating a denial-of-service condition. Intermediate nodes SHOULD implement per-branch CNP rate limiting. The configurable time window T for CNP aggregation provides an inherent dampening effect.¶
If End.MT TLV contents are modified in transit, packets could be delivered to incorrect RDMA QPs, resulting in data corruption or information disclosure at unintended receivers. The SRH HMAC TLV [RFC8754] provides integrity protection for this purpose. Edge nodes SHOULD verify HMAC before processing End.MT TLVs when operating across trust domain boundaries.¶
This document requests IANA to allocate a new codepoint in the "SRv6 Endpoint Behaviors" sub-registry under the "Segment Routing" registry group [RFC8986]:¶
| Value | Hex | Endpoint Behavior | Reference | Change Controller |
|---|---|---|---|---|
| TBD1 | TBD1 | End.MT | [This document] | IETF |
This document requests IANA to allocate a new Type value in the "Segment Routing Header TLVs" registry [RFC8754]:¶
| Value | Description | Reference |
|---|---|---|
| TBD2 | End.MT TLV | [This document] |
The authors thank the members of the SPRING and RTGWG working groups for their review and feedback.¶