| Internet-Draft | RDMA Multicast over SRv6 | February 2026 |
| Li, et al. | Expires 1 September 2026 | [Page] |
This document specifies SRv6 (Segment Routing over IPv6) extensions for multicast delivery of RDMA (Remote Direct Memory Access) Reliable Connection (RC) traffic. It defines a new SRv6 endpoint behavior, End.MT, that performs per-receiver RDMA Base Transport Header (BTH) modifications at edge nodes of the multicast tree. It also specifies procedures for hop-by-hop aggregation of RDMA ACK, NACK, and CNP response messages along the reverse path. Together, these extensions allow RDMA RC endpoints to communicate using standard point-to-point Queue Pair (QP) semantics while the network distributes data packets over an IP multicast tree. Target deployment scenarios include multi-replica distributed storage writes, HPC collective communications, AI training parameter distribution, and large-scale inference KV cache distribution.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 1 September 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Large-scale distributed computing deployments, including data center interconnection, distributed AI training and inference, and national-scale computing networks, rely on high-throughput data transport. RDMA (Remote Direct Memory Access) provides kernel-bypass data transfer with low CPU overhead on both sending and receiving hosts. RDMA one-sided operations, where receive buffers are pre-registered at the receiver's Network Interface Card (NIC), further reduce CPU involvement at the receiving end.¶
Many distributed applications exhibit one-to-many traffic patterns, including multi-replica storage writes, HPC collective communications (broadcast, scatter), AI training parameter distribution, and KV cache distribution in inference pipelines. IP multicast delivery of such traffic can reduce total network bandwidth consumption compared to per-receiver unicast replication at the source.¶
The RDMA Reliable Connection (RC) transport mode is the most widely adopted RDMA mode because it supports the complete set of RDMA operations: Read, Write, and Atomic. However, each RC Queue Pair (QP) is a point-to-point association between exactly one sending QP and one receiving QP. RC packets carry per-connection identifiers in the Base Transport Header (BTH), specifically the Destination Queue Pair Number (QPN) and Packet Sequence Number (PSN). These per-connection fields prevent direct application of IP multicast replication to RC traffic because each receiver requires its own QPN and independently tracks PSN state.¶
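The role of the two per-connection BTH fields can be illustrated with a short parsing sketch. The 12-octet BTH layout (OpCode, flags, P_Key, then Destination QPN and PSN each in the low 24 bits of a 32-bit word) follows the InfiniBand Architecture Specification; the function name and returned keys are illustrative, not part of this specification.¶

```python
# Illustration of the per-connection BTH fields that prevent naive
# IP multicast replication of RC traffic. Layout per the InfiniBand
# Architecture Specification: BTH is 12 octets.
import struct

def parse_bth(bth: bytes) -> dict:
    opcode, flags, pkey, dqp_word, psn_word = struct.unpack("!BBHII", bth)
    return {
        "opcode": opcode,
        "dest_qpn": dqp_word & 0xFFFFFF,  # per-receiver connection identifier
        "psn": psn_word & 0xFFFFFF,       # per-connection sequence number
    }
```

Because Destination QPN and PSN are connection-local, a verbatim copy of an RC packet is meaningful only to the one receiver whose QP it names; this is the gap End.MT closes by rewriting these fields per copy.¶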
Existing application-layer approaches in distributed frameworks (MPI, NCCL, Spark) address this limitation in two ways: by opening separate RC QP connections to each receiver, which results in source bandwidth consumption proportional to the number of receivers, or by constructing application-layer relay topologies (trees or rings), which introduce per-hop host-stack traversal latency and additional memory-copy overhead at relay nodes.¶
This document specifies SRv6 extensions that bridge the gap between RDMA RC point-to-point semantics and IP multicast one-to-many delivery. Edge nodes of the multicast tree execute a new SRv6 endpoint behavior (End.MT) that rewrites per-receiver RDMA BTH fields in replicated packet copies. Intermediate and edge nodes aggregate reverse-path RDMA response messages (ACK, NACK, CNP) before they reach the source. RDMA RC endpoints are not required to implement any multicast-specific extensions.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document uses the following terms. Familiarity with SRv6 terminology from [RFC8402], [RFC8754], and [RFC8986] is assumed.¶
The extensions specified in this document apply to networks that meet all of the following conditions:¶
These extensions do not modify RDMA endpoint behavior. Hosts run unmodified RoCEv2 protocol stacks and establish standard RC QP connections. All multicast-related packet transformations occur within the network.¶
This specification defines the following network roles:¶
                  S1
                  |
                  N6 (transit)
                 /  \
  (transit)    N4    N5    (transit)
              /  \     \
  (edge)    N1    N2    N3    (edge)
           /  \   |    /  \
          R1  R2  R3  R4  R5
In Figure 1, S1 is the multicast source. R1 through R5 are multicast receivers. N1, N2, and N3 are edge nodes executing the End.MT behavior. N4, N5, and N6 are transit nodes performing IP multicast replication.¶
Prior to data transmission, a multicast group MUST be established as follows:¶
The source-to-receiver data path operates as follows:¶
When a packet arrives at an edge node whose local End.MT SID matches the IPv6 Destination Address, the edge node MUST execute the End.MT behavior (Section 6):¶
Receivers generate three types of response messages toward the source: ACK (acknowledgment of successful reception), NACK (request for retransmission), and CNP (congestion notification). These responses MUST be aggregated hop-by-hop at intermediate nodes before reaching the source, so that the source's retransmission and rate control logic operates correctly without multicast-specific modifications.¶
The source MUST receive an AckPSN value satisfying the following invariant: for every receiver Ri and every packet with PSN less than or equal to AckPSN, Ri has confirmed successful reception.¶
Each intermediate node (edge or transit) MUST maintain a record of the most recent AckPSN reported by each downstream branch. When an ACK is received from a downstream branch, the node MUST update the stored AckPSN for that branch. The node MUST forward an ACK upstream carrying an AckPSN equal to the minimum of all downstream branches' stored AckPSN values. The node adjacent to the source (N6 in Figure 1) MUST write this minimum value into the AETH AckPSN field of the ACK forwarded to the source.¶
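The minimum-tracking procedure above can be sketched as follows. The class and branch names are illustrative, and PSNs are modeled as plain integers; a real implementation must compare PSNs modulo 2^24 to handle wraparound, which is omitted here.¶

```python
# Sketch of per-node ACK aggregation: track the most recent AckPSN
# reported by each downstream branch and report the minimum upstream.

class AckAggregator:
    def __init__(self, branches):
        # No ACK yet from a branch -> nothing confirmed (-1 sentinel).
        self.ack_psn = {b: -1 for b in branches}

    def on_ack(self, branch, psn):
        """Record an ACK from a downstream branch; AckPSN only advances."""
        self.ack_psn[branch] = max(self.ack_psn[branch], psn)
        return self.upstream_ack_psn()

    def upstream_ack_psn(self):
        """AckPSN to report upstream: the minimum across all branches, so
        every receiver below this node has confirmed up to that PSN."""
        return min(self.ack_psn.values())

agg = AckAggregator(["N1", "N2", "N3"])
agg.on_ack("N1", 120)
agg.on_ack("N2", 100)
agg.on_ack("N3", 110)
# The ACK forwarded upstream carries min(120, 100, 110) = 100
```

Taking the minimum at every node preserves the invariant of the preceding paragraph end to end: the value reaching the source is the minimum over all receivers in the group.¶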
When a receiver detects missing packets, it sends a NACK containing an expected PSN (ePSN) indicating the start of the retransmission range. The source MUST receive an ePSN satisfying the following invariant: for every receiver Ri, all packets with PSN less than ePSN have been successfully received by Ri.¶
Each intermediate node MUST maintain a per-branch record of ePSN values. For branches that have sent only ACKs (no NACK), the effective ePSN SHOULD be treated as AckPSN + 1. The NACK forwarded upstream MUST carry the minimum ePSN across all downstream branches.¶
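The NACK aggregation rule, including the AckPSN + 1 convention for ACK-only branches, can be sketched as follows; class and branch names are illustrative and PSN wraparound handling is omitted.¶

```python
# Sketch of per-node NACK (ePSN) aggregation. A branch that has sent
# only ACKs is treated as having an effective ePSN of AckPSN + 1.

class NackAggregator:
    def __init__(self, branches):
        self.ack_psn = {b: -1 for b in branches}  # last AckPSN per branch
        self.epsn = {b: None for b in branches}   # last NACK ePSN per branch

    def on_ack(self, branch, psn):
        self.ack_psn[branch] = max(self.ack_psn[branch], psn)

    def on_nack(self, branch, epsn):
        self.epsn[branch] = epsn
        return self.upstream_epsn()

    def effective_epsn(self, branch):
        # ACK-only branch: everything through AckPSN was received, so
        # the first PSN not yet confirmed is AckPSN + 1.
        e = self.epsn[branch]
        return e if e is not None else self.ack_psn[branch] + 1

    def upstream_epsn(self):
        # Forward the minimum ePSN so the source retransmits from the
        # earliest PSN that any downstream receiver is missing.
        return min(self.effective_epsn(b) for b in self.epsn)

agg = NackAggregator(["N1", "N2"])
agg.on_ack("N1", 50)   # N1 has everything through PSN 50
agg.on_nack("N2", 42)  # N2 is missing packets starting at PSN 42
# The NACK forwarded upstream carries min(51, 42) = 42
```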
Each intermediate node MUST maintain a per-branch counter (CCount) that records the number of CNP messages received from each downstream branch within a configurable time window T.¶
At the expiration of each time window T, the node MUST select the branch with the highest CCount value and forward a single CNP upstream representing the most congested downstream path. All CCount values MUST be reset to zero at the start of each new time window.¶
The time window T MAY be adjusted dynamically based on observed network conditions. The node adjacent to the source MUST rewrite CNP packet headers so that the source processes the CNP as a standard RoCEv2 congestion notification.¶
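The windowed CNP selection described above can be sketched as follows. Timer wiring is implementation-specific, so window expiry is modeled as an explicit call; class and branch names are illustrative.¶

```python
# Sketch of windowed CNP aggregation: count CNPs per downstream branch
# within window T, then forward one CNP for the most congested branch.

class CnpAggregator:
    def __init__(self, branches):
        self.ccount = {b: 0 for b in branches}

    def on_cnp(self, branch):
        """Count a CNP received from a downstream branch in this window."""
        self.ccount[branch] += 1

    def on_window_expiry(self):
        """At the end of window T, select the branch with the highest
        CCount, forward a single CNP upstream on its behalf, and reset
        all counters for the next window."""
        busiest = max(self.ccount, key=self.ccount.get)
        forwarded = busiest if self.ccount[busiest] > 0 else None
        for b in self.ccount:
            self.ccount[b] = 0
        return forwarded

agg = CnpAggregator(["N1", "N2", "N3"])
for _ in range(3):
    agg.on_cnp("N2")
agg.on_cnp("N1")
# Window expiry forwards a single CNP representing branch N2
```

Forwarding at most one CNP per window both dampens congestion signaling toward the source and bounds the reverse-path CNP rate independently of the receiver count.¶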
End.MT is an SRv6 endpoint behavior instantiated at edge nodes of the RDMA multicast tree. When a node N receives a packet whose IPv6 Destination Address matches a locally instantiated End.MT SID, N performs the processing described in Section 6.2.¶
End.MT combines the following operations: SRH segment processing, End.MT TLV parsing, per-receiver packet replication, BTH Destination QPN replacement, IPv6 Destination Address replacement, and ICRC recomputation.¶
The following pseudocode follows the conventions of [RFC8986] Section 4.¶
When N receives a packet destined to S, where S is a local
End.MT SID, N does:
S01. If NH=SRH and SL > 0 {
S02.    Decrement SL
S03.    Update the IPv6 DA with SRH[SL]              ;; Ref1
S04.    Parse the End.MT TLV associated with S
S05.    Let RecvList = list of (IPv6_Addr, QPN) from TLV
S06.    For each entry (Addr_i, QPN_i) in RecvList {
S07.       Copy the packet                           ;; Ref2
S08.       In the copy, set IPv6 DA = Addr_i
S09.       In the copy, set BTH.DestQPN = QPN_i
S10.       Recompute ICRC over the modified headers
S11.       Forward the copy based on Addr_i          ;; Ref3
S12.    }
S13. } Else {
S14.    Drop the packet                              ;; Ref4
S15. }
Ref1: Standard SRH processing per RFC 8754 Section 4.3.1.1.
Ref2: Each copy is a complete replica of the packet,
      including the outer IPv6 header, the SRH, and the
      inner RDMA frame (BTH, payload, and ICRC).
Ref3: FIB lookup on Addr_i determines the outgoing
interface.
Ref4: A packet arriving with SL=0 or without SRH is not
valid for End.MT processing.
This specification uses the standard IPv6 Segment Routing Header (SRH) as defined in [RFC8754]. The SRH carries the End.MT SID in its Segment List and the End.MT TLV in its Optional TLV field.¶
The End.MT TLV is carried in the Optional TLV field of the SRH and conveys per-edge-node receiver information. Its format is as follows:¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |    Length     |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Edge Node Address (128 bits IPv6)                |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Num Receivers |                   Reserved                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Receiver 1 Address (128 bits IPv6)               |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Receiver 1 QPN (24 bits)            |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                              ...                              ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Receiver N Address (128 bits IPv6)               |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Receiver N QPN (24 bits)            |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields are defined as follows:¶
Multiple End.MT TLVs MAY be present in a single SRH, one per edge node in the multicast tree. Each End.MT TLV is associated with the End.MT SID of the corresponding edge node.¶
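An encoder for the TLV format above can be sketched as follows. The Type value is TBD2 and is therefore passed in as a parameter; the Length semantics assumed here (variable-length data in octets, excluding the Type and Length octets, per the RFC 8754 TLV convention) and the function name are assumptions for illustration.¶

```python
# Sketch of End.MT TLV encoding following the format diagram:
# 128-bit IPv6 addresses, 8-bit Num Receivers, 24-bit QPNs in the
# high bits of a 32-bit word, and zeroed Reserved fields.
import ipaddress
import struct

def encode_end_mt_tlv(tlv_type, edge_node, receivers):
    """receivers: list of (ipv6_address_str, qpn) tuples."""
    body = ipaddress.IPv6Address(edge_node).packed       # Edge Node Address
    body += struct.pack("!B3x", len(receivers))          # Num Receivers + Reserved
    for addr, qpn in receivers:
        body += ipaddress.IPv6Address(addr).packed       # Receiver i Address
        body += struct.pack("!I", qpn << 8)              # 24-bit QPN + Reserved
    # Length covers the 2 Reserved octets plus the body (assumed
    # RFC 8754 semantics: excludes the Type and Length octets).
    return struct.pack("!BB2x", tlv_type, len(body) + 2) + body
```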
Each intermediate node (both edge and transit) participating in reverse-path response aggregation MUST maintain the following per-multicast-group state:¶
The amount of state is proportional to the number of downstream branches at each node, not to the total number of receivers in the multicast group. Edge nodes additionally maintain the receiver information (addresses and QPNs) learned from the End.MT TLV or from control-plane provisioning.¶
The security considerations of [RFC8754] and [RFC8986] apply to all SRv6 aspects of this specification. The following additional considerations are specific to RDMA multicast delivery.¶
The End.MT TLV carries receiver IPv6 addresses and QPNs in the SRH. An on-path attacker able to read SRH contents can obtain receiver topology and RDMA connection identifiers. Implementations operating outside a single administrative trust domain SHOULD protect SRH integrity and confidentiality using the HMAC TLV defined in Section 7 of [RFC8754] or IPsec Encapsulating Security Payload (ESP) encapsulation.¶
Intermediate nodes maintain per-branch ACK/NACK/CNP aggregation state. An attacker injecting forged response messages could corrupt this state, causing the source to prematurely consider data as acknowledged (via inflated AckPSN) or to trigger unnecessary retransmissions (via forged NACKs). Nodes SHOULD validate that reverse-path response messages originate from addresses within the expected downstream receiver set. BCP 38 [RFC2827] ingress filtering SHOULD be applied at network boundaries.¶
An attacker injecting a high volume of forged CNP messages could force the source into continuous rate reduction, creating a denial-of-service condition. Intermediate nodes SHOULD implement per-branch CNP rate limiting. The configurable time window T for CNP aggregation provides an inherent dampening effect.¶
If End.MT TLV contents are modified in transit, packets could be delivered to incorrect RDMA QPs, resulting in data corruption or information disclosure at unintended receivers. The SRH HMAC TLV [RFC8754] provides integrity protection for this purpose. Edge nodes SHOULD verify HMAC before processing End.MT TLVs when operating across trust domain boundaries.¶
This document requests IANA to allocate a new codepoint in the "SRv6 Endpoint Behaviors" sub-registry under the "Segment Routing" registry group [RFC8986]:¶
| Value | Hex | Endpoint Behavior | Reference | Change Controller |
|---|---|---|---|---|
| TBD1 | TBD1 | End.MT | [This document] | IETF |
This document requests IANA to allocate a new Type value in the "Segment Routing Header TLVs" registry [RFC8754]:¶
| Value | Description | Reference |
|---|---|---|
| TBD2 | End.MT TLV | [This document] |
The authors thank the members of the SPRING and RTGWG working groups for their review and feedback.¶