Internet-Draft AI DCI Adaptive LB July 2026
Li, et al. Expires 4 January 2027
Workgroup:
srv6ops
Internet-Draft:
draft-lll-srv6ops-dci-srv6-lb-00
Published:
Intended Status:
Informational
Expires:
Authors:
J. Li
China Mobile
Y. Liu
China Mobile
C. Lin
New H3C Technologies
Q. Xiong
ZTE Corporation
K. Zhang
Huawei Technologies

SRv6-based Adaptive Load Balancing for AI DCI

Abstract

This document describes an SRv6-based adaptive load balancing architecture for AI Data Center Interconnection (DCI) scenarios, where RoCEv2 elephant flows traverse WAN between storage and compute sites under the storage-compute separation paradigm. The architecture employs a controller-driven closed loop: telemetry-based flow and path monitoring, SL-level imbalance detection, and BGP Flowspec-based steering with QP-level matching granularity and Segment List-level action precision. This supplements the default QP-aware hash-based SL selection with dynamic, explicit flow steering to resolve hash collisions and persistent load imbalance.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 4 January 2027.

1. Introduction

The rapid growth of AI large-model training has driven the adoption of RDMA over Converged Ethernet v2 (RoCEv2) in Data Center fabrics for high-performance, low-latency communication between GPU servers. When AI training workloads are distributed across geographically separated sites — a deployment pattern known as storage-compute separation — RoCEv2 traffic must traverse the Wide Area Network (WAN) for Data Center Interconnection (DCI).

These cross-site AI flows exhibit characteristics fundamentally different from traditional Internet traffic: sustained multi-Gbps throughput, long-lived connections bound to stable Queue Pair (QP) identifiers, and strict loss sensitivity. Traditional Equal-Cost Multi-Path (ECMP) load balancing, which relies on static 5-tuple hashing, provides insufficient entropy for such flows. Multiple elephant flows frequently hash to the same path, creating persistent hotspots and underutilizing available bandwidth.

Segment Routing over IPv6 (SRv6) [RFC8986] with SRv6 Policy [RFC9256] provides explicit path programming: an ingress device can steer traffic into a Policy containing multiple candidate Segment Lists (SLs), each representing a distinct path through the WAN. However, two limitations remain:

This document describes a controller-driven adaptive load balancing architecture that addresses both limitations. The controller continuously monitors flow-level and path-level state via Telemetry, detects SL-level load imbalance, and enforces corrective steering via extended BGP Flowspec with QP-level matching and SL-level redirect precision. This dynamic mechanism operates as a supplement to the default hash-based SL selection, resolving hash collisions and persistent imbalance in real time.

2. Use Case: Storage-Compute Separation for AI DCI

2.1. Scenario

In many enterprise AI training scenarios, training sample data constitutes core intellectual property or contains sensitive information. Enterprises prefer to keep data on-premises rather than uploading it to a remote smart computing center for persistent storage. Beyond data privacy concerns, the construction and ongoing maintenance of large-scale smart computing centers — including GPU clusters, high-performance storage, and associated power and cooling infrastructure — represent substantial capital and operational expenditure. Many enterprises find it impractical to co-locate both data and compute at such facilities.

The storage-compute separation paradigm addresses both concerns by streaming sample data from enterprise local storage to remote GPU servers in real time via encrypted channels. Data is loaded directly into GPU memory (VRAM) for iterative training without being persisted at the remote site, reducing both the data exposure surface and the enterprise’s infrastructure investment.

This pattern generates sustained RoCEv2 elephant flows across the WAN: multi-Gbps bandwidth consumption, transfer durations of hours to days, and strict requirements for lossless delivery, because packet loss stalls the transfer and leaves GPUs idle.

2.2. Traffic Characteristics

The WAN in a storage-compute separation deployment carries:

  • Elephant flows: RoCEv2-based sample data transfers, each bound to one or more QPs, consuming Gbps-level bandwidth continuously.

  • Mixed traffic: Operational traffic (management, monitoring, conventional IP services) coexisting with elephant flows on the same WAN infrastructure.

The coexistence of these traffic types creates load balancing challenges: elephant flows dominate bandwidth and are prone to path polarization under static hashing, while the small (mice) flows in the mixed traffic are sensitive to latency spikes caused by congestion from mis-scheduled elephants.

3. Reference Topology

A typical deployment topology is shown in Figure 1.

                                  +------------+
                                  | Controller |
                                  +------+-----+
                               (Telemetry + BGP FS)
                                         |
                +------------------------------------------------+
                |                      +---+                     |
                |                 +----| P1|---+                 |
                |                 |    +---+   |                 |
   +------------+                 |    +---+   |                 +------------+
   | Enterprise |                 +----| P2|---+                 |   Smart    |
   |    DC      |                 |    +---+   |                 | Computing  |
   |           +---+    +--+--+   |    +---+   |   +--+--+    +---+ Center    |
   | +--------+|GW |----| PE1 |---+----| P3|---+---| PE2 |----|GW |+--------+ |
   | |Storage |+---+    +--+--+   |    +---+   |   +--+--+    +---+|  GPU   | |
   | |Servers | |                 |    +---+   |                 | |Servers | |
   | +--------+ |                 +----| P4|---+                 | +---+----+ |
   +------------+                 |    +---+   |                 +------------+
                |                 |    +---+   |                 |
                |                 +----| P5|---+                 |
                |                      +---+                     |
                +------------------------------------------------+

Figure 1: Reference Topology for AI DCI

The key network elements are:

GW (AI DCI Gateway)

Deployed at the DC-WAN boundary. Responsible for RoCEv2 packet inspection, elephant flow identification, Telemetry reporting to the controller, and executing Flowspec steering policies. GW also acts as the SRv6 Policy headend.

P (Provider)

Transit routers along SRv6 paths. Each PE-to-PE path through a distinct set of P routers constitutes one Segment List.

Controller

Subscribes to Telemetry streams from GWs and path-state feeds from PE/P devices. Computes scheduling decisions and distributes steering policies via BGP Flowspec to GWs. Manages SRv6 path programming across PE and P devices.

4. Controller-driven Adaptive Load Balancing

The adaptive load balancing architecture operates as a closed loop with three phases: Monitoring, Decision, and Enforcement. A fourth component — QP-aware hash-based SL selection — serves as the default path selection mechanism, with Flowspec-based steering providing dynamic correction when hash-based selection produces imbalances.

The overall pipeline is illustrated in Figure 2.

  +-------+   Telemetry    +------------+   BGP Flowspec   +-------+
  |  GW   |===============>| Controller |=================>|  GW   |
  |       |  (flow info)   |            |  (QP -> SL)      |       |
  +-------+                +-----+------+                  +-------+
                                 |
                           Path state
                           subscription
                                 |
                       +---------+---------+
                       |    PE / P nodes   |
                       +-------------------+

Figure 2: Controller-driven Closed-loop Pipeline

4.1. Monitoring: Telemetry-based Flow and Path State Collection

The AI DCI Gateway (GW) continuously identifies elephant flows by monitoring per-flow bandwidth. Flows exceeding a configured threshold within a measurement period are classified as elephant flows. For each identified elephant flow, the gateway extracts key attributes including outer IPv6 addresses, Flow Label, and, critically, the inner RoCEv2 Queue Pair identifier carried in the InfiniBand Base Transport Header (BTH).

The gateway reports elephant flow information to the controller via a Telemetry stream. The YANG model for elephant flow reporting includes per-flow packet and byte counters (enabling rate computation), SRv6 Policy and Segment List association, and inner header fields including the RoCEv2 QP identifier. The detailed YANG model definition is beyond the scope of this document.

In parallel, the controller subscribes to real-time path state from PE and P devices along each SRv6 path, including per-Segment-List link utilization, latency, and loss metrics. This provides the controller with a complete view of both demand (flow-level) and supply (path-level capacity).
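As an illustration of how the per-flow byte counters support rate computation and elephant classification, the following is a minimal Python sketch. The record layout (FlowSample), the function names, and the 1 Gbit/s threshold are assumptions made for this example; they are not taken from the YANG model, which is out of scope for this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowSample:
    """One Telemetry sample for a flow (hypothetical field names)."""
    dest_qp: int        # RoCEv2 Destination QP of the flow
    timestamp_s: float  # time the sample was taken, in seconds
    byte_count: int     # cumulative bytes observed for the flow


def rate_bps(prev: FlowSample, curr: FlowSample) -> float:
    """Average rate in bit/s between two samples of the same flow."""
    interval = curr.timestamp_s - prev.timestamp_s
    if interval <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.byte_count - prev.byte_count) * 8 / interval


def is_elephant(prev: FlowSample, curr: FlowSample,
                threshold_bps: float = 1e9) -> bool:
    """True if the flow exceeded the (assumed) threshold in this period."""
    return rate_bps(prev, curr) > threshold_bps


# 250 MB transferred in 0.5 s corresponds to 4 Gbit/s.
prev = FlowSample(dest_qp=0x12, timestamp_s=0.0, byte_count=0)
curr = FlowSample(dest_qp=0x12, timestamp_s=0.5, byte_count=250_000_000)
print(rate_bps(prev, curr), is_elephant(prev, curr))
```

The same delta computation could run either on the gateway (for classification) or on the controller (for ranking flows by bandwidth); the draft does not mandate where it happens.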

4.2. Decision: SL-level Load Imbalance Detection

With visibility into both elephant flow attributes and per-SL utilization, the controller correlates the two dimensions:

  1. For each SRv6 Policy, the controller examines the utilization of each constituent Segment List.

  2. When the utilization of a specific SL exceeds a configured threshold while other SLs within the same Policy have available capacity, the controller identifies an imbalance condition.

  3. The controller selects one or more elephant flows currently assigned to the overloaded SL as candidates for migration, prioritizing flows with the largest bandwidth contribution.

  4. The controller computes a target assignment: which elephant flow (identified by QP) should be steered to which Segment List to restore balance.

This decision process operates continuously, enabling the system to adapt to dynamic changes in traffic patterns — new elephant flows appearing, existing flows terminating, or path capacity changing due to failures or maintenance.
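The four steps above can be sketched in Python as follows. This is one possible heuristic under stated assumptions, not a prescribed algorithm: utilization is expressed in integer percent, the least-loaded SL is taken as the migration target, and the controller picks the largest flow on the overloaded SL that still fits under the threshold on the target. The function name, the data layout, and the 70% threshold are hypothetical.

```python
def plan_migration(sl_util, flows, threshold=70):
    """Return (qp, source_sl, target_sl), or None if no move is needed.

    sl_util: dict mapping SL name -> utilization in percent
    flows:   dict mapping QP -> (current SL name, percent contributed)
    """
    # Steps 1-2: find the most loaded SL and check it against the threshold.
    hot_sl = max(sl_util, key=sl_util.get)
    if sl_util[hot_sl] <= threshold:
        return None
    target_sl = min(sl_util, key=sl_util.get)

    # Step 3: candidates are flows on the overloaded SL, largest first.
    candidates = sorted(
        ((share, qp) for qp, (sl, share) in flows.items() if sl == hot_sl),
        reverse=True,
    )

    # Step 4: pick the largest candidate that does not overload the target.
    for share, qp in candidates:
        if sl_util[target_sl] + share <= threshold:
            return (qp, hot_sl, target_sl)
    return None


# Illustrative numbers, not taken from the example in Section 4.5.
sl_util = {"SL-A": 80, "SL-B": 30, "SL-C": 40}
flows = {101: ("SL-A", 45), 102: ("SL-A", 25), 103: ("SL-B", 30)}
print(plan_migration(sl_util, flows))
```

With these inputs the largest flow (QP 101) would push SL-B to 75%, so the sketch falls back to QP 102. A production scheduler would likely also weigh latency and loss metrics and dampen repeated moves of the same flow.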

4.3. Enforcement: Flowspec-based QP-aware Steering to Segment List

The controller enforces its scheduling decisions by distributing BGP Flowspec policies to the AI DCI Gateway (GW), which is the SRv6 Policy headend. This requires two protocol extensions beyond standard Flowspec capabilities:

  • QP-level matching [I-D.lll-idr-flowspec-filter-qp]: A new Flowspec component type (Destination-QP) enables the Flowspec filter to match traffic by its RoCEv2 Queue Pair identifier. This allows the controller to target specific elephant flows — rather than all traffic matching a 5-tuple — for steering actions.

  • SL-level redirect [I-D.ll-idr-flowspec-redirect-sidlist]: A new ID-Type in the Flowspec redirect extended community enables the action to target a specific Segment List within an SRv6 Policy, rather than the Policy as a whole. This provides the precision needed for fine-grained load rebalancing.

Upon receiving the Flowspec policy, the GW installs a forwarding rule that matches incoming RoCEv2 packets by QP and steers matching traffic into the designated Segment List for SRv6 encapsulation and forwarding.

4.3.1. Motivation for QP-level Flowspec Matching

AI computing traffic is predominantly RoCEv2, and the server NIC may split a large data transfer across multiple QPs. From the WAN perspective, each QP represents a distinct sub-flow that can be independently scheduled. Although these per-QP sub-flows are still large compared to conventional Internet traffic, the granularity is significantly finer than scheduling the entire aggregate.

Standard 5-tuple Flowspec matching cannot distinguish between QPs sharing the same source/destination addresses and ports. Without QP-level matching, the controller would have to steer all QPs of a flow together, losing the ability to distribute sub-flows across different paths.

4.4. Default Path Selection: QP-aware Hash within SL

In the absence of explicit Flowspec steering, the GW selects a Segment List for each packet using a hash-based mechanism. When deep packet inspection identifies a RoCEv2 packet (UDP destination port 4791), the GW extracts the Destination QP from the InfiniBand transport header and incorporates it into a 6-tuple hash: (Source IP, Destination IP, Source Port, Destination Port, Protocol, Dest QP). The resulting hash value is written into the 20-bit Flow Label field of the outer IPv6 header, which precedes the Segment Routing Header [RFC8754].

Subsequent P routers along the path include the outer Flow Label in their forwarding hash, ensuring that all packets of the same QP follow the same path (preserving packet ordering) while different QPs are distributed across available SLs.
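The following Python sketch illustrates the QP-aware hash. It assumes the standard 12-byte InfiniBand Base Transport Header (BTH) at the start of the UDP payload, with the 24-bit Destination QP in bytes 5 to 7. The use of CRC-32 as the hash function, the modulo-based SL choice, and all function names are assumptions for illustration; a real gateway would use its own hardware hash.

```python
import ipaddress
import struct
import zlib

ROCEV2_UDP_PORT = 4791
BTH_LEN = 12  # opcode(1) flags(1) P_Key(2) rsvd(1) DestQP(3) A/rsvd(1) PSN(3)


def dest_qp_from_bth(udp_payload: bytes) -> int:
    """Extract the 24-bit Destination QP from the BTH."""
    if len(udp_payload) < BTH_LEN:
        raise ValueError("payload shorter than a BTH")
    return int.from_bytes(udp_payload[5:8], "big")


def flow_label(src_ip, dst_ip, src_port, dst_port, proto, udp_payload=b""):
    """Return a 20-bit Flow Label from the 5-tuple, plus Dest QP for RoCEv2."""
    key = (ipaddress.ip_address(src_ip).packed
           + ipaddress.ip_address(dst_ip).packed
           + struct.pack("!HHB", src_port, dst_port, proto))
    if proto == 17 and dst_port == ROCEV2_UDP_PORT:
        # RoCEv2: extend the hash input to a 6-tuple with the Dest QP.
        key += dest_qp_from_bth(udp_payload).to_bytes(3, "big")
    return zlib.crc32(key) & 0xFFFFF  # keep the low 20 bits


def select_sl(label: int, segment_lists):
    """Pick a Segment List from the Flow Label (assumed modulo mapping)."""
    return segment_lists[label % len(segment_lists)]


# A BTH whose Destination QP is 0x000123; other fields are placeholders.
bth = bytes([0x0A, 0x40, 0xFF, 0xFF, 0x00, 0x00, 0x01, 0x23]) + bytes(4)
label = flow_label("2001:db8::1", "2001:db8::2", 49152, 4791, 17, bth)
print(dest_qp_from_bth(bth), select_sl(label, ["SL1", "SL2", "SL3"]))
```

Because the label depends only on header fields that are constant for a QP, every packet of that QP receives the same label and therefore the same path, which is the ordering property described above.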

4.4.1. Relationship Between Hash and Flowspec Steering

QP-aware hash provides the static baseline: it distributes flows across SLs without controller involvement and works for all traffic without per-flow state at the controller. However, hash-based selection has inherent limitations:

  • Hash collisions: Multiple QPs may hash to the same SL, especially when the number of active QPs is small or when the endpoint does not map QP to the UDP source port (reducing input entropy).

  • No global visibility: Each GW hashes independently without knowledge of the load state of downstream SLs. A hash outcome that is locally uniform may still produce global imbalance when multiple GWs feed the same WAN paths.

  • Static mapping: Hash outcomes are deterministic and do not adapt to changing path conditions. A persistent collision remains until the flow terminates or the hash input changes.

Flowspec-based steering operates as the dynamic correction layer on top of the hash baseline. When the controller detects that hash outcomes have produced SL-level imbalance, it issues explicit QP-to-SL mappings via Flowspec that override the default hash selection for the affected flows. When the imbalance resolves (e.g., a conflicting flow terminates), the controller withdraws the Flowspec rule and the flow reverts to hash-based selection.

This two-layer design — hash for steady-state distribution, Flowspec for dynamic correction — provides both scalability (the controller does not need to make per-flow decisions for all traffic) and precision (the controller can surgically correct specific imbalances).
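The interaction between the two layers can be sketched as a lookup with an override table. The class and method names below are hypothetical, and the sketch models only the forwarding decision at the GW; it does not model BGP Flowspec encoding or the redirect extended community.

```python
class SegmentListSelector:
    """Hash-based SL selection with explicit per-QP overrides."""

    def __init__(self, segment_lists):
        self.segment_lists = list(segment_lists)
        self.overrides = {}  # Destination QP -> SL name, set by the controller

    def install(self, dest_qp, sl_name):
        """Apply a QP-to-SL steering rule received from the controller."""
        if sl_name not in self.segment_lists:
            raise ValueError(f"unknown Segment List: {sl_name}")
        self.overrides[dest_qp] = sl_name

    def withdraw(self, dest_qp):
        """Remove a steering rule; the QP reverts to hash-based selection."""
        self.overrides.pop(dest_qp, None)

    def select(self, dest_qp, flow_hash):
        """Return the SL for a packet; dest_qp is None for non-RoCEv2."""
        if dest_qp in self.overrides:
            return self.overrides[dest_qp]
        return self.segment_lists[flow_hash % len(self.segment_lists)]


selector = SegmentListSelector(["SL1", "SL2", "SL3"])
before = selector.select(dest_qp=3, flow_hash=6)   # hash baseline
selector.install(3, "SL2")
steered = selector.select(dest_qp=3, flow_hash=6)  # explicit override
selector.withdraw(3)
after = selector.select(dest_qp=3, flow_hash=6)    # back to the baseline
print(before, steered, after)
```

Only steered QPs consume an entry in the override table, which reflects the scalability argument above: state is proportional to the number of corrections, not to the number of flows.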

4.5. End-to-End Example

Consider a deployment where the GW has an SRv6 Policy toward the remote DC with three Segment Lists: SL1, SL2, and SL3. Four elephant flows (QP1, QP2, QP3, QP4) are active.

  1. Initial state (hash-based): The GW hashes the four QPs. Due to a hash collision, QP1 and QP3 both land on SL1. SL2 carries QP2, SL3 carries QP4. SL1 utilization is 65%, SL2 is 30%, SL3 is 30%.

  2. Monitoring: The GW reports flow-level Telemetry to the controller, including per-QP byte counts and current SL assignment. The controller observes SL1 overload.

  3. Decision: The controller selects QP3 for migration to SL2. QP3 is the smaller of the two flows on SL1; moving the larger QP1 instead would only shift the hotspot to the target SL.

  4. Enforcement: The controller issues a BGP Flowspec policy matching Destination-QP = QP3 with action redirect to SL2. The GW installs the rule and steers QP3 traffic into SL2.

  5. Result: SL1 carries QP1 (35%), SL2 carries QP2 + QP3 (60%), SL3 carries QP4 (30%). The hotspot on SL1 is relieved and the peak SL utilization drops from 65% to 60%.

5. Operational Considerations

5.1. AI DCI Gateway Requirements

The GW requires:

  • Deep packet inspection capability to parse InfiniBand transport headers and extract RoCEv2 QP identifiers from passing traffic.

  • Programmable hash engines supporting configurable 6-tuple input (including Dest QP) with Flow Label writeback.

  • Telemetry agent for streaming elephant flow reports to the controller at sub-second intervals.

  • BGP Flowspec receiver supporting the Destination-QP component and SL-level redirect extended community.

  • Sufficient TCAM/SRAM for concurrent elephant flow classification and Flowspec rule installation.

5.2. Controller Requirements

The controller requires:

  • Telemetry collector capable of ingesting per-flow reports from multiple GWs and correlating them with SRv6 Policy and SL state.

  • Real-time path-state monitoring via gRPC or streaming Telemetry from PE and P devices.

  • Scheduling algorithm that correlates elephant flow bandwidth with per-SL utilization to compute optimal QP-to-SL reassignments.

  • BGP Flowspec speaker for distributing steering policies to GWs.

5.3. Non-RoCEv2 Traffic

For traffic that is not RoCEv2 (i.e., UDP destination port is not 4791), the system reverts to standard 5-tuple hash-based SL selection. Flowspec policies targeting Destination-QP do not match non-RoCEv2 traffic, which falls through to the default hash behavior.

5.4. Incremental Deployment

The hash-based SL selection and the controller-driven Flowspec steering can be deployed independently. An operator may begin with hash-based selection alone and introduce the controller loop progressively as Telemetry and Flowspec capabilities are enabled on the GW and controller.

6. Security Considerations

TBD

7. IANA Considerations

This document has no IANA actions. The protocol extensions it references are specified in [I-D.lll-idr-flowspec-filter-qp] and [I-D.ll-idr-flowspec-redirect-sidlist], which contain the respective IANA requests.

8. References

8.1. Normative References

[RFC8986]
Filsfils, C., Ed., Camarillo, P., Ed., Leddy, J., Voyer, D., Matsushima, S., and Z. Li, "Segment Routing over IPv6 (SRv6) Network Programming", RFC 8986, DOI 10.17487/RFC8986, February 2021, <https://www.rfc-editor.org/rfc/rfc8986>.

8.2. Informative References

[I-D.ietf-idr-flowspec-path-redirect]
Van de Velde, G., Patel, K., and Z. Li, "Flowspec Indirection-id Redirect", Work in Progress, Internet-Draft, draft-ietf-idr-flowspec-path-redirect-13, <https://datatracker.ietf.org/doc/html/draft-ietf-idr-flowspec-path-redirect-13>.
[I-D.ll-idr-flowspec-redirect-sidlist]
Li, J., "BGP Flow Specification Redirect to SRv6 Segment List", Work in Progress, Internet-Draft, draft-ll-idr-flowspec-redirect-sidlist-01, <https://datatracker.ietf.org/doc/draft-ll-idr-flowspec-redirect-sidlist/>.
[I-D.lll-idr-flowspec-filter-qp]
Li, J., Liu, Y., and R. Chen, "BGP Flow Specification Filtered by Destination-QP", Work in Progress, Internet-Draft, draft-lll-idr-flowspec-filter-qp-01, <https://datatracker.ietf.org/doc/draft-lll-idr-flowspec-filter-qp/>.
[RFC8754]
Filsfils, C., Ed., Dukes, D., Ed., Previdi, S., Leddy, J., Matsushima, S., and D. Voyer, "IPv6 Segment Routing Header (SRH)", RFC 8754, DOI 10.17487/RFC8754, March 2020, <https://www.rfc-editor.org/rfc/rfc8754>.
[RFC9256]
Filsfils, C., Talaulikar, K., Ed., Voyer, D., Bogdanov, A., and P. Mattes, "Segment Routing Policy Architecture", RFC 9256, DOI 10.17487/RFC9256, July 2022, <https://www.rfc-editor.org/rfc/rfc9256>.

Appendix A. Acknowledgements

The authors would like to thank the contributors from Huawei Technologies, ZTE Corporation, and H3C Technologies for their valuable feedback on SRv6-based adaptive load balancing for AI DCI.

Appendix B. Document History

-00 Initial version.

Authors' Addresses

Jiming Li
China Mobile
Yisong Liu
China Mobile
Changwang Lin
New H3C Technologies
Quan Xiong
ZTE Corporation
Ka Zhang
Huawei Technologies