| Internet-Draft | Switching Efficiency | April 2026 |
| Ye, et al. | Expires 21 October 2026 | [Page] |
This document specifies the Switching Efficiency Framework, a measurement methodology for evaluating network efficiency in AI Data Centers (AIDCs). Conventional network metrics, such as bandwidth utilization or network throughput, fail to link network activity directly to computational progress, because they cannot distinguish computationally effective data, which directly advances neural network computing, from the redundant traffic induced by multi-hop forwarding and by the algorithmic overhead of collective operations.¶
The core metric, Switching Efficiency, quantifies the computationally effective data throughput delivered per unit of provisioned switching capacity. To enable precise diagnostic analysis, the framework decomposes this metric into three fine-grained factors: Data Efficiency, Routing Efficiency, and Port Utilization.¶
This framework provides network operators with standardized quantitative metrics to pinpoint communication bottlenecks and to evaluate topology-traffic alignment.¶
This note is to be removed before publishing as an RFC.¶
Status information for this document may be found at https://datatracker.ietf.org/doc/draft-ye-ippm-switching-efficiency/.¶
Discussion of this document takes place on the ippm Working Group mailing list (mailto:ippm@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/ippm/. Subscribe at https://www.ietf.org/mailman/listinfo/ippm/.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 21 October 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
In hyperscale AI Data Centers (AIDCs), network communication is frequently the primary performance bottleneck for training Large Language Models (LLMs). While diverse network topologies and communication algorithms (e.g., In-Network Computing) are being deployed, operators lack a standardized, quantitative methodology to evaluate how effectively raw physical switching resources are converted into actual training progress.¶
Conventional performance metrics, such as bandwidth utilization or network throughput, are inadequate for this environment because they measure absolute network "busyness" rather than useful work. Specifically, they treat all transferred bytes equally, failing to isolate "computationally effective data"—the net data that directly advances neural network computing. For example, during an All-Reduce operation, significant volumes of data are transferred across the fabric only to be discarded after mathematical reduction (algorithmic overhead). Similarly, when the physical topology fails to match the spatial distribution of the workload—such as forcing logically localized, high-volume traffic to cross the broader scale-out fabric—data must traverse an excessive number of forwarding hops (multi-hop overhead). Because traditional metrics conflate these redundancies with effective data delivery, operators cannot accurately quantify how well a specific network architecture aligns with its intended AI traffic patterns.¶
To bridge this gap, this document defines the Switching Efficiency Framework [SwitchingEfficiencyPaper], which relates the throughput of effective data to the aggregate switching capacity of the network through its core metric, Switching Efficiency ($\eta$). This top-level metric is further decomposed into three diagnostic factors to evaluate specific architectural design choices: Data Efficiency ($\gamma$) tests the communication algorithm, verifying whether it delivers computationally effective data or generates redundant bytes; Routing Efficiency ($\delta$) tests the topology-traffic alignment, revealing whether the physical network provides direct paths or forces traffic into excessive multi-hop detours; and Port Utilization ($\theta$) tests hardware resource allocation, assessing whether the provisioned switching capacity is actively utilized rather than wasted.¶
By formalizing these metrics, this document equips network operators and telemetry systems with a standardized, mathematically precise toolset to diagnose AIDC network performance, pinpoint communication bottlenecks, and optimize infrastructure design.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Computationally Effective Data (CED): The net volume of data yielded by a communication operation that is directly consumed by the subsequent neural network computation phase. CED explicitly excludes any unreduced or protocol-overhead data transmitted across the network during the operation.¶
For non-reduction operations (e.g., All-Gather, All-to-All dispatch), CED equals the aggregate newly received data volume at the endpoints.¶
For reduction operations (e.g., All-Reduce, Reduce-Scatter, All-to-All combine), CED is quantified strictly by the final mathematically reduced output retained by the endpoints.¶
Switching Capacity: The aggregate theoretical data forwarding rate of all electrical packet switch ports within the evaluated network domain. To accurately reflect the heterogeneous hardware of modern AI Data Centers, this capacity MUST encompass all functional transit components, specifically:¶
Standalone network switches (e.g., standard Ethernet or InfiniBand switches acting as Top-of-Rack, Leaf, or Spine).¶
Embedded switching elements within a single compute chassis (e.g., NVSwitch interconnecting GPUs within a server).¶
Forwarding ports residing natively on the compute accelerators (e.g., Google TPUs).¶
In-Network Computing (INC): A network architecture paradigm where mathematical or logical operations (such as data reduction in collective communications) are executed within the network data plane (e.g., by programmable switches) while data is in transit. In the context of AI Data Centers, INC is typically deployed to offload collective communication reductions (e.g., performing arithmetic operations for All-Reduce directly on the switch), thereby eliminating the transmission of unreduced data and delivering only the reduced results to the endpoints.¶
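To make the CED accounting rules above concrete, the following non-normative Python sketch applies them to a few common collectives. The function name and the per-operation formulas (derived from the definitions above for N endpoints, each contributing an S-byte buffer) are illustrative assumptions, not part of this specification.

```python
# Non-normative sketch of the CED accounting rules defined above.
# Assumptions: n endpoints, each contributing an s-byte buffer; the
# function name and formulas are illustrative only.

def ced_bytes(op: str, n: int, s: int) -> int:
    """Aggregate CED volume for one completed collective operation."""
    if op == "all_gather":
        # Non-reduction: each endpoint newly receives the other n-1 buffers.
        return n * (n - 1) * s
    if op == "all_reduce":
        # Reduction: only the reduced s-byte buffer retained per endpoint counts.
        return n * s
    if op == "reduce_scatter":
        # Reduction: each endpoint retains a 1/n shard of the reduced buffer.
        return n * (s // n)
    raise ValueError(f"unknown operation: {op}")

# Example: 8 endpoints, 1 MiB buffers.
print(ced_bytes("all_gather", 8, 1 << 20))  # aggregate newly received data
print(ced_bytes("all_reduce", 8, 1 << 20))  # aggregate retained reduced output
```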
This section defines the Switching Efficiency Framework. The detailed mathematical derivations supporting this framework are provided in [SwitchingEfficiencyPaper]. For operational measurement, the following metrics are formulated as cumulative volumes over a defined observation window $T$.¶
The framework relies on four primary operational metrics collected over the measurement window $T$:¶
$V_{CED}$ (Total CED Volume): The aggregate volume of Computationally Effective Data yielded by all communication primitives completed during $T$.¶
$V_{RECV}$ (Total Received Volume): The aggregate volume of data successfully received by the network interfaces (e.g., NICs) of all compute nodes during $T$.¶
$V_{FWD}$ (Total Forwarded Volume): The aggregate volume of data forwarded by all packet switching ports across the network domain during $T$.¶
$C_{TOTAL}$ (Aggregate Switching Capacity): The sum of the theoretical maximum unidirectional egress data forwarding rates of all packet switching ports, denoted as $\sum R_p$, where $R_p$ represents the theoretical maximum data rate of an individual port $p$.¶
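As a non-normative illustration of the $C_{TOTAL}$ derivation, the following Python sketch sums hypothetical per-port rates $R_p$ from a static inventory; the port identifiers and speeds are invented examples.

```python
# Non-normative sketch: deriving C_TOTAL from a static port inventory.
# Port identifiers and speeds are invented examples.

BYTES_PER_GBPS = 1e9 / 8  # bytes/second per Gbit/s

# (port identifier, theoretical maximum unidirectional rate R_p in Gbit/s)
port_inventory = [
    ("leaf1/eth1", 400),
    ("leaf1/eth2", 400),
    ("spine1/eth1", 800),
]

# C_TOTAL = sum of R_p over all packet switching ports (bytes/second here).
c_total = sum(rate * BYTES_PER_GBPS for _, rate in port_inventory)
print(c_total)  # aggregate switching capacity in bytes per second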
Switching Efficiency ($\eta$) is the top-level metric quantifying how effectively a network translates its raw physical capacity into computational progress. It is defined as the ratio of the CED throughput over observation window $T$ to the aggregate switching capacity of the network.¶
       V_CED / T
  η = -----------
        C_TOTAL
¶
A high $\eta$ indicates that a large proportion of the network's provisioned hardware capacity is successfully contributing to the delivery of computationally effective data. It serves as a holistic macro-indicator of end-to-end network effectiveness.¶
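A non-normative numerical sketch of this definition, using invented volumes:

```python
# Non-normative numerical sketch of η; all values are invented.
T = 60.0          # observation window (seconds)
v_ced = 1.2e12    # V_CED: total CED volume during T (bytes)
c_total = 4.0e11  # C_TOTAL: aggregate switching capacity (bytes/second)

eta = (v_ced / T) / c_total  # η = (V_CED / T) / C_TOTAL
print(eta)  # fraction of provisioned capacity delivering effective data
```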
To enable diagnostic analysis and isolate specific performance bottlenecks, $\eta$ is mathematically decomposed into three independent efficiency factors ($\eta = \gamma \cdot \delta \cdot \theta$):¶
Data Efficiency evaluates the effectiveness of implementing the communication primitives. It specifies the ratio of Computationally Effective Data ($V_{CED}$) to the total received volume ($V_{RECV}$).¶
       V_CED
  γ = ------
      V_RECV
¶
Diagnostic Focus: Identifies data reception redundancy. A value of $\gamma < 1$ indicates that compute endpoints receive unreduced data (e.g., during All-Reduce operations without INC). Executing mathematical reductions within the network data plane via INC resolves this redundancy, driving $\gamma$ to its theoretical maximum of 1.¶
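As a non-normative illustration, the following Python sketch estimates $\gamma$ for All-Reduce with and without INC, assuming the common ring All-Reduce cost model in which each endpoint receives $2(N-1)/N$ times its buffer size; that cost model is an assumption for illustration, not a requirement of this framework.

```python
# Non-normative sketch: γ for All-Reduce with and without INC, assuming
# the common ring All-Reduce cost model (each endpoint receives
# 2*(n-1)/n times its buffer size). The cost model is an illustrative
# assumption, not part of this specification.

def gamma_all_reduce(n: int, with_inc: bool) -> float:
    s = 1.0                   # per-endpoint buffer size (cancels out)
    v_ced = n * s             # reduced output retained across all endpoints
    if with_inc:
        v_recv = n * s        # endpoints receive only the reduced result
    else:
        v_recv = n * (2 * (n - 1) / n) * s  # ring reduce-scatter + all-gather
    return v_ced / v_recv

print(gamma_all_reduce(8, with_inc=False))  # n / (2*(n-1)), about 0.57 for n=8
print(gamma_all_reduce(8, with_inc=True))   # 1.0: INC removes the redundancy
```

The INC case corresponds to $\gamma$ reaching its theoretical maximum of 1, as noted above.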
Routing Efficiency quantifies the topological alignment between the physical network architecture and the AI workload traffic patterns.¶
      V_RECV
  δ = ------
      V_FWD
¶
Diagnostic Focus: Identifies multi-hop forwarding overhead and potential packet retransmissions. Mathematically, assuming a perfectly lossless network environment, $\delta$ represents the inverse of the volume-weighted average hop count. A value of $\delta < 1$ indicates that traffic either traverses multiple switching ports or experiences network congestion leading to drops and subsequent retransmission overhead.¶
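A non-normative worked example of this relationship, with invented per-flow volumes and hop counts:

```python
# Non-normative worked example: δ as the inverse of the volume-weighted
# average hop count in a lossless network. Flow volumes and hop counts
# are invented.

flows = [
    # (delivered volume in bytes, switch ports traversed)
    (8e9, 1),  # rack-local traffic crossing a single ToR port
    (2e9, 3),  # cross-rack traffic crossing leaf, spine, and leaf ports
]

v_recv = sum(v for v, _ in flows)           # V_RECV: delivered volume
v_fwd = sum(v * hops for v, hops in flows)  # V_FWD: volume counted per hop

delta = v_recv / v_fwd
avg_hops = v_fwd / v_recv  # volume-weighted average hop count
print(delta, avg_hops)     # δ equals 1 / avg_hops
```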
Port Utilization measures the spatial and temporal engagement of the provisioned switching capacity.¶
         V_FWD
  θ = -----------
      C_TOTAL * T
¶
Diagnostic Focus: Identifies underutilized switching capacity. A low $\theta$ indicates that the provisioned hardware ($C_{TOTAL}$) operates below its theoretical maximum data rate over the observation window $T$, due to either spatial traffic imbalance or temporal idleness.¶
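The following non-normative sketch illustrates, with invented volumes, that the three factors multiply back to $\eta$ because the intermediate volumes cancel:

```python
# Non-normative sketch: the three factors multiply back to η because
# the intermediate volumes V_RECV and V_FWD cancel. Values are invented.

T = 60.0          # observation window (seconds)
c_total = 4.0e11  # C_TOTAL (bytes/second)
v_ced = 1.2e12    # V_CED (bytes)
v_recv = 2.4e12   # V_RECV (bytes)
v_fwd = 4.8e12    # V_FWD (bytes)

gamma = v_ced / v_recv         # Data Efficiency
delta = v_recv / v_fwd         # Routing Efficiency
theta = v_fwd / (c_total * T)  # Port Utilization
eta = (v_ced / T) / c_total    # Switching Efficiency

assert abs(eta - gamma * delta * theta) < 1e-12  # η = γ · δ · θ
print(gamma, delta, theta, eta)
```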
This section specifies the operational procedures for collecting the variables required to compute the efficiency metrics. Accurate measurement requires tight time synchronization (e.g., via the Precision Time Protocol (PTP) [IEEE1588]) across all network and compute endpoints, as well as an observation window ($T$) sufficiently large to dilute telemetry polling variance.¶
The four core variables span the network, endpoint, and application planes, and are collected as follows:¶
$C_{TOTAL}$ (Aggregate Switching Capacity): Derived from the static topology inventory. It requires summing the operational link speeds of all packet switching ports within the measured network.¶
$V_{FWD}$ (Total Forwarded Volume): Collected from the network plane. Operators MUST extract the aggregate egress byte counters from the switching hardware (e.g., switch Application-Specific Integrated Circuits (ASICs)). This is typically achieved via push-based streaming telemetry (e.g., the gRPC Network Management Interface (gNMI)) or via the Simple Network Management Protocol (SNMP) [RFC3411].¶
$V_{RECV}$ (Total Received Volume): Collected from the endpoint plane. Operators MUST extract the aggregate ingress byte counters from the host network interfaces, such as Remote Direct Memory Access (RDMA) capable Network Interface Cards (NICs) or the compute accelerators themselves.¶
$V_{CED}$ (Total CED Volume): Collected from the application plane. To avoid the prohibitive overhead of parsing verbose logs, operators SHOULD utilize lightweight collection mechanisms. Recommended approaches include host-side telemetry agents, Extended Berkeley Packet Filter (eBPF) hooks dynamically attached to collective communication APIs, or native metrics endpoints exposed by standard communication libraries (e.g., Message Passing Interface (MPI), or vendor-specific equivalents like NCCL/RCCL).¶
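As a non-normative sketch of the counter-delta arithmetic, the following Python fragment derives a window volume from hypothetical start- and end-of-window snapshots; the counter names are invented.

```python
# Non-normative sketch of the counter-delta arithmetic for one window.
# Counter names are invented; in deployment, V_FWD deltas would come
# from switch egress counters (gNMI/SNMP), V_RECV from NIC ingress
# counters, and V_CED from an application-plane agent.

def window_volume(start: dict, end: dict) -> int:
    """Sum per-counter deltas between two time-synchronized snapshots."""
    return sum(end[key] - start[key] for key in start)

# Egress byte counters snapshotted at the start and end of window T.
fwd_t0 = {"leaf1/eth1": 100, "spine1/eth1": 50}
fwd_t1 = {"leaf1/eth1": 900, "spine1/eth1": 450}

v_fwd = window_volume(fwd_t0, fwd_t1)
print(v_fwd)  # total forwarded volume during the window
```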
The operational deployment of this measurement framework raises the following security and privacy considerations:¶
Data Confidentiality: Collecting $V_{CED}$ and $V_{RECV}$ can inadvertently expose proprietary AI workload characteristics (e.g., model architecture or training strategies). Telemetry data MUST be transported over encrypted channels, such as Transport Layer Security (TLS) [RFC8446] or Internet Protocol Security (IPsec) [RFC4301], and securely stored.¶
Measurement Integrity: Falsifying the underlying counters ($V_{FWD}$, $V_{RECV}$, $V_{CED}$) will manipulate the calculated efficiency metrics. Robust authentication and authorization MUST be enforced for all telemetry endpoints to prevent data poisoning.¶
This document has no IANA actions.¶
We are grateful for the valuable discussions and input from the community. We thank NSFC for its support.¶