Problem Statement and Requirements for Fast Network Event Notification in Distributed AI Training and Inference

Internet-Draft	Fast Network Event Notification	July 2026
Sang, et al.	Expires 3 January 2027

Abstract

Distributed AI training and inference rely on tightly coordinated communication across large-scale AI fabrics, making timely awareness of network conditions essential to application performance. Network events, including congestion, link degradation, path changes, and device failures, can significantly affect collective communication efficiency, job completion time, and overall resource utilization. Existing network event notification mechanisms are primarily designed for general-purpose IP networks and do not adequately address the timeliness, semantics, and coordination requirements of distributed AI workloads.¶

This document identifies the problem space for fast network event notification in distributed AI training and inference environments. It presents representative use cases, identifies gaps in existing approaches, and derives a set of functional and operational requirements for timely, reliable, and interoperable dissemination of network events across AI fabrics. These requirements are intended to facilitate future work on network architectures and protocols for AI networking. This document does not specify a protocol, signaling mechanism, or protocol extension.¶

1. Introduction

1.1. Background and Motivation

Recent advances in foundation models have accelerated the deployment of distributed AI training and inference across large-scale computing infrastructures. Compared with conventional cloud applications, distributed AI workloads generate sustained high-bandwidth traffic and rely on tightly synchronized communication among a large number of computing nodes. As a result, application performance is highly sensitive to network conditions, particularly during collective communication operations.¶

To support these workloads, modern data centers increasingly deploy dedicated high-performance networking infrastructures, commonly referred to as AI Fabrics. An AI Fabric integrates high-speed network interconnects, accelerators, and scheduling systems to provide scalable communication for large GPU clusters. Technologies such as Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) are widely adopted to reduce communication latency and improve transport efficiency for distributed AI applications.¶

Distributed AI workloads depend on collective communication primitives, including AllReduce, AllGather, ReduceScatter, and pipeline-parallel communication, which require coordinated participation from hundreds or thousands of compute nodes. The overall execution time of these operations is often determined by the slowest participant. Consequently, transient network events, such as congestion, link degradation, path changes, or device failures, can interrupt communication synchronization, create straggler nodes, and significantly reduce overall training and inference efficiency.¶

Existing network monitoring and event notification mechanisms are primarily designed for general-purpose IP networks, where traffic is relatively elastic and applications are generally tolerant of transient network fluctuations. In contrast, distributed AI workloads require timely and consistent awareness of network events to enable rapid adaptation by communication libraries, runtime systems, schedulers, or network controllers. As AI Fabrics continue to increase in scale and complexity, existing mechanisms provide limited support for the responsiveness and coordination required by these environments, motivating the need to identify requirements for fast network event notification.¶

1.2. Scope

This document focuses on the problem space of fast network event notification for distributed AI training and inference deployed over AI Fabrics. It examines the communication characteristics of distributed AI workloads, identifies limitations of existing network event notification mechanisms, and derives a set of functional and operational requirements from representative deployment scenarios.¶

The scope of this document is limited to the problem statement, use case analysis, and requirement identification. It does not define a network protocol, signaling mechanism, routing or forwarding behavior, traffic engineering algorithm, YANG data model, or implementation approach. Protocol specification and solution design are considered out of scope.

The objective of this document is to provide a common understanding of the problem space and associated requirements, serving as input to future work on AI networking architectures, protocols, and management models. It is intended to facilitate discussion and interoperability across implementations rather than prescribe a specific technical solution.¶

The requirements identified in this document are intended to be technology-neutral and are applicable to AI networking environments regardless of the underlying transport technology or network implementation.

1.3. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].¶

2. Terminology

The terms defined in this section are used for the purposes of this document only and do not redefine existing IETF terminology.

AI Fabric: A networking infrastructure designed to interconnect large-scale AI computing resources and support distributed AI workloads. An AI Fabric provides high-performance communication among compute nodes and is optimized for large-scale collective communication and accelerator-centric traffic patterns.¶

AI Job: A distributed training or inference task executed across one or more compute nodes within an AI Fabric. An AI job typically requires coordinated communication and resource allocation throughout its execution.¶

Distributed AI Training: A computing paradigm in which model training is distributed across multiple compute nodes to accelerate the training of large-scale machine learning models. Distributed AI training relies on frequent synchronization and collective communication to maintain model consistency.¶

Distributed AI Inference: A deployment model in which inference workloads are distributed across multiple compute nodes to improve scalability, throughput, or latency. Such deployments may require communication and synchronization among participating nodes.¶

Network Event: A change in network state that may affect the communication performance of distributed AI workloads. Examples include congestion, link degradation, path changes, packet loss, and device failures.¶

Fast Network Event Notification: A mechanism for disseminating network events to relevant entities with sufficiently low latency to enable timely adaptation by distributed AI applications, communication libraries, runtime systems, or network controllers.¶

Telemetry: A mechanism for collecting and exporting network state information, including traffic statistics, device status, and link performance, for network monitoring and operational purposes.¶

Control Plane: The set of network functions responsible for topology discovery, routing, path computation, policy distribution, and other control functions that determine network behavior.¶

Data Plane: The set of network functions responsible for forwarding packets and carrying application traffic between communicating endpoints.¶

3. Problem Statement

This section examines the problem space for fast network event notification in AI Fabrics. It describes the communication characteristics of distributed AI workloads, discusses the limitations of existing network event notification mechanisms, and identifies the capability gaps that motivate the functional and operational requirements presented in Section 5.

3.1. AI Fabric Traffic and Workload Characteristics

Distributed AI training and inference exhibit communication characteristics that differ significantly from those of conventional data-center applications. Rather than being limited by raw network throughput alone, the performance of distributed AI workloads depends heavily on timely and coordinated communication among a large number of participating compute nodes. Consequently, transient network events that have little impact on conventional applications may substantially affect collective communication efficiency, accelerator utilization, and overall job completion time. The following paragraphs summarize the communication characteristics that motivate the need for fast network event notification in AI Fabrics.

Collective Communication Dependency: Distributed AI training relies extensively on collective communication operations, including AllReduce, AllGather, All-to-All, and pipeline-parallel communication. These operations require coordinated participation from a large number of compute nodes, and the completion time of each communication round is often determined by the slowest participant. Consequently, network events affecting a single node or communication path may delay the entire collective operation and reduce overall application efficiency. Timely dissemination of such events enables communication libraries and runtime systems to react before performance degradation propagates across the workload.¶

Bursty Traffic and Communication Imbalance: Distributed AI workloads generate communication patterns that differ from conventional client-server traffic. Collective operations frequently produce many-to-one traffic bursts, while model synchronization creates long-lived, high-bandwidth elephant flows. These traffic patterns are sensitive to transient congestion and localized performance degradation. Detecting and disseminating significant network events in a timely manner can help reduce the impact of communication imbalance on distributed AI execution.¶

Sensitivity to Network Latency and Transient Degradation: AI workloads are highly sensitive not only to network failures but also to transient performance degradation, including latency variation, packet loss, and path-quality changes. Even short-lived network events may interrupt communication synchronization, reduce accelerator utilization, and increase overall job completion time. Compared with conventional applications, distributed AI workloads therefore require faster awareness of network conditions to support timely adaptation.¶

Dynamic Runtime Adaptation: Modern AI systems continuously adapt workload placement, communication patterns, and resource allocation according to runtime conditions. Such adaptation increasingly depends on timely information about network state, including congestion, path degradation, and device availability. Efficient dissemination of network events enables runtime systems, communication libraries, and schedulers to coordinate their responses and improve the resilience and efficiency of distributed AI execution.¶
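
The straggler effect described under Collective Communication Dependency can be illustrated with a small worked example. The sketch below is purely illustrative: the node count and per-node times are invented numbers, and the function name is hypothetical. It only shows that the completion time of a synchronous collective is the maximum over its participants.

```python
# Illustrative only: node counts and per-node times are made-up numbers.
from typing import Sequence


def collective_completion_ms(per_node_ms: Sequence[float]) -> float:
    """A synchronous collective finishes only when its slowest participant does."""
    return max(per_node_ms)


healthy = [10.0] * 1024             # hypothetical: 1024 nodes, 10 ms each
degraded = healthy[:-1] + [35.0]    # one node sits behind a degraded path

baseline = collective_completion_ms(healthy)
with_straggler = collective_completion_ms(degraded)
slowdown = with_straggler / baseline
```

Under these assumed numbers, one impaired participant out of 1024 makes the whole round 3.5 times slower, which is why an event affecting a single path can matter to the entire job.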

The characteristics described above demonstrate that distributed AI workloads require more timely and application-aware dissemination of network events than conventional data-center applications. The following section examines the extent to which existing network event notification mechanisms satisfy these requirements.¶

3.2. Limitations of Existing Network Monitoring and Notification Mechanisms

Existing network event notification and monitoring mechanisms provide valuable capabilities for congestion indication, fault detection, routing recovery, and operational visibility in general-purpose IP networks. These mechanisms have been successfully deployed in a wide range of operational environments. However, they were not specifically designed to support the communication characteristics of distributed AI workloads described in the previous section. As a result, several limitations become apparent when they are applied to AI Fabric environments.¶

Explicit Congestion Notification (ECN) [RFC3168] provides lightweight in-band congestion indication and enables transport protocols to react before packet loss occurs. However, ECN conveys only limited congestion information and does not distinguish event severity, affected communication groups, or the operational impact on distributed AI workloads. In addition, ECN is primarily designed to signal congestion rather than other network events, such as path degradation or device anomalies. Consequently, ECN alone cannot provide sufficient information for AI runtimes and schedulers to perform workload-aware adaptation.

In-band telemetry mechanisms, such as In-band Network Telemetry (INT) and In Situ Operations, Administration, and Maintenance (IOAM) [RFC9197], provide detailed visibility into packet forwarding paths and network conditions. These mechanisms are primarily intended for network measurement and diagnostics rather than timely dissemination of network events. Furthermore, continuous telemetry collection may introduce considerable processing and operational overhead in large-scale AI clusters. As a result, in-band telemetry alone does not provide an efficient event-driven notification mechanism for distributed AI workloads.

Streaming telemetry continuously exports network state information to external monitoring systems and improves the timeliness of operational visibility compared with periodic polling. However, it focuses on exporting measurements rather than communicating actionable network events. Distributed AI workloads typically require concise and timely notification of significant network state changes instead of continuous streams of telemetry data.¶

Bidirectional Forwarding Detection (BFD) [RFC5880] provides rapid detection of link and neighbor failures and plays an important role in improving network resiliency. However, distributed AI workloads are often affected by transient performance degradation rather than complete failures. Conditions such as latency variation, packet loss, or localized congestion may significantly reduce collective communication efficiency while remaining outside the scope of BFD notifications.

Routing protocols restore network connectivity following topology changes or failures through protocol convergence. Although these mechanisms improve network availability, they primarily address reachability rather than communication quality. Furthermore, routing convergence is typically triggered after topology changes rather than transient network degradation. Consequently, routing mechanisms alone do not provide the timely, application-aware event dissemination required by distributed AI workloads.¶

The mechanisms discussed above provide complementary capabilities for congestion indication, telemetry, fault detection, and routing recovery. Nevertheless, none of them individually, nor their straightforward combination, fully addresses the communication characteristics of distributed AI workloads described in Section 3.1. The following section summarizes the common capability gaps observed across these mechanisms.

3.3. Capability Gap Analysis for AI Fabric Scenarios

Based on the workload characteristics described in Section 3.1 and the limitations of existing mechanisms discussed in Section 3.2, this section identifies the common capability gaps that prevent current network monitoring and notification mechanisms from fully supporting distributed AI workloads. These gaps motivate the functional and operational requirements presented in Section 5.

Notification Timeliness: Distributed AI workloads require network events to be delivered quickly enough to support runtime adaptation during communication-intensive operations. Existing mechanisms are often optimized for monitoring, diagnostics, or protocol convergence, resulting in notification latency that may exceed the timescale of AI communication iterations. Delayed notification limits the ability of communication libraries, runtime systems, and schedulers to mitigate the impact of transient network degradation before application performance is affected.¶

Event Granularity: Existing mechanisms primarily expose network status at the device, interface, or path level. Distributed AI workloads, however, often require finer-grained visibility into communication flows and collective operations in order to identify the affected participants and communication context. Insufficient event granularity limits the ability to perform targeted workload adaptation and localized performance optimization.¶

Event Semantics: Current network event notifications primarily describe network-centric conditions, such as congestion, packet loss, or link failures. However, distributed AI applications require richer event semantics that enable runtime systems to understand the operational impact of network events, including whether collective communication may be affected or whether adaptive actions should be initiated. Without such semantics, network events cannot be efficiently consumed by upper-layer AI software.¶

Cross-layer Coordination: Distributed AI workloads increasingly rely on coordinated interaction among communication libraries, runtime systems, schedulers, and network infrastructure. Existing notification mechanisms generally operate within the networking domain and provide limited support for efficient dissemination of network events across these components. As a result, network conditions cannot always be translated into timely workload adaptation or resource management decisions.¶

Interoperability: AI Fabrics are increasingly deployed across heterogeneous environments involving equipment from multiple vendors and diverse operational domains. Existing notification mechanisms often employ implementation-specific event formats, interfaces, or operational models, making consistent dissemination and interpretation of network events difficult. Improving interoperability is therefore important for enabling portable and vendor-neutral AI networking solutions.¶

The capability gaps described above indicate that existing mechanisms provide useful building blocks but do not collectively satisfy the operational requirements of distributed AI workloads. Addressing these gaps does not necessarily require replacing existing technologies. Instead, it motivates the definition of a common set of functional and operational requirements for fast network event notification in AI Fabric environments.¶

3.4. Problem Summary

The analysis presented in this section indicates that distributed AI workloads introduce communication characteristics that are not fully addressed by existing network monitoring and notification mechanisms. Although current mechanisms provide valuable capabilities for congestion indication, telemetry, fault detection, and routing recovery, they do not collectively satisfy the requirements for timely, fine-grained, semantically rich, and interoperable dissemination of network events in AI Fabric environments. These observations motivate the need for a common set of functional and operational requirements for fast network event notification, which are presented in Section 5.

4. Representative Use Cases

The capability gaps identified in Section 3 arise in a variety of operational scenarios in distributed AI training and inference. This section presents representative use cases that illustrate these scenarios and highlights where timely network event notification can improve coordination between the network and AI runtime systems. The observations from these use cases provide the basis for the requirements defined in Section 5.¶

4.1. UC1: Congestion Escalation During Collective Communication

Background: Distributed AI training relies heavily on collective communication operations such as AllReduce and AllGather. These operations generate synchronized many-to-one traffic bursts and long-lived elephant flows, making communication performance highly sensitive to transient congestion within the AI Fabric.¶

Network Event: Transient congestion develops on one or more forwarding paths during collective communication. Although the congestion may not immediately result in packet loss, it increases communication latency and delays synchronization across participating compute nodes, leading to straggler effects and reduced training throughput.¶

Limitation of Existing Mechanisms: Existing mechanisms such as ECN provide limited congestion indication, while telemetry mechanisms primarily support monitoring and post-event analysis. They do not provide sufficiently timely and workload-aware notification for AI communication libraries or runtime systems.¶

Implication for Fast Network Event Notification: The network should rapidly report congestion escalation together with sufficient context to identify the affected communication activities, enabling AI runtimes to react before communication performance deteriorates significantly.

4.2. UC2: Communication Performance Degradation

Background: Distributed AI workloads depend on stable communication quality over long-running training and inference sessions. Performance degradation may originate from link jitter, intermittent packet loss, NIC anomalies, or bandwidth fluctuation without causing complete connectivity failures.¶

Network Event: Communication quality gradually degrades because of transient or progressive network impairments. These impairments increase retransmissions and synchronization delays while remaining difficult to detect using traditional fault detection mechanisms.¶

Limitation of Existing Mechanisms: Current monitoring mechanisms primarily detect complete failures or export statistical measurements. They provide limited support for identifying gradual communication degradation or correlating such events with ongoing AI workloads.¶

Implication for Fast Network Event Notification: The notification mechanism should report communication quality degradation in a timely manner, allowing AI runtime systems to initiate workload adaptation before application performance is significantly affected.¶

4.3. UC3: Node and Path Failure

Background: Distributed AI applications rely on large numbers of compute nodes interconnected through redundant network paths. Failures affecting either compute nodes or forwarding paths may interrupt collective communication and delay workload execution.¶

Network Event: A compute node, network device, or forwarding path becomes unavailable, requiring communication sessions to recover through runtime adaptation or network rerouting.¶

Limitation of Existing Mechanisms: Existing failure detection and routing mechanisms focus primarily on restoring connectivity. They generally do not provide workload-aware notification that enables AI runtimes to coordinate communication recovery with network recovery.¶

Implication for Fast Network Event Notification: Fast notification of node and path failures should enable communication libraries, runtime systems, and schedulers to coordinate recovery actions and minimize the impact on distributed AI execution.¶

4.4. UC4: Runtime-driven Network Adaptation

Background: Modern AI platforms continuously perform workload placement, scaling, migration, and resource scheduling according to runtime conditions. These decisions increasingly depend on current network conditions.¶

Network Event: Network conditions change because of congestion, resource contention, or topology changes, requiring runtime systems to adjust communication patterns or workload placement.¶

Limitation of Existing Mechanisms: Existing monitoring systems primarily export measurements rather than delivering actionable events. Consequently, network information cannot always be incorporated into runtime adaptation in a timely manner.¶

Implication for Fast Network Event Notification: Network events should be disseminated in a form that can be efficiently consumed by AI runtime systems and schedulers to support coordinated workload adaptation.¶

4.5. UC5: Cross-domain AI Fabric Operation

Background: Large-scale AI deployments increasingly span multiple administrative domains and heterogeneous network infrastructures. Consistent dissemination of network events becomes more challenging in these environments.¶

Network Event: Network events occur within different operational domains and must be interpreted consistently across heterogeneous devices and management systems.¶

Limitation of Existing Mechanisms: Existing notification mechanisms often rely on implementation-specific event formats and interfaces, limiting interoperability across vendors and operational domains.¶

Implication for Fast Network Event Notification: Fast network event notification should support interoperable event representation and dissemination, enabling consistent interpretation of network events across heterogeneous AI Fabric environments.¶

The scenarios presented above illustrate representative situations in which timely and interoperable dissemination of network events can improve the operation of distributed AI workloads. Although the scenarios involve different types of network events, they collectively demonstrate the need for common capabilities in fast network event notification. These observations motivate the functional and operational requirements described in the following section.¶

5. Requirements

This section defines a set of functional and operational requirements for fast network event notification in AI Fabric environments. These requirements are derived from the capability gaps identified in Section 3 and the representative use cases described in Section 4. They are intended to guide the design and evaluation of future solutions rather than prescribe a specific protocol or implementation.¶

5.1. REQ-1: Timely Event Dissemination

Requirement:
The system SHOULD deliver network event notifications to subscribed consumers with sufficiently low latency to enable runtime or scheduling actions before transient network conditions significantly impact application performance. The system SHOULD adopt an event-driven push model for significant network state changes.¶

Discussion:
Distributed AI workloads rely on tightly synchronized communication patterns. Delayed visibility of network conditions reduces the effectiveness of runtime adaptation and may lead to performance degradation in collective communication operations.
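
To make the event-driven push model concrete, the following minimal sketch shows consumers registering interest once and then being notified without polling. It is an in-process illustration under stated assumptions, not a proposed mechanism: the `EventBus` class, the event type strings, and the link names are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Hypothetical in-process stand-in for an event-driven push channel."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, callback: Callable[[dict], None]) -> None:
        """Register a consumer callback for one event type."""
        self._subscribers[event_type].append(callback)

    def publish(self, event_type: str, event: dict) -> int:
        """Push the event to every subscriber; return how many were notified."""
        callbacks = self._subscribers.get(event_type, [])
        for callback in callbacks:
            callback(event)
        return len(callbacks)


received: List[dict] = []
bus = EventBus()
bus.subscribe("congestion", received.append)

notified = bus.publish("congestion", {"link": "leaf1-spine2", "severity": "high"})
ignored = bus.publish("link-down", {"link": "leaf3-spine1"})  # nobody subscribed
```

The consumer never queries for state: it receives the congestion event as soon as it is published, while events it did not subscribe to are not delivered to it.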

5.2. REQ-2: Event Granularity

Requirement:
The system SHOULD support event notifications that include sufficient context to identify the scope of affected communication entities, such as links, paths, nodes, or communication groups, when such information is available.¶

Discussion:
Fine-grained event context enables runtime systems to localize performance issues and apply targeted mitigation strategies, reducing unnecessary impact on unaffected workloads.¶
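
As an illustration of how event scope can support targeted mitigation, the sketch below maps the links named in an event to the communication groups that use them. The `EventScope` type, the group names, and the link names are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class EventScope:
    """Hypothetical scope carried by an event: which links and nodes are affected."""
    links: FrozenSet[str] = frozenset()
    nodes: FrozenSet[str] = frozenset()


def affected_groups(scope: EventScope, group_links: Dict[str, Set[str]]) -> List[str]:
    """Return the communication groups whose links intersect the event scope."""
    return sorted(name for name, links in group_links.items() if links & scope.links)


# Hypothetical mapping from communication group to the links it traverses.
group_links = {
    "job-a/allreduce-0": {"leaf1-spine1", "leaf2-spine1"},
    "job-b/allgather-3": {"leaf3-spine2"},
}
scope = EventScope(links=frozenset({"leaf2-spine1"}))
hit = affected_groups(scope, group_links)
```

Only `job-a/allreduce-0` is selected here, so a runtime could react for that group and leave `job-b/allgather-3` untouched.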

5.3. REQ-3: Rich Event Semantics

Requirement:
The system SHOULD support extensible event metadata that describes the operational significance of network events in a machine-readable format. The event representation SHOULD be independent of vendor-specific interpretations.¶

Discussion:
AI runtime systems require semantic context beyond raw network state to determine whether adaptation actions are necessary.¶
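
One possible shape for a machine-readable event is sketched below, using JSON as an example encoding. This document does not define an event format; the `NetworkEvent` fields, the severity values, and the `collective_impact` metadata key are hypothetical and serve only to show vendor-neutral core fields alongside free-form metadata.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class NetworkEvent:
    """Hypothetical event record: core fields plus extensible metadata."""
    event_type: str
    severity: str
    scope: Dict[str, Any]
    metadata: Dict[str, Any] = field(default_factory=dict)


def encode(event: NetworkEvent) -> str:
    """Serialize to JSON with sorted keys so equal events encode identically."""
    return json.dumps(asdict(event), sort_keys=True)


def decode(text: str) -> NetworkEvent:
    """Rebuild the event from its JSON form."""
    return NetworkEvent(**json.loads(text))


event = NetworkEvent(
    event_type="congestion",
    severity="major",
    scope={"links": ["leaf1-spine2"]},
    metadata={"collective_impact": "likely"},  # assumed semantic hint
)
round_trip = decode(encode(event))
```

The metadata field is where operational significance, such as whether collective communication is likely to be affected, could be expressed without changing the core fields.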

5.4. REQ-4: Cross-layer Coordination

Requirement:
The system SHOULD enable coordination between network infrastructure, communication libraries, runtime systems, and scheduling components through standardized event dissemination interfaces, without requiring tight coupling between these layers.¶

Discussion:
AI workload optimization increasingly depends on coordinated actions across multiple system layers, requiring consistent visibility of network events.¶

5.5. REQ-5: Interoperability

Requirement:
The system SHOULD define a standardized representation of network events that can be interpreted consistently across heterogeneous AI Fabric deployments.¶

Discussion:
Heterogeneous hardware and multi-vendor environments require consistent event interpretation to ensure portable workload behavior.¶

5.6. REQ-6: Scalability

Requirement:
The system SHOULD support event dissemination in AI Fabric environments with large-scale deployments (e.g., thousands of compute nodes) without introducing disproportionate communication or processing overhead.¶

Discussion:
AI clusters are expected to continue scaling in size and complexity, requiring notification mechanisms that remain efficient under increasing event volume and node count.¶

5.7. REQ-7: Reliability

Requirement:
The system SHOULD ensure reliable delivery of critical network event notifications, minimizing loss or inconsistent delivery when such events may affect workload correctness or performance.¶

Discussion:
Reliable event dissemination improves the consistency of distributed decision-making in AI workloads.¶
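
One common way for a consumer to notice lost notifications is a per-producer sequence number. The sketch below assumes, hypothetically, that each producer numbers its events consecutively; this document does not require that approach, and the `SequenceTracker` class is illustrative only.

```python
from typing import Dict, List


class SequenceTracker:
    """Detect gaps in per-producer sequence numbers (assumed to be consecutive)."""

    def __init__(self) -> None:
        self._last: Dict[str, int] = {}

    def observe(self, producer: str, seq: int) -> List[int]:
        """Record an event and return the sequence numbers skipped before it."""
        last = self._last.get(producer)
        if last is not None and seq <= last:
            return []  # duplicate or late event: nothing newly missing
        missing = [] if last is None else list(range(last + 1, seq))
        self._last[producer] = seq
        return missing


tracker = SequenceTracker()
gaps = [
    tracker.observe("switch-1", 1),
    tracker.observe("switch-1", 2),
    tracker.observe("switch-1", 5),  # events 3 and 4 never arrived
    tracker.observe("switch-1", 5),  # duplicate delivery
]
```

A consumer that detects a gap could request retransmission or fall back to a full state query, depending on how critical the missing events are.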

5.8. REQ-8: Security

Requirement:
The system SHOULD ensure the authenticity, integrity, and controlled delivery of network event notifications, while maintaining acceptable notification latency.¶

Discussion:
Event notifications may directly influence scheduling and runtime behavior, requiring protection against unauthorized modification or injection.¶
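
As an illustration of authenticity and integrity protection, the sketch below attaches an HMAC-SHA256 tag to an event using Python's standard `hmac` module. It assumes a pre-shared key between producer and consumer; key distribution, replay protection, and access control are outside this sketch, and the event fields are hypothetical.

```python
import hashlib
import hmac
import json


def sign(event: dict, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over a canonical JSON encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(event: dict, tag: str, key: bytes) -> bool:
    """Check the tag in constant time to avoid leaking comparison timing."""
    return hmac.compare_digest(sign(event, key), tag)


key = b"example-shared-key"  # assumption: provisioned out of band
event = {"event_type": "link-down", "link": "leaf1-spine2"}
tag = sign(event, key)

tampered = dict(event, link="leaf9-spine9")
authentic_ok = verify(event, tag, key)
tampered_ok = verify(tampered, tag, key)
```

A symmetric tag of this kind is cheap to compute, which matters for the latency constraint in the requirement, but it does not by itself identify which holder of the key produced the event.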

5.9. REQ-9: Extensibility

Requirement:
The system SHOULD support extensible event types and metadata fields to accommodate future AI networking technologies and deployment models without requiring changes to the core notification mechanism.¶

Discussion:
AI networking systems are rapidly evolving, requiring forward-compatible event representation mechanisms.¶
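
Forward compatibility is often achieved by having consumers process the fields they understand and preserve, rather than reject, the ones they do not. The sketch below illustrates that pattern; the set of required fields and the `rdma_qp_hint` extension are hypothetical names chosen for the example.

```python
from typing import Dict, Tuple

REQUIRED_FIELDS = ("event_type", "severity", "scope")  # assumed core field set


def parse_event(raw: Dict[str, object]) -> Tuple[Dict[str, object], Dict[str, object]]:
    """Split an event into core fields and unrecognized extension fields."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    core = {name: raw[name] for name in REQUIRED_FIELDS}
    extensions = {k: v for k, v in raw.items() if k not in REQUIRED_FIELDS}
    return core, extensions


# An event from a newer producer that carries a field this consumer does not know.
raw = {
    "event_type": "congestion",
    "severity": "minor",
    "scope": {"links": ["leaf1-spine2"]},
    "rdma_qp_hint": 42,  # hypothetical future extension
}
core, extensions = parse_event(raw)
```

The older consumer still acts on the core fields, while the extension remains available to be logged or forwarded.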

6. Architecture Considerations

This section describes the high-level architecture of an AI-oriented fast network event notification system for AI Fabrics. It defines a conceptual model of system entities, their interactions, and a layered organization for event-driven notification across distributed AI environments. This section does not specify protocol mechanisms or implementation details.¶

6.1. Architectural Model

The system consists of three logical roles: event producers, event consumers, and an AI scheduling and coordination function.¶

Event Producers are entities that generate network event notifications based on observed changes in network state. They operate at the data plane or host network layer and are responsible for detecting link status changes, congestion signals, and communication anomalies, and emitting corresponding event notifications.¶

Event Consumers are entities that receive and process network event notifications in the context of AI workload execution. They may reside in runtime systems, host agents, or cluster control components, and are responsible for deriving workload-level actions from network events.¶

The AI Scheduling and Coordination Function aggregates events from multiple producers, correlates events with workload metadata, and performs scheduling or control decisions based on global system state.

6.2. Interaction Model

The system follows an event-driven push-based interaction model.¶

Event producers generate notifications upon detection of predefined network conditions. Events are delivered asynchronously to subscribed consumers without requiring polling or explicit query operations.¶

Event delivery MAY be differentiated based on event priority or severity.¶

Consumers process received events by associating them with affected workloads and executing corresponding adaptation actions.¶
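
Priority-differentiated delivery can be sketched with a priority queue, as below. The numeric severity convention (lower value means more urgent) and the event strings are assumptions for illustration; a counter preserves arrival order among events of equal severity.

```python
import heapq
import itertools
from typing import Iterator, List, Tuple


class PriorityNotifier:
    """Hypothetical queue that delivers more urgent events first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = itertools.count()  # tie-breaker: FIFO within a severity

    def enqueue(self, severity: int, event: str) -> None:
        """Queue an event; lower severity number means more urgent (assumed)."""
        heapq.heappush(self._heap, (severity, next(self._arrival), event))

    def drain(self) -> Iterator[str]:
        """Yield queued events, most urgent first."""
        while self._heap:
            _, _, event = heapq.heappop(self._heap)
            yield event


queue = PriorityNotifier()
queue.enqueue(2, "latency-variation on path A")
queue.enqueue(0, "link-down leaf1-spine2")
queue.enqueue(1, "congestion on spine2")
queue.enqueue(0, "node-failure gpu-17")
delivered = list(queue.drain())
```

Here both failure events overtake the earlier latency-variation event, and the two failures keep their relative arrival order.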

6.3. Layered Architecture

The architecture is organized into three layers:¶

Data Plane Event Detection Layer
Responsible for observing network and host communication state and generating event signals.¶

Event Transport and Notification Layer
Responsible for event encapsulation, prioritization, and delivery across system components.¶

Application Semantic and Control Layer
Responsible for interpreting network events in the context of AI workloads and performing control decisions.¶
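
The separation of the three layers can be sketched as three small functions, each consuming the output of the one below it. Everything in this sketch is hypothetical, including the 80 percent queue-depth threshold, the priority values, the action names, and the link-to-job mapping; it is meant only to show where detection, transport, and interpretation responsibilities sit.

```python
from typing import Dict, List, Optional, Tuple


def detect(sample: dict) -> Optional[dict]:
    """Data Plane Event Detection Layer: turn an observation into an event signal."""
    if not sample["link_up"]:
        return {"type": "link-down", "link": sample["link"]}
    if sample["queue_depth_pct"] >= 80:  # hypothetical congestion threshold
        return {"type": "congestion", "link": sample["link"]}
    return None  # nothing significant to report


def encapsulate(signal: dict) -> dict:
    """Event Transport and Notification Layer: wrap the signal and assign a priority."""
    priority = 0 if signal["type"] == "link-down" else 1
    return {"priority": priority, "payload": signal}


def interpret(notification: dict, link_to_jobs: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Application Semantic and Control Layer: map the event to per-job actions."""
    payload = notification["payload"]
    action = "recover" if payload["type"] == "link-down" else "adjust-communication"
    return [(job, action) for job in link_to_jobs.get(payload["link"], [])]


samples = [
    {"link": "leaf1-spine2", "link_up": True, "queue_depth_pct": 92},
    {"link": "leaf3-spine1", "link_up": True, "queue_depth_pct": 15},
    {"link": "leaf4-spine1", "link_up": False, "queue_depth_pct": 0},
]
link_to_jobs = {"leaf1-spine2": ["job-a"], "leaf4-spine1": ["job-b", "job-c"]}

actions: List[Tuple[str, str]] = []
for sample in samples:
    signal = detect(sample)
    if signal is not None:
        actions.extend(interpret(encapsulate(signal), link_to_jobs))
```

Only the top layer knows about jobs, and only the bottom layer knows about queue depth, which is the decoupling the layered organization aims for.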

6.4. Key Design Trade-offs

The architecture incorporates the following trade-offs:¶

Latency vs. overhead: Notification latency and event granularity are balanced against system overhead through filtering and aggregation mechanisms.
Generality vs. specialization: The model supports general event notification while allowing extension for AI-specific semantics.¶
Distributed detection vs. centralized coordination: Event detection is distributed, while coordination is logically centralized.¶
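
The latency versus overhead trade-off can be illustrated with a simple aggregation step that coalesces repeated events for the same target within a time window. The window length, the event tuples, and the `coalesce` function are assumptions for this sketch; timestamps are integer microseconds to keep the arithmetic exact.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, str]  # (timestamp_us, event_type, target)


def coalesce(events: List[Event], window_us: int) -> List[dict]:
    """Merge events with the same (type, target) that fall within window_us
    of the first event in their group; later ones start a new group."""
    kept: List[dict] = []
    open_group: Dict[Tuple[str, str], int] = {}  # key -> index into kept
    for ts, etype, target in events:
        key = (etype, target)
        idx = open_group.get(key)
        if idx is not None and ts - kept[idx]["first_ts"] <= window_us:
            kept[idx]["count"] += 1
        else:
            open_group[key] = len(kept)
            kept.append({"first_ts": ts, "type": etype, "target": target, "count": 1})
    return kept


events = [
    (0, "congestion", "spine2"),
    (2_000, "congestion", "spine2"),
    (3_000, "link-down", "leaf1"),
    (4_000, "congestion", "spine2"),
    (50_000, "congestion", "spine2"),  # outside the window: reported again
]
summary = coalesce(events, window_us=10_000)
```

Five raw events become three notifications. A longer window would reduce notification volume further, at the cost of updating consumers less often about ongoing conditions.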

6.5. Summary

This architecture defines a layered, event-driven model for fast network event notification in AI Fabrics. It separates event generation, transport, and semantic interpretation, enabling scalable coordination between network infrastructure and AI workload execution.¶