| Internet-Draft | FANN Framework | June 2026 |
| Song & Dong | Expires 25 December 2026 | [Page] |
Many network applications, ranging from Artificial Intelligence (AI) / Machine Learning (ML) training and inference to large-scale cloud services, require networks with various combinations of high bandwidth, low delay, low jitter, and minimal packet loss. Meeting these requirements depends on the network's ability to adapt rapidly to faults, signal degradation, and congestion. The companion problem statement describes why existing mechanisms are too slow, too coarse, or too resource-intensive to react within the timescales at which modern forwarding hardware can detect and disseminate such conditions.¶
This document defines a framework for Fast Network Notifications (FANN). It describes a reference architecture, the functional roles involved in generating and consuming notifications, an information model, delivery and scoping models, procedures for discovery, registration, and subscription, and the integration of fast network notifications with existing Layer 2 to 4 mechanisms. This framework is intended to guide the development of one or more fast network notification protocol specifications.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 25 December 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Modern high-performance networks, in particular data center (DC) and data center interconnect (DCI) fabrics serving AI/ML and cloud workloads, demand rapid adaptation to changing network conditions. A single fiber link failure, signal degradation, or transient congestion event can stall a distributed training job, waste compute and energy, and degrade service experience [I-D.ietf-rtgwg-net-notif-ps].¶
Contemporary forwarding hardware can detect link failures, signal degradation reported as link errors, queue buildup, microbursts, and output-queue congestion at microsecond to sub-millisecond timescales. However, the time required to disseminate this information to the remote nodes that can act on it typically far exceeds the detection time. This gap between detection and reaction is the central problem that fast network notifications address.¶
The Fast Network Notifications Problem Statement [I-D.ietf-rtgwg-net-notif-ps] documents the need for a fast notification mechanism and the limitations of existing approaches. The companion requirements [I-D.geng-fantel-fantel-requirements] and gap analysis [I-D.geng-fantel-fantel-gap-analysis] documents elaborate the requirements and the deficiencies of current technologies. Building on these documents, this document defines a framework that describes the overall architecture, the functional roles, the information carried, how notifications are delivered and scoped, and how the mechanism integrates with existing protocols and technologies across layers.¶
This informational document does not define a wire protocol, encoding, or YANG model. Those are expected to be specified in separate protocol and management documents that build on this framework.¶
This framework applies to limited-domain networks under a single administrative control, consistent with the deployment assumptions of the FANN charter. It prioritizes the requirements of DC and DCI networks where rapid responsiveness is critical, while remaining applicable to other deployments such as wide-area backbone networks.¶
The framework initially targets notifications for link failures, signal degradation reported as link errors, and port queue congestion, while remaining extensible to additional conditions in the future. The specific actions a recipient takes in response to a notification (for example fast reroute, adaptive load balancing, or rate adjustment) are out of scope of this framework; they are the responsibility of the consuming subsystem and the protocols that realize those actions.¶
In this document, "fast" does not denote a single rigid numerical threshold. It characterizes a class of mechanisms designed to minimize notification delivery time so that the latency is on the order of microseconds to milliseconds, depending on the operational objective and the diameter of the notification domain, and is substantially shorter than the Round-Trip Time (RTT) of the affected traffic.¶
This framework is solution-agnostic. It defines the functional roles, information model, and delivery and scoping models that a fast network notification solution is expected to instantiate, but it does not specify, mandate, or endorse any particular protocol, encoding, or solution document. It is intentionally general so that a range of realization approaches can conform to it, potentially in combination, without conflicting with one another or with this framework. Consistent with the FANN charter, fast generation and consumption in the forwarding plane (ideally in hardware) is the primary design point and the means of meeting the latency targets described above; consumption by the control plane or management plane is a secondary objective, permitted only where it preserves routing stability and does not compromise forwarding-plane responsiveness. Specific solutions are developed in separate documents; such documents are expected to map their behavior onto the roles and models defined here, and any capability they require that is not yet covered is expected to be accommodated as an extension of this framework rather than a departure from it.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document uses the following terms.¶
The terms BFD [RFC5880], ECN [RFC3168], FRR [RFC4090] [RFC5714], and IOAM [RFC9197] are used as defined in their respective references.¶
This section defines the core of the framework: its design principles, the deployment scenarios it serves, the functional reference architecture, the information carried in notifications, and the models for delivering, scoping, discovering, and controlling them.¶
The framework is guided by the following principles, derived from the problem statement [I-D.ietf-rtgwg-net-notif-ps] and requirements [I-D.geng-fantel-fantel-requirements].¶
Fast network notifications apply across a range of network scenarios, but the time budget, processing constraints, and the mechanisms that are practical differ substantially between them. This framework does not assume one-size-fits-all: the scenarios below have materially different characteristics, and a given deployment is expected to select the delivery mode, scope, and realization mechanism appropriate to its scenario rather than apply a single mechanism everywhere.¶
The functional roles (Section 3.3), information model (Section 3.4), and delivery modes (Section 3.5) defined in this document are common across scenarios, but their realization and the achievable latency are not. Where a requirement (for example a sub-millisecond target) is stated, it SHOULD be understood as scoped to the scenario in which it is feasible.¶
The framework defines four functional roles. A single physical or virtual network element MAY implement more than one role.¶
+-------------------------------------------------+
|               Notification Domain               |
|                                                 |
|    [Detect]    [Distribute]         [Act]       |
|                                                 |
|  +----------+  +-----------+    +-----------+   |
|  |Originator|->|   Relay   |--->| Consumer  |   |
|  | (detect/ |  | (forward/ |    | (receive/ |   |
|  | generate)|  |  filter/  |    |  action)  |   |
|  +----+-----+  |   damp)   |    +-----+-----+   |
|       |        +-----+-----+          |         |
|       |              |                |         |
|       v              v                v         |
|  ........................................       |
|  :       Notification Controller        :       |
|  : (discovery / registration / policy / :       |
|  :         global optimization)         :       |
|  ........................................       |
+-------------------------------------------------+
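As an informal illustration of how three of the four roles hand a notification to one another, the following Python sketch models an originator, a relay, and a consumer. The class names mirror the roles in the figure, but every method name, the `suppressed` filter, and the event label are assumptions made for this example. The Notification Controller is omitted, and nothing here implies a wire format or a required implementation structure.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Notification:
    event: str   # illustrative label such as "link-failure"
    origin: str  # name of the originating node


@dataclass
class Consumer:
    name: str
    received: List[Notification] = field(default_factory=list)

    def act(self, n: Notification) -> None:
        # The action (reroute, rebalance, ...) is out of scope; only record it.
        self.received.append(n)


@dataclass
class Relay:
    consumers: List[Consumer]
    suppressed: Set[str] = field(default_factory=set)  # hypothetical filter

    def forward(self, n: Notification) -> int:
        # Filter first, then distribute; returns the number of deliveries.
        if n.event in self.suppressed:
            return 0
        for c in self.consumers:
            c.act(n)
        return len(self.consumers)


@dataclass
class Originator:
    name: str
    relay: Relay

    def detect(self, event: str) -> int:
        # Detection is assumed to have already happened; generate and hand off.
        return self.relay.forward(Notification(event, self.name))


if __name__ == "__main__":
    leaf2 = Consumer("leaf2")
    leaf1 = Originator("leaf1", Relay([leaf2]))
    leaf1.detect("link-failure")
    print(leaf2.received)
```

Because a single element may implement more than one role, a real node could embed the relay and consumer behavior in the same object; they are separated here only to keep the hand-off visible.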
A fast network notification proceeds through the following stages.¶
The mechanisms by which a node detects an event are out of scope of this framework, but the framework assumes their existence and depends on their characteristics. Two assumptions are important.¶
First, the end-to-end (E2E) responsiveness of a fast notification system is bounded by detection time as well as delivery time: a notification cannot be faster than the moment the originating node becomes aware of the condition. Detection latency, accuracy, and false-positive behavior therefore directly shape what the notification system can achieve, and an event that is detected slowly or unreliably limits the value of fast delivery.¶
Second, detection itself has a cost that interacts with scaling. For example, achieving fast liveness detection by running BFD [RFC5880] at very short transmit intervals consumes forwarding and control resources and does not by itself notify any node beyond the BFD endpoints. Driving detection intervals down to obtain faster notification can impose significant load, and this trade-off between detection speed and detection cost SHOULD be considered together with the notification load discussed in Section 4.4. Where hardware can detect a condition directly (for example loss of signal, FEC errors, or queue-occupancy thresholds), such direct detection is generally preferable to mechanisms that rely on periodic message exchange such as BFD. The relevant distinction is between hardware-based and protocol-session-based detection in terms of speed and overhead, rather than between polling and non-polling as such: a hardware mechanism may itself poll internally, but its detection latency and per-event cost are typically far lower than those of a protocol session driven to an aggressive interval.¶
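To make the trade-off concrete, the following sketch works through the arithmetic for a session-based detector. It assumes the simplified relation that detection time equals the transmit interval multiplied by the detect multiplier, and that each session sends one packet per interval in each direction. The function names, intervals, multiplier, and session count are illustrative assumptions, not figures taken from this framework or from any measurement.

```python
def detection_time_ms(tx_interval_ms: float, detect_mult: int) -> float:
    """Time to declare a session down under the simplified model."""
    return tx_interval_ms * detect_mult


def tx_packets_per_second(tx_interval_ms: float, sessions: int) -> float:
    """Packets this node transmits per second across all sessions (one direction)."""
    return sessions * 1000.0 / tx_interval_ms


if __name__ == "__main__":
    sessions = 1000  # assumed number of sessions on one node
    for interval_ms in (50.0, 10.0):
        print(
            f"interval={interval_ms} ms "
            f"detect={detection_time_ms(interval_ms, 3)} ms "
            f"load={tx_packets_per_second(interval_ms, sessions)} pps"
        )
```

Under these assumptions, shortening the interval by a factor of five shortens detection time by the same factor and multiplies the packet load by five, which is the scaling interaction described above.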
A fast network notification carries one or more information elements. For a given scenario some elements are mandatory and others optional; the framework does not require all elements in every notification. The detailed encoding is left to protocol specifications. The information elements are:¶
A consistent information model across implementations is necessary for interoperability; defining the normative model and encodings is a task for the protocol specification.¶
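The idea that an element can be mandatory in one scenario and optional in another can be sketched as a per-scenario profile check. Everything named below is hypothetical: the scenario labels, the element names, and the choice of which elements are mandatory are placeholders chosen for the example and are not the framework's information elements.

```python
from typing import Dict, FrozenSet, Mapping

# Hypothetical profiles: element names that must be present per scenario.
MANDATORY: Dict[str, FrozenSet[str]] = {
    "dc-fabric": frozenset({"event_type", "location"}),
    "wan-backbone": frozenset({"event_type", "location", "timestamp"}),
}


def missing_elements(scenario: str, notification: Mapping[str, object]) -> FrozenSet[str]:
    """Return the mandatory element names absent from the notification."""
    return frozenset(MANDATORY[scenario] - set(notification))


if __name__ == "__main__":
    n = {"event_type": "link-failure", "location": "leaf1:port3"}
    print(missing_elements("dc-fabric", n))
    print(missing_elements("wan-backbone", n))
```

A check of this kind would sit at the consumer or relay so that a notification lacking an element required by the local scenario can be rejected or counted rather than acted on.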
Depending on the position and number of consumers, the framework supports the following delivery modes. A scenario MAY use more than one.¶
Delivery MAY reuse existing messaging and transport mechanisms, or a new lightweight mechanism MAY be defined where existing ones cannot meet the latency or forwarding-plane processing targets. Regardless of the underlying transport, the delivery mechanism is responsible for timely delivery to the intended consumers and for bounding the load it introduces.¶
Because notifications are most valuable precisely when the network is under stress, the transport MUST support prioritization so that notifications are not delayed or dropped behind the very congestion they report. A notification that is queued behind the congested traffic loses most of its value. Prioritization can be realized using existing forwarding-plane mechanisms, including:¶
The chosen marking and per-hop behavior MUST be consistent across the notification domain so that priority is honored E2E within the domain. Operators MUST be able to configure the marking, and the markings used for notifications SHOULD be reserved so that ordinary traffic cannot claim the same priority and so that notification traffic itself cannot be abused to obtain preferential treatment (Section 5). Because notifications occupy a high-priority class, their volume MUST be bounded by the rate limiting, damping, and filtering of Section 3.7 to avoid starving other control traffic.¶
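The effect of prioritization can be illustrated with a toy strict-priority output port. This is a sketch under stated assumptions: the two traffic classes, their numeric values, and the `StrictPriorityPort` class are invented for the example, and real forwarding hardware schedules queues quite differently. It only shows that a notification enqueued behind a backlog of data is still served first.

```python
import heapq
from itertools import count
from typing import List, Tuple

NOTIFICATION, DATA = 0, 1  # illustrative classes; lower value is served first


class StrictPriorityPort:
    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = count()  # tie-breaker that preserves FIFO order within a class

    def enqueue(self, traffic_class: int, packet: str) -> None:
        heapq.heappush(self._heap, (traffic_class, next(self._seq), packet))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    port = StrictPriorityPort()
    for i in range(3):
        port.enqueue(DATA, f"data-{i}")        # congested backlog
    port.enqueue(NOTIFICATION, "congestion")   # arrives last
    print([port.dequeue() for _ in range(len(port))])
```

Strict priority is also why the volume bound matters: without a limit on the notification class, the data packets in this model would never be dequeued.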
Reliability requirements vary by scenario: some events warrant best-effort, low-latency delivery, while others (for example recovery state) may warrant acknowledgement or periodic refresh.¶
Fast network notifications are confined to a notification domain. The framework requires mechanisms to:¶
Domain scoping bounds the blast radius of both legitimate notification storms and malicious injection, and it aligns the trust boundary with the single administrative control assumed by the charter.¶
To deliver notifications only to interested and authorized consumers, the framework supports the following procedures. A deployment MAY use configuration, dynamic signaling, or a combination.¶
These procedures MAY be realized by reusing existing protocols where appropriate, or by new mechanisms defined in the protocol specification work.¶
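A minimal sketch of registration and subscription follows, assuming a static set of authorized consumers supplied by configuration. The `SubscriptionRegistry` class, its method names, and the consumer and event labels are hypothetical; the framework leaves the actual procedures to configuration, dynamic signaling, or both.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Set


class SubscriptionRegistry:
    def __init__(self, authorized: Iterable[str]) -> None:
        self._authorized: Set[str] = set(authorized)
        self._subs: DefaultDict[str, Set[str]] = defaultdict(set)

    def subscribe(self, consumer: str, event_type: str) -> bool:
        # Only authorized consumers may register interest.
        if consumer not in self._authorized:
            return False
        self._subs[event_type].add(consumer)
        return True

    def recipients(self, event_type: str) -> Set[str]:
        # Copy so callers cannot mutate registry state.
        return set(self._subs.get(event_type, ()))


if __name__ == "__main__":
    reg = SubscriptionRegistry(authorized={"leaf2", "spine1"})
    reg.subscribe("leaf2", "link-failure")
    reg.subscribe("intruder", "link-failure")  # rejected
    print(reg.recipients("link-failure"))
```

Resolving recipients per event type is what lets an originator or relay send only to interested consumers instead of flooding the whole domain.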
Because relays may forward notifications and consumers may relay further, the solution MUST provide for:¶
This section describes how the framework relates to existing technologies, the candidate mechanisms that could realize it, the applications it enables, and the scaling and operational considerations that apply when deploying it. It is informational and does not mandate any particular realization.¶
A central goal of the framework is integration with existing mechanisms across layers, as required by the charter. Fast network notifications are complementary to these mechanisms.¶
The interaction with each technology, including any required protocol extensions, is expected to be developed in the relevant IETF working groups.¶
This section surveys, non-normatively, classes of mechanism that could realize fast network notifications. It does not endorse a specific approach; the choice depends on the deployment scenario (Section 3.2), and a solution MAY combine more than one.¶
Each approach trades off latency, hardware dependence, protocol reuse, and impact on routing stability differently, and fits some scenarios in Section 3.2 better than others. Coordination when multiple recipients act on the same notification is out of scope and for further study.¶
This section sketches, non-normatively, applications that fast network notifications enable. The actions themselves are out of scope (Section 1.1); they illustrate what the information in Section 3.4 makes possible.¶
The solution must remain effective as the network grows. Scaling pressure arises from network size (the number of nodes and links that may report events), the volume and rate of change of reported information, and the number of consumers. The design assumption is that if anything can go wrong it will, so the system must cope with a high proportion of nodes and links reporting simultaneously.¶
The framework addresses scale through subscription (delivering only relevant information), scoping and domain isolation (bounding propagation), relay-based filtering and aggregation, damping of rapidly changing conditions, and transport prioritization and rate limiting. Protocol specifications SHOULD quantify the load their mechanisms place on the forwarding and control planes under worst-case event conditions.¶
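One common way to bound notification volume is a token bucket, sketched below. The framework does not select a rate-limiting algorithm, so the choice of token bucket, the `TokenBucket` class, and the rate and burst values are assumptions for illustration. Time is passed in explicitly by the caller (as non-decreasing seconds) so that the behavior is deterministic.

```python
class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate = rate_per_s      # sustained notifications per second
        self.burst = burst          # maximum back-to-back notifications
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(float(self.burst), self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_s=10.0, burst=2)
    # A flapping link generating four events at the same instant.
    print([bucket.allow(0.0) for _ in range(4)])
    # Half a second later the bucket has refilled.
    print(bucket.allow(0.5))
```

A limiter like this caps worst-case load regardless of how many events fire, which makes the load easier to quantify; suppressed events would normally be counted so that operators can observe the limiter at work.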
Fast network notifications introduce additional traffic. During the failures and congestion events they report, the notification system MUST NOT exacerbate the situation and SHOULD actively assist in mitigating it. Operators SHOULD be able to configure which event types trigger notifications, the delivery modes and scopes used, damping and rate-limiting parameters, and prioritization, so that notification behavior aligns with network operation policies.¶
Management and configuration of the solution are expected to be supported by YANG modules, to be defined as a separate deliverable consistent with the charter. Manageability includes observability of the notification system itself (counts, drops, damping events) so operators can verify it is helping rather than harming.¶
If not properly authenticated and rate-limited, fast network notifications could be a denial-of-service vector: an attacker that injects or floods spurious notifications could trigger unnecessary re-convergence, path changes, or repeated state updates, and could induce state flapping to keep an originator busy. Notifications may also reveal sensitive operational information, whether by inspection or by an adversary registering as a consumer.¶
Accordingly, solutions built on this framework MUST provide integrity protection and origin authentication of notifications, MUST apply rate controls on both sending and receiving, and MUST address trust boundaries around domains and subscriptions, authorization of notification sources, and protection of sensitive operational data. Because stronger security can add latency, the trade-off between notification latency and security strength is considered per scenario. Domain identification and isolation (Section 3.5) are central to confining notifications to the trusted administrative boundary.¶
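As one possible illustration of integrity protection and origin authentication, the sketch below tags each notification with a keyed hash and verifies it on receipt. The use of HMAC-SHA-256 with per-originator shared keys is an assumption made for the example, not a mechanism chosen by this framework; the key table, key value, and function names are hypothetical, and replay protection and key management are deliberately left out.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical per-originator shared keys; the value is a placeholder only.
KEYS: Dict[str, bytes] = {"leaf1": b"example-key-not-for-production"}


def _tag(key: bytes, origin: str, payload: bytes) -> bytes:
    # Bind the claimed origin into the authenticated data.
    return hmac.new(key, origin.encode() + b"|" + payload, hashlib.sha256).digest()


def sign(origin: str, payload: bytes) -> bytes:
    return _tag(KEYS[origin], origin, payload)


def verify(origin: str, payload: bytes, tag: bytes) -> bool:
    key = KEYS.get(origin)
    if key is None:
        return False  # unknown originator
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(_tag(key, origin, payload), tag)


if __name__ == "__main__":
    tag = sign("leaf1", b"link-failure")
    print(verify("leaf1", b"link-failure", tag))
    print(verify("leaf1", b"congestion", tag))
```

A symmetric tag of this kind is cheap to compute, which is one point on the latency versus security-strength trade-off noted above; stronger schemes would add per-notification cost.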
The charter's restriction to a single administrative control reduces, but does not eliminate, the threat surface. Because the operator controls every originator, relay, and consumer and the trust boundary coincides with the notification domain (Section 3.5), the boundary can drop notifications arriving from outside it (constraining external injection and spoofing), exposure of operational data to third parties is bounded, and trust for discovery, registration, and subscription can reuse existing intra-domain infrastructure. This lets the design favor lightweight, low-latency mechanisms internally while concentrating stronger enforcement at the domain boundary.¶
This assumption does not remove the need for in-domain protection: insider threats, compromised or malfunctioning nodes, and the self-inflicted denial of service of a flapping link all originate inside the boundary. The requirements above therefore still apply within the domain, and the single-administrative-control premise should be treated as defense in depth rather than a substitute for them.¶
This document requires no IANA actions.¶