| Internet-Draft | Congestion-Aware Flow Switching | February 2026 |
| Li, et al. | Expires 1 September 2026 | [Page] |
This document defines a congestion-aware adaptive flow table switching mechanism for Equal-Cost Multi-Path (ECMP) routing. The mechanism periodically assesses the congestion state of egress ports and progressively adjusts flow table mappings based on quantified congestion levels. This addresses the port congestion issues that occur in traditional ECMP load balancing when traffic patterns change suddenly or multicast traffic is present, while maintaining packet ordering within flows.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 1 September 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Equal-Cost Multi-Path (ECMP) routing is a widely deployed load balancing technology in data center networks [RFC2991]. Traditional ECMP distributes traffic across multiple equal-cost paths by hashing packet header fields, typically the five-tuple. To ensure packet ordering within a flow, the mapping between a flow and its egress port typically remains unchanged throughout the flow's lifetime.¶
However, this static mapping approach exhibits significant limitations in the following scenarios:¶
Traffic Surge Scenario: Network traffic is highly dynamic, and the load on certain ports may increase suddenly. Because the flow-to-port mapping is static, it cannot be adjusted in time to alleviate the resulting congestion.¶
Multicast Traffic Scenario: The replication characteristics of multicast traffic may cause it to concentrate on a small number of ports, exacerbating load imbalance.¶
Existing congestion response strategies typically adopt two extreme approaches: either no switching (maintaining the original mapping until flow aging) or full switching (simultaneously migrating all flows on a congested port). The former cannot respond to congestion in a timely manner, while the latter may cause congestion transfer and resource fluctuations.¶
This document defines a congestion-aware adaptive flow table switching mechanism that quantifies port congestion levels and progressively adjusts flow table mappings to achieve dynamic optimization of load balancing while preserving packet ordering.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Traditional ECMP load balancing uses static hash mapping. Once a flow is assigned to a port, the mapping remains unchanged throughout the flow's lifetime. This design has the following deficiencies:¶
Delayed Response: When a port becomes congested, flows already mapped to that port cannot be migrated in time, causing congestion to persist.¶
Load Imbalance: The randomness of traffic and the presence of elephant flows may cause severe load imbalance between ports.¶
Flowlet Switching: This mechanism switches paths based on inter-packet gaps within a flow and relies on manually configured time thresholds. If the threshold is too large, flowlet switching degrades to traditional ECMP; if it is too small, packets may be reordered.¶
Full-Switch Strategy: Migrating all relevant flows simultaneously when congestion is detected may cause the target port to be instantly overloaded, resulting in congestion transfer.¶
A mechanism is needed that can:¶
This mechanism defines two core functional components:¶
Port Congestion Assessment: Periodically assesses the congestion state of each egress port and generates a Congestion Quantification Index (CQI).¶
Adaptive Flow Table Migration: Progressively migrates flow table entries from congested ports to less loaded ports based on the CQI value.¶
The fundamental design principle is that the higher the CQI value, the more flow table entries are allowed to migrate in the current assessment interval. For each entry migrated, the CQI is decremented by 1 until the CQI reaches zero or no more entries need migration.¶
Implementations MUST support a configurable assessment interval. The RECOMMENDED default value is between 10 ms and 100 ms.¶
Implementations MAY adaptively adjust the assessment interval based on overall traffic levels: shortening the interval during high traffic to improve responsiveness, and lengthening it during low traffic to reduce overhead.¶
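One possible way to realize the optional adaptive interval is a linear mapping from the overall traffic level onto the recommended range. The following is a minimal sketch only: the function name `next_interval_ms`, the `utilization` input (an estimate of overall traffic level between 0.0 and 1.0), and the linear mapping itself are assumptions made for illustration and are not defined by this document.

```python
MIN_INTERVAL_MS = 10.0   # lower bound of the recommended range
MAX_INTERVAL_MS = 100.0  # upper bound of the recommended range


def next_interval_ms(utilization: float) -> float:
    """Map an overall traffic level in [0.0, 1.0] to an assessment interval.

    High traffic yields a short interval (faster reaction); low traffic
    yields a long interval (less overhead). The linear mapping is an
    assumption of this sketch, not a requirement.
    """
    u = min(1.0, max(0.0, utilization))  # clamp out-of-range inputs
    return MAX_INTERVAL_MS - u * (MAX_INTERVAL_MS - MIN_INTERVAL_MS)
```

An implementation could call such a function once per interval and use the result as the duration of the next interval.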
CQI calculation SHOULD be based on one or more of the following metrics: port egress queue depth, port buffer utilization, and port packet drop counter increment.¶
The CQI is an integer in the range 0 to CQI_MAX. The RECOMMENDED value for CQI_MAX is 16.¶
The recommended CQI calculation method is:¶
CQI = min(CQI_MAX, floor(queue_depth / congestion_threshold))¶
where congestion_threshold is the congestion determination threshold, RECOMMENDED to be 10% of queue capacity.¶
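The calculation above can be illustrated with a short sketch. The function name `compute_cqi`, the `threshold_percent` parameter, and the use of integer arithmetic are assumptions for illustration; queue depth and queue capacity are assumed to be expressed in the same unit (for example, bytes or cells).

```python
CQI_MAX = 16  # recommended maximum from this document


def compute_cqi(queue_depth: int, queue_capacity: int,
                threshold_percent: int = 10) -> int:
    """Return min(CQI_MAX, floor(queue_depth / congestion_threshold)).

    congestion_threshold is derived as a percentage of queue capacity;
    10 percent is the recommended value.
    """
    congestion_threshold = queue_capacity * threshold_percent // 100
    if congestion_threshold <= 0:
        raise ValueError("congestion threshold must be positive")
    return min(CQI_MAX, queue_depth // congestion_threshold)
```

For a queue of capacity 1000 and the recommended 10% threshold, a depth of 350 gives a CQI of 3 and a depth of 50 gives a CQI of 0. Under these assumptions a completely full queue gives a quotient of 10, so the CQI_MAX cap of 16 only takes effect when a smaller threshold is configured.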
At the end of each assessment interval, the Port Congestion Assessment component MUST synchronize each port's CQI value to the Adaptive Flow Table Migration component.¶
When a packet arrives, implementations MUST process it according to the following rules:¶
Rule 1 (No Flow Table Entry): If no flow table entry exists for the flow, perform normal flow table learning and select the port with the lightest current load.¶
Rule 2 (Port Failure): If a flow table entry exists but its egress port is unavailable, a new port MUST be selected.¶
Rule 3 (No Congestion): If a flow table entry exists and its egress port's CQI is 0, the implementation MUST continue using the current port and MUST NOT perform migration.¶
Rule 4 (Congestion Exists): If a flow table entry exists and its egress port's CQI is greater than 0, the implementation SHOULD perform flow table migration.¶
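The four rules amount to a decision made once per arriving packet. A minimal sketch of that decision is shown below; the `Action` enumeration, the `classify` function, and the dictionary-based port tables are hypothetical names introduced only for this illustration.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    LEARN = "learn"        # Rule 1: no entry, learn and pick a port
    RESELECT = "reselect"  # Rule 2: entry exists, port unavailable
    KEEP = "keep"          # Rule 3: entry exists, CQI is 0
    MIGRATE = "migrate"    # Rule 4: entry exists, CQI greater than 0


def classify(entry_port: Optional[int],
             port_up: dict[int, bool],
             cqi: dict[int, int]) -> Action:
    """Decide which rule applies to a packet.

    entry_port is the egress port recorded for the packet's flow, or
    None when no entry exists for the flow.
    """
    if entry_port is None:
        return Action.LEARN
    if not port_up.get(entry_port, False):
        return Action.RESELECT
    if cqi.get(entry_port, 0) == 0:
        return Action.KEEP
    return Action.MIGRATE
```

The rules are checked in order, so a failed port is handled by Rule 2 regardless of its last recorded CQI.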
When migration is triggered, implementations MUST perform the following steps:¶
Step 1: Select the port with the smallest CQI from all available ports as the target. If multiple candidate ports have the same CQI, implementations MAY use random selection or round-robin.¶
Step 2: Update the flow table entry's egress port to the target port.¶
Step 3: Decrement the original port's CQI by 1.¶
A key property of this mechanism is that the number of migrations scales with the congestion level. When the CQI value is high, more flow table entries may be migrated within a single assessment interval; when the CQI value is low, fewer entries are migrated.¶
Implementations MUST ensure that within a single assessment interval, the number of flow table entries migrated from a port does not exceed that port's initial CQI value.¶
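The migration steps and the per-interval bound can be sketched as follows. This is an illustration, not a normative algorithm: the function name `migrate_flows`, the dictionary-based tables, the exclusion of the congested port from the candidate set, and the lowest-port-number tie-break are assumptions. The `cqi` dictionary is assumed to contain only available ports. The sketch also treats every entry on the congested port as if one of its packets arrived during the interval, whereas this document triggers migration on packet arrival.

```python
def migrate_flows(flow_table: dict[int, int],
                  cqi: dict[int, int],
                  src: int) -> int:
    """Migrate entries off port src until its CQI reaches 0.

    flow_table maps flow identifier -> egress port.
    cqi maps available port -> current CQI value.
    Returns the number of entries migrated, which never exceeds the
    CQI that src had when the function was called.
    """
    migrated = 0
    flows_on_src = [f for f, port in flow_table.items() if port == src]
    for flow_id in flows_on_src:
        if cqi[src] == 0:
            break  # migration budget for this interval is exhausted
        candidates = [p for p in cqi if p != src]
        if not candidates:
            break  # no alternative port is available
        # Step 1: smallest CQI wins; ties go to the lowest port number.
        target = min(candidates, key=lambda p: (cqi[p], p))
        # Step 2: repoint the entry at the target port.
        flow_table[flow_id] = target
        # Step 3: decrement the original port's CQI.
        cqi[src] -= 1
        migrated += 1
    return migrated
```

With four entries on a port whose CQI is 2, only two entries are moved in the interval; the remaining two stay in place until the CQI is recalculated.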
If the CQI does not drop to 0 within an assessment interval, subsequent assessment intervals will recalculate the CQI. If congestion persists, migration will continue; if congestion is alleviated, migration will decrease or stop.¶
A flow table entry MUST contain the following fields: flow identifier (obtained through hash calculation), egress port identifier, valid bit, and timestamp (for aging).¶
The port status table MUST contain the following fields: port identifier, port status (UP/DOWN), current CQI value, and queue depth.¶
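The two tables could be represented as shown in the following sketch. The class and field names, the use of CRC-32 over a five-tuple as the hash calculation, and the `aging_timeout` parameter are assumptions for illustration; this document specifies only which fields are required.

```python
import zlib
from dataclasses import dataclass
from enum import Enum


class PortStatus(Enum):
    UP = "UP"
    DOWN = "DOWN"


@dataclass
class FlowEntry:
    flow_id: int      # obtained through hash calculation
    egress_port: int  # egress port identifier
    valid: bool       # valid bit
    timestamp: float  # last-seen time, used for aging


@dataclass
class PortState:
    port_id: int
    status: PortStatus
    cqi: int          # current CQI value
    queue_depth: int


def flow_id_from_five_tuple(five_tuple: tuple[str, str, int, int, int]) -> int:
    """Hash (src, dst, protocol, src port, dst port) to a flow identifier.

    CRC-32 is an arbitrary choice for this sketch.
    """
    return zlib.crc32(repr(five_tuple).encode("utf-8"))


def is_expired(entry: FlowEntry, now: float, aging_timeout: float) -> bool:
    """Return True when the entry has been idle longer than aging_timeout."""
    return now - entry.timestamp > aging_timeout
```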
Implementations MUST perform the following at startup:¶
The packet processing flow is as follows:¶
This mechanism extends traditional ECMP with congestion awareness and adaptive migration capabilities. Implementations MAY overlay this mechanism on existing ECMP implementations.¶
This mechanism MAY be used in conjunction with flowlet switching. Flowlet switching uses inter-packet gaps within a flow to decide when to switch, while this mechanism uses port congestion state to trigger switching. The two approaches can complement each other.¶
This mechanism operates at the forwarding layer and is orthogonal to end-to-end congestion control mechanisms such as Explicit Congestion Notification (ECN) and Data Center Quantized Congestion Notification (DCQCN). Implementations SHOULD consider coordination with such congestion control mechanisms.¶
An attacker may forge traffic to induce frequent migrations, thereby consuming device resources.¶
Mitigation Measures: Implementations SHOULD set a maximum number of migrations per unit time. Implementations SHOULD use smoothing algorithms for CQI calculation to avoid overreaction to instantaneous fluctuations.¶
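Both mitigations can be illustrated with a short sketch. This document does not specify a particular smoothing algorithm or limit; the exponentially weighted moving average (EWMA), the `alpha` weight, and the class names `SmoothedDepth` and `MigrationBudget` are assumptions chosen for illustration.

```python
from typing import Optional


class SmoothedDepth:
    """EWMA of queue depth samples, to be used as input to the CQI."""

    def __init__(self, alpha: float = 0.25) -> None:
        self.alpha = alpha  # weight given to the newest sample
        self.value: Optional[float] = None

    def update(self, sample: float) -> float:
        if self.value is None:
            self.value = float(sample)  # first sample seeds the average
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value


class MigrationBudget:
    """Caps the number of migrations allowed per unit of time."""

    def __init__(self, max_per_window: int) -> None:
        self.max_per_window = max_per_window
        self.used = 0

    def try_consume(self) -> bool:
        """Return True and count one migration if budget remains."""
        if self.used >= self.max_per_window:
            return False
        self.used += 1
        return True

    def reset(self) -> None:
        """Call at the start of each time window."""
        self.used = 0
```

In such a design, a brief spike in queue depth raises the smoothed value only partially, and a migration is performed only when `try_consume()` returns True.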
CQI values and migration decisions may reveal network topology or traffic pattern information.¶
Mitigation Measures: Implementations MUST implement access control for related data. Inter-module communication SHOULD use security mechanisms.¶
Mitigation Measures: Implementations MUST ensure configuration parameter integrity. Implementations SHOULD log configuration changes.¶
This document does not require IANA to allocate any resources.¶