Internet-Draft C-SCORE March 2026
Joung, et al. Expires 1 October 2026 [Page]
Workgroup:
DetNet Working Group
Internet-Draft:
draft-joung-detnet-stateless-fair-queuing-09
Published:
Intended Status:
Standards Track
Expires:
Authors:
J. Joung
Sangmyung University
J. Ryoo
ETRI
T. Cheung
ETRI
Y. Li
Huawei
P. Liu
China Mobile

Latency Guarantee with Stateless Fair Queuing

Abstract

This document specifies the implementation details for the framework specified in ITU-T Y.3129 [Y.3129] and ITU-T Y.3148 [Y.3148]. The framework guarantees end-to-end (E2E) latency bounds to flows. The schedulers in core nodes do not need to maintain flow states. Instead, the entrance node of a flow marks an ideal service completion time according to a fluid model, called Finish Time (FT), of a packet in the packet header. The subsequent core nodes update the FT by adding a delay factor, which is a function of the flow and the nodes. The packets in the queue of the scheduler are served in the ascending order of FT. This mechanism is called the stateless fair queuing. The result is that flows are isolated from each other almost perfectly. The latency bound of a flow depends only on the flow's intrinsic parameters such as the maximum burst size and the service rate, except the link capacities and the maximum packet length among other flows sharing each output link with the flow. This document specifies the metadata, formats of metadata, the admission control procedure, and an approximation of stateless fair queuing implemented via a strict priority (SP) scheduler.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 1 October 2026.

Table of Contents

1. Introduction

Recommendation ITU-T Y.3129 [Y.3129] specifies the requirements and framework for deterministic networking with a set of work conserving packet schedulers that guarantees end-to-end latency bounds to flows. The schedulers in core nodes do not need to maintain flow states. Instead, the entrance node of a flow marks an ideal service completion time according to a fluid model, called finish time (FT), of a packet in the packet header. The subsequent core nodes update FT by adding a delay factor, which is a function of the flow and upstream nodes. The packets in the queue of the scheduler are served in the ascending order of FT. This mechanism is called stateless fair queuing. The result is that flows are isolated from each other almost perfectly. The latency bound of a flow depends only on the flow's intrinsic parameters, except the maximum packet length among other flows sharing each output link with the flow.

Recommendation ITU-T Y.3148 [Y.3148] further specifies the functional architecture, functional entities, operational procedures, and packet metadata to fulfil the requirements and framework specified in Recommendation ITU-T Y.3129. It enables deterministic networking with a set of work conserving packet schedulers in the data plane, which guarantees end-to-end (E2E) latency bounds to flows.

In the framework specified in [Y.3129], the edge node through which a flow enters a network is called the entrance node. The entrance node for a flow generates FT for a packet and records it in the packet. A core node, based on these records, updates FT, without per-flow state, by adding a delay factor that is a function of parameters of the node and the flow. This framework is called work conserving stateless core fair queuing (C-SCORE) [C-SCORE]. C-SCORE is work conserving and has the property that, for a certain choice of the delay factor, the expression for E2E latency bound can be found. This E2E latency bound function is the same as that of a network with stateful fair queuing schedulers in all the nodes.

This document specifies the implementation details for realizing C-SCORE. This document specifies the metadata, formats of metadata, the admission control procedures, and an approximation of stateless fair queuing implemented via a strict priority (SP) scheduler.

A key component of C-SCORE is the packet state that is carried as metadata. C-SCORE does not need to maintain flow states at core nodes, yet it works as one of the fair queuing schedulers, which is known to provide the best flow isolation performance. The metadata to be carried in the packet header is simple and can be updated during the stay in the queue or before joining the queue.

2. Terminology

2.1. Terms Used in This Document

2.2. Abbreviations

3. Conventions Used in This Document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

4. Assumptions

In this document, we assume there are only two classes of traffic. The high priority or equivalently DetNet traffic requires guarantee on latency upper bounds. All the other traffic is considered to be the low priority or Best Effort (BE) traffic. High priority traffic is our only concern. However, packets of BE traffic cannot be preempted by the high priority traffic.

All the flows conform to their traffic specification (TSpec) parameters. In other words, with the maximum burst size Bi and the arrival rate ai, the accumulated arrival from flow i in any arbitrary time interval [t1, t2], t1 < t2, does not exceed Bi+(t2-t1)ai. An actual allocated service rate to a flow, ri, can be larger than or equal to the arrival rate of the flow. As it will be shown in (6) that, by adjusting the service rate to a flow, the E2E latency bound of the flow can be adjusted. Note that ri is used interchangeably with the symbol r to denote the service rate of a flow under observation. Total allocated service rate to all the flows in a node does not exceed the link capacity of the node. These assumptions make the resource reservation and the admission control mandatory.

A node, or equivalently a server, means an output port module of a switching device.

The entrance node for a flow is the node located at the edge of a network, from which the flow enters into the network. A core node for a flow is a node in the network, which is traversed by the flow and is not the entrance node. Note that a single node can be both an entrance node to a flow and a core node for another flow.

A packet is defined as arrived or serviced when its final bit is received by or transmitted from the node, respectively.

5. Framework

Fair queuing (FQ) schedulers utilize the concept of finish time (FT) that is used as the service order assigned to a packet. The packet with the minimum FT in a buffer is served first.

The key idea of the FQ is to calculate the ideal service completion times of packets in an imaginary fluid service model and use them as the service order in the real packet-based scheduler.

While having the excellent flow isolation property, FQ schedulers need to maintain the flow state, F(p-1). Upon the arrival of each packet, the system must perform flow identification and retrieve the FT associated with the preceding packet in that flow. The system is required to update the flow state to be F(p) concurrently with the packet's departure.

[Y.3129] and [Y.3148] specify a FQ scheduler, which does not need to maintain the flow state in core nodes. In the entrance node, the FTs are obtained with the following equation, where 0 denotes the entrance node of the flow under observation.

F0(p) = max{F0(p-1), A0(p)}+L(p)/r.    (1)

In a core node h, the FT of a packet is increased by an amount, d(h-1)(p), that depends on the previous node and the packet.

Fh(p) = F(h-1)(p) + d(h-1)(p).    (2)

The framework achieves a bounded E2E latency by defining dh(p) as follows.

 dh(p) = Lh/Rh + L/r + delta_h(p),        (3)

where Lh is the observed maximum packet length in the node h over all the flows, Rh is the link capacity of the node h, L is the maximum packet length of the flow, r is the service rate of the flow, and delta_h(p) is the function that represents the time difference between node h and h+1.

The time difference function of packet p, delta_h(p), is defined as the difference between the service completion time at node h and the arrival time at node h+1 of the packet. It includes the clock discrepancy and the propagation delay between nodes. Note that this function is relatively stable over packets, thus can be approximated as a constant value delta_h.

When dh(p) is decided by (3), then it is proven that

Dh(p) <= (B-L)/r + sum_(j=0)^h{Lj/Rj + L/r} + sum_(j=0)^(h-1){delta_j(p)},(4)

where Dh(p) is the latency experienced by p from the arrival at the node 0 to the departure from node h [KAUR], [C-SCORE]. B is the maximum burst size of the flow under observation that p belongs to. The term sum_(j=0)^(h-1){delta_j(p)} includes the clock discrepancies and the propagation delays between the nodes. Considering that the clock discrepancies can be neglected in an absolute time reference, this term effectively represents the E2E propagation delay.

Note that the latency bound in (4) is the same to the network where every node has a stateful FQ scheduler. The parameters in the latency bound are all intrinsic to the flow, except Lh/Rh and propagation delays.

6. Operational Procedures

6.1. Metadata

[Y.3129] and [Y.3148] specifies the metadata to be carried by a packet, which are Fh(p), L, and r. Fh(p) is dynamic and needs to be updated every hop, in accordance with (2). L and r are static values for a flow.

As a packet arrives at a core node h, it carries metadata Fh(p), L, and r. Fh(p) is pre-calculated at node h-1 with d(h-1)(p), which is a function of node h-1. dh(p) can be obtained by the summation of L/r, and the node specific parameters Lh/Rh and delta_h. Lh/Rh and delta_h values should be maintained by the node h. At node h, F(h+1)(p) value is calculated and replaces Fh(p).

The detailed operations regarding metadata creation and updates in the entrance node and core nodes are specified in [Y.3129] and [Y.3148]. The considerations for the mitigation of the complexity at the entrance node can also be found in these references.

6.3. Admission control for latency guarantee

The following is a recommended procedure for admission of a flow in the C-SCORE framework.

1)
An application requests a flow with T-Spec (Traffic Specification) and R-Spec (Requested Specification). T-Spec includes service rate r, maximum packet size L, and maximum burst size B. R-Spec includes the E2E latency and jitter bounds.
2)
The entrance node sends a PATH message toward the egress. This message contains the T-Spec and R-Spec.
3)
The entrance node acts as the admission controller. It maintains per-flow state and performs the initial shaping, if necessary.
4)
Core nodes check if the requested service rate r is within the aggregate bandwidth threshold of the outgoing interface.
5)
Each core node adds its per-hop max latency value (L/r + Lh/Rh) to a field in the PATH message. This allows the egress node to calculate the E2E latency bound.
6)
The egress node receives the PATH message. If the E2E latency bound meets the application's requirement, it generates a RESV message.
7)
The RESV message travels back to the entrance node. The core nodes in the path confirms the admission of the flow.
8)
If the RESV message reaches the entrance node, the flow is admitted.

In a C-SCORE architecture utilizing the RSVP-like admission process described above, it is not only possible but highly efficient to find out and negotiate the optimal available service rate for a flow. The admission process can probe the network to determine the maximum supportable rate without violating the deterministic latency bounds of existing flows. The discovery of the available rate occurs during the PATH message traversal. The following is an alternative admission process, which identifies the tightest capacity link in the path.

1)
An application requests a flow with T-Spec (Traffic Specification) and R-Spec (Requested Specification).
2
The entrance node sends a PATH message with a Desired Rate (r_desired) and a Minimum Acceptable Rate (r_min), along with T-Spec and R-Spec.
3)
Each core node on the path calculates its residual capacity (r_avail). This is the total capacity of the link minus the sum of the service rates (r) of all already admitted flows.
4)
As the PATH message moves hop-by-hop, it maintains a field called Path-Available-Rate (r_path). At each hop h, the router performs: r_path = min(r_path, r_avail_h), while the subscript h denotes the node.
5)
By the time the PATH message reaches the egress, r_path represents the maximum service rate the entire end-to-end path can support at that specific moment.
6)
The optimal rate is not always the highest possible rate. The egress node can select the optimal rate, which can be less than r_path.
7)
Once the optimal service rate (r = r_opt) is determined, the egress sends the RESV message back to the entrance node, carrying the r_opt value.
8)
The core nodes in the path confirms the r_opt value for the flow.
9)
If the RESV message reaches the entrance node, the flow is admitted.
10)
The entrance node commits this rate to its per-flow state.

The optimal rate should be between r_path and r_min. The egress node can select the optimal rate based on the following criteria:

-
Latency requirements: Higher service rates result in smaller Finish Time (FT) increments, reducing the per-hop queuing delay.
-
Buffer constraints: The rate must be balanced with the max burst (B) parameter to ensure the core nodes' buffers do not overflow. Note that the burst accumulates linearly with the service rate.
-
Network availability: The egress may choose a rate lower than r_path to leave room for other flows that can join later.

Since the core nodes only need to know the final committed rate r in the packet header, the complex discovery logic is restricted to the control plane and edge nodes. Large-scale networks can use this mechanism to perform the aggregate rate discovery, where the r_path represents the available capacity for an entire bundle of flows of the same path. By using this approach, the admission process also effectively maps the current congestion state of the network onto a single service rate parameter.

6.4. Compensation algorithm of time difference between nodes

There are time differences between nodes, including the clock discrepancies and the propagation delays. This time difference can be defined as the difference between the service completion time of a packet measured at the upstream node and the arrival time of the packet measured at the current node. In other words,

   delta_h(p) = A(h+1)(p) - Ch(p),

where delta_h(p) is the time difference between node h and h+1, and Ch(p) is the service completion time measured at node h, for packet p respectively.

FT does not need to be precise. It is used just to indicate the packet service order. Therefore, if we can assume that the propagation delay is constant and the clocks do not drift, then delta_h(p) can be simplified to a constant value, delta_h.

The time difference delta_h may be updated only once in a while.

The procedure for obtaining the departure time from node h, Ch(p), at node h+1 will be elaborated in a later version of this draft. The Ch(p) information can be obtained on demand, or be reported periodically. This information exchange can be based on a piggybacking mechanism with packet metadata.

Alternatively, the explicit messages can be exchanged. In this case the message format will follow that of Network Time Protocol (NTP), but the procedure will be simpler.

Note that the time difference between non-adjacent nodes can also be obtained similarly. This feature is useful when there are non-compliant nodes in between. In this case, however, the variable queuing delay from the non-compliant nodes should be taken into account. One possible solution is to sample the time difference values over an enough interval, and take the maximum value.

7. Characteristics

7.1. Taxonomy

The framework in this document, C-SCORE, is a flow level, rate based, work conserving, asynchronous, non-periodic, and in-time solution, according to the taxonomy suggested by [I-D.ietf-detnet-dataplane-taxonomy].

[I-D.ietf-detnet-dataplane-taxonomy] also defines seven suitable categories for deterministic networking. A category is defined to be a set of solutions that is put together by one or more criteria, where a criterion is a principle or standard by which a solution can be judged or decided to be put into a certain category.

C-SCORE belongs to the "flow level rate based unbounded category", which is one of the seven suitable categories, according to this categorization.

7.2. Strengths

C-SCORE's per hop latency dominant factor is the maximum packet length divided by the service rate of the flow. This is independent of other flows' parameters. As such, its most distinguishable strength is the flow isolation capability. It can assign a fine tuned E2E latency bound to a flow, by controlling the flow's own parameters such as the service rate. Once the latency bound is assigned to the flow, then it remains almost the same in spite of the network situation changes, such as other flows' join and leave.

It is work conserving, thus enjoys the statistical multiplexing gain without wasting bandwidth, which is the key to the Internet's success. The consequence is a smaller average latency. The observable maximum latency is also much smaller than the theoretical latency bound. Note that, with a work conserving solution, observing the theoretical latency bound is extremely difficult in real situations. It is because the worst latency is an outcome of a combination of multiple rare events, e.g. a maximum burst from a flow collides with the maximum bursts from all other flows at every node. In contrast, non-work conserving solutions make it common to observe their latency bounds.

It is rate based, thus the admission condition check process is simple, which is dependent only on the service rates of flows. This process aligns well with existing protocols.

Overall, C-SCORE suits large scale networks, at any utilization level, with various types of flows join and leave dynamically.

8. Approximate C-SCORE via strict priority schedulers

8.1. General description

C-SCORE requires a priority queue that sorts the packets in a queue, in accordance with their finish times. This may restrict the overall maximum throughput of a system, when compared to the strict priority (SP) schedulers used in the current practices of switching nodes. SP schedulers are usually composed of 8 to 32 queues, and schedule the packets of higher priority queue first whenever they are present. The packets in the same queue are served on a first in first out (FIFO) basis. It is common to find the SP schedulers in hardware chips for current switching nodes.

It would be desirable if C-SCORE can be implemented with such an SP scheduler. In this section, the architecture and algorithms for the Approximate C-SCORE with rotating SP schedulers are specified. The E2E latency bound of the network of the approximate C-SCOREs is also specified.

8.2. Transit node architecture

The architecture of an approximate C-SCORE transit node, in which the SP scheduler with a limited number of queues behaves as an approximate priority queue, is specified in this section.

The input port module classifies the deterministic flows and best effort (BE) flows. The deterministic flows are put into one of N queues that work as rotating SP scheduler queues. If the queue k, 0 <= k <= N-1, is the highest priority queue at time t, then the queue (k+N-1)(mod N) has the lowest priority at t. BE flows are put into its own queue in the output port module. BE queue is served only when there is no packet in queues 0 to N-1.

Assume that the time is divided into fixed-length slots. Let us say that a slot is allocated to a certain queue. Further, let T_i denote the terminal boundary of Slot i. An arriving packet p is assigned to Slot i if its finish time, F(p), falls within the interval (T_(i-1),T_i]. This mapping corresponds to a discrete set of hardware queues where packets are buffered and processed according to a First-In-First-Out (FIFO) discipline. The system utilizes a strict priority (SP) scheduler to arbitrate across these queues, granting precedence to those representing the earliest finish-time. The scheduler operates in a work-conserving manner, providing service at line rate whenever the system is backlogged.

8.3. Algorithms

To provide differentiated quality-of-service, the per-hop latency must be modulated according to the specific service rates of traversing flows. By reducing the scheduling granularity, or slot length (S), the system can provide tiered latency bounds based on the number of slots occupied by a flow's maximum virtual service interval (L/r). Specifically, for a flow f where (n-1)S < L/r <= nS, the per-hop latency is bounded by (n+1)S. This mechanism ensures fairness by granting lower latency to flows with higher service rates (smaller L/r), thereby aligning temporal performance with bandwidth allocation. The scheduling granularity, or slot duration S, is defined such that S >= min_p[L(p)/r(p)]. While a finer granularity generally minimizes end-to-end latency, reducing S below this lower bound, which is determined by the minimum packet transmission time relative to the flow rate, yields no additional scheduling benefit. In practical implementations, N hardware queues are allocated and managed in a cyclic manner to accommodate these slots. Consequently, the selection of S involves a fundamental trade-off: a coarser slot duration allows for a reduced number of physical queues, N, at the expense of decreased scheduling precision.

In an idealized preemptive system, the scheduler ensures that all packets mapped to a specific slot are fully serviced before its terminal boundary, T_i. Consequently, the service interval for any packet p is strictly contained within its assigned slot. However, in a realistic non-preemptive environment, the completion time is extended by a factor of L_h/R_h, where L_h is the maximum packet length and R_h is the link rate of the output link h, respectively. Thus, the service is guaranteed to complete no later than T_i + L_h/R_h.

The approximation follows the architecture of C-SCORE. In other words, equations (1) and (2) are still used. However (3) is replaced by the following equation (5).

   d_h(p) = (L_h^max)/R_h + (n_(f,h)+1)S_h + delta_h(p),    (5)

8.4. E2E latency bound of the approximate C-SCORE

The E2E latency of a network with the approximate C-SCORE schedulers is upper bounded by

   B/r + sum_(h=0)^H{(n_(f,h)+1)S_h + L_h/R_h} + sum_(h=0)^(H-1){delta_h(p)},

where S_h is the slot length at node h, n_(f,h) is an integer specific to flow f at node h, which meets (n_(f,h)-1)S_h < L_f/r <= n_(f,h)S_h. In other words, n_(f,h)=ceiling[L_f/(r*S_h)]. The term sum_(h=0)^(H-1){delta_h(p)} includes the clock discrepancies and the propagation delays between the nodes. Considering that the clock discrepancies can be neglected in an absolute time, this term actually represents the E2E propagation delay.

9. Considerations for non-compliant nodes

There can be non-compliant end nodes and relay nodes in the network. There can be naturally end nodes without necessary signalling capabilities for admission control. There also can be legacy nodes or the nodes with dataplane enhancement solutions other than C-SCORE. How these can be compensated, or how much these nodes degrade the performance of C-SCORE will be discussed in a later version.

10. Relationships to ITU-T Standards

This document is based on two ITU-T standards, Y.3129 [Y.3129] and Y.3148 [Y.3148].

Recommendation ITU-T Y.3129 specifies the requirements and framework for C-SCORE. The requirements are for generation, update and properties of FT. The framework describes a mechanism to guarantee E2E latency bounds, while meeting the requirements. It specifies how to obtain FT in core nodes, select the delay factor, and configure a network for latency guarantee.

Recommendation ITU-T Y.3148 specifies the functional architecture, functional entities, operational procedures, and packet metadata to fulfil the requirements and framework specified in Recommendation ITU-T Y.3129. The operational procedures are for the stateless FQ in relay nodes, including entrance node and core nodes with creation and update of FT values of packets.

This document bridges the gap between the high-level ITU-T standards and deployable protocols. It achieves this by defining the metadata structures and the corresponding header formats for IPv6 and MPLS. Furthermore, it specifies the procedures required for admission control and time-difference compensation, alongside the technical methodology for C-SCORE approximation.

11. IANA Considerations

There might be matters that require IANA considerations associated with metadata. If necessary, relevant text will be added in a later version.

12. Security Considerations

This section will be described later.

13. Acknowledgements

14. Contributor

15. References

15.1. Normative References

[I-D.ietf-mpls-mna-detnet]
Song, X., Mirsky, G., Varga, B., Gandhi, R., and Q. Xiong, "MPLS Network Action for Deterministic Networking", Work in Progress, Internet-Draft, draft-ietf-mpls-mna-detnet-00, , <https://datatracker.ietf.org/doc/html/draft-ietf-mpls-mna-detnet-00>.
[I-D.ietf-mpls-mna-ps-hdr]
Rajamanickam, J., Gandhi, R., Zigler, R., Dong, J., and J. Bhattacharya, "Post-Stack MPLS Network Action (MNA) Solution", Work in Progress, Internet-Draft, draft-ietf-mpls-mna-ps-hdr-07, , <https://datatracker.ietf.org/doc/html/draft-ietf-mpls-mna-ps-hdr-07>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, , <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, , <https://www.rfc-editor.org/info/rfc8174>.
[RFC8200]
Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", STD 86, RFC 8200, DOI 10.17487/RFC8200, , <https://www.rfc-editor.org/info/rfc8200>.
[Y.3129]
International Telecommunication Union, "Requirements and framework for stateless fair queuing in large scale networks including IMT-2020 and beyond", ITU-T Recommendation Y.3129, .
[Y.3148]
International Telecommunication Union, "Functional architecture for stateless fair queuing in large scale networks including IMT-2020 and beyond", ITU-T Recommendation Y.3148, .

15.2. Informative References

[C-SCORE]
Joung, J., Kwon, J., Ryoo, J., and T. Cheung, "Scalable flow isolation with work conserving stateless core fair queuing for deterministic networking", IEEE Access, vol. 11, pp. 105225 - 105247, doi:10.1109/ACCESS.2023.3318479, .
[I-D.ietf-detnet-dataplane-taxonomy]
Joung, J., Geng, X., Peng, S., and T. T. Eckert, "Dataplane Enhancement Taxonomy", Work in Progress, Internet-Draft, draft-ietf-detnet-dataplane-taxonomy-05, , <https://datatracker.ietf.org/doc/html/draft-ietf-detnet-dataplane-taxonomy-05>.
[KAUR]
Kaur, J. and H.M. Vin, "Core-stateless guaranteed rate scheduling algorithms", in Proc. INFOCOM, vol.3, pp. 1484-1492, .

Authors' Addresses

Jinoo Joung
Sangmyung University
Jeong-dong Ryoo
ETRI
Taesik Cheung
ETRI
Yizhou Li
Huawei
Peng Liu
China Mobile