Network Working Group                                            R. Miao
Internet-Draft                                                      Meta
Intended status: Informational                                S. Anubolu
Expires: 6 January 2025                                    Broadcom Inc.
                                                                  R. Pan
                                                                     AMD
                                                                  J. Lee
                                                                  Google
                                                                B. Gafni
                                                             J. Tantsura
                                                                  NVIDIA
                                                             A. Alemania
                                                                   Intel
                                                           Y. Shpigelman
                                                                  NVIDIA
                                                             5 July 2024


                      Inband Telemetry for HPCC++
                     draft-miao-ccwg-hpcc-info-03

Abstract

   Congestion control (CC) is the key to achieving ultra-low latency,
   high bandwidth, and network stability in high-speed networks.
   However, existing high-speed CC schemes have inherent limitations
   that prevent them from reaching these goals.  In this document, we
   describe HPCC++ (High Precision Congestion Control), a new high-
   speed CC mechanism that achieves all three goals simultaneously.
   HPCC++ leverages inband telemetry to obtain precise link load
   information and controls traffic precisely.  By using inband,
   granular telemetry to address challenges such as delayed congestion
   signaling and overreaction to congestion signals, HPCC++ can quickly
   converge to utilize all the available bandwidth while avoiding
   congestion, and can maintain near-zero in-network queues for ultra-
   low latency.  HPCC++ is also fair and easy to deploy in hardware,
   implementable with commodity NICs and switches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

Miao, et al.             Expires 6 January 2025                 [Page 1]

Internet-Draft                   HPCC++                        July 2024

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 6 January 2025.
Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Inband telemetry padding at the network switches  . . . . .   4
     2.1.  Inband telemetry on IFA2.0  . . . . . . . . . . . . . .   4
     2.2.  Inband telemetry on IOAM  . . . . . . . . . . . . . . .   5
     2.3.  Inband telemetry on P4.org INT  . . . . . . . . . . . .   6
   3.  Inband telemetry on CSIG  . . . . . . . . . . . . . . . . .   7
     3.1.  How to use CSIG for HPCC++  . . . . . . . . . . . . . .   8
   4.  IANA Considerations . . . . . . . . . . . . . . . . . . . .   9
   5.  Acknowledgments . . . . . . . . . . . . . . . . . . . . . .   9
   6.  Contributors  . . . . . . . . . . . . . . . . . . . . . . .   9
   7.  Security Considerations  . . . . . . . . . . . . . . . . . .  10
   8.  Normative References  . . . . . . . . . . . . . . . . . . .  10
   9.  Informative References  . . . . . . . . . . . . . . . . . .  10
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . .  10

1.  Introduction

   The link speed in data center networks has grown from 1Gbps to
   100Gbps in the past decade, and this growth is continuing.  Ultra-
   low latency and high bandwidth, which are demanded by more and more
   applications, are two critical requirements in today's and future
   high-speed networks.
   Given that traditional software-based network stacks in hosts can
   no longer sustain the critical latency and bandwidth requirements
   described in [Zhu-SIGCOMM2015], offloading network stacks into
   hardware is an inevitable direction in high-speed networks.  As an
   example, large-scale networks with RDMA (remote direct memory
   access) often use hardware-offloading solutions.  In some cases,
   RDMA networks still face fundamental challenges in reconciling low
   latency, high bandwidth utilization, and high stability.

   This document describes a new congestion control mechanism, HPCC++
   (Enhanced High Precision Congestion Control), for large-scale,
   high-speed networks.  The key idea behind HPCC++ is to leverage the
   precise link load information signaled through inband telemetry to
   compute accurate flow rate updates.  Unlike existing approaches
   that often require a large number of iterations to find the proper
   flow rates, HPCC++ requires only one rate update step in most
   cases.  Using precise information from inband telemetry enables
   HPCC++ to address the limitations of current congestion control
   schemes.  First, HPCC++ senders can quickly ramp up flow rates for
   high utilization and ramp down flow rates for congestion avoidance.
   Second, HPCC++ senders can quickly adjust the flow rates to keep
   each link's output rate slightly lower than the link's capacity,
   preventing queues from building up while preserving high link
   utilization.  Finally, since sending rates are computed precisely
   based on direct measurements at switches, HPCC++ requires merely
   three independent parameters that are used to tune fairness and
   efficiency.

   HPCC++ is an enhanced version of [SIGCOMM-HPCC].  HPCC++ takes
   system constraints into account and aims to reduce the design
   overhead and further improve the performance.  A detailed
   specification of HPCC++ can be found in [draft-miao-ccwg-hpcc].
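   The one-step rate update mentioned above can be sketched as
   follows.  This is a simplified, illustrative rendering of the
   window update in [SIGCOMM-HPCC]; the constants ETA, T, and W_AI
   (the three tunable parameters) and the dictionary-based link
   records are assumptions of this sketch, not the normative algorithm
   of [draft-miao-ccwg-hpcc].

```python
# Simplified sketch of the HPCC++ one-step rate update, loosely
# following [SIGCOMM-HPCC].  The link-record layout is an assumption
# of this sketch.

ETA = 0.95    # target utilization
T = 10e-6     # congestion-free base RTT (seconds), network-wide parameter
W_AI = 800    # additive-increase amount (bytes), tunes fairness

def normalized_inflight(qlen, tx_rate, link_bw):
    """Per-link normalized inflight bytes: U = qlen/(B*T) + txRate/B."""
    return qlen / (link_bw * T) + tx_rate / link_bw

def update_window(w_prev, links):
    """One rate-update step, reacting to the most loaded link on the path."""
    u_max = max(normalized_inflight(l["qlen"], l["txRate"], l["B"])
                for l in links)
    # Multiplicatively steer toward ETA utilization, plus additive
    # increase.  A full implementation additionally bounds this
    # update to once per RTT.
    return w_prev * ETA / u_max + W_AI

# Example: a single 25 Gbps link at 80% utilization with an empty
# queue lets the sender ramp up in one step.
w = update_window(50000, [{"qlen": 0, "txRate": 20e9, "B": 25e9}])
```

   Note how a single telemetry snapshot per link suffices: no
   iterative probing is needed to find the new rate.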
   This document describes the architecture changes in switches and
   end-hosts to support the needed transmission of inband telemetry
   and its consumption, which improves the efficiency of handling
   network congestion.

2.  Inband telemetry padding at the network switches

   HPCC++ relies only on packets to share information across senders,
   receivers, and switches.  The switch should capture inband
   telemetry information that includes the link load (txBytes, qlen,
   ts) and the link spec (switch_ID, port_ID, B) at the egress port.
   Note that each switch should record all of this information in a
   single snapshot to achieve a precise link load estimate.  Inside a
   data center, the path length is often no more than 5 hops, so the
   overhead of the inband telemetry padding for HPCC++ is considered
   low.

   As long as the above requirement is met, HPCC++ is open to a
   variety of inband telemetry format standards, which are orthogonal
   to the HPCC++ algorithm.  Although this document does not mandate a
   particular inband telemetry header format or encapsulation, we
   provide concrete implementation specifications using standard
   inband telemetry protocols, including IFA [I-D.ietf-kumar-ippm-ifa],
   IETF IOAM [RFC9179], and P4.org INT [P4-INT].  In fact, the
   emerging inband telemetry protocols inform the evolution of a
   broader range of protocols and network functions; this document
   leverages that trend to propose architecture changes that support
   in-network functions, such as congestion control, with high
   efficiency.

2.1.  Inband telemetry on IFA2.0

   For more details, please refer to IFA [I-D.ietf-kumar-ippm-ifa].

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  lns  |               deviceID                |     rsvd      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Speed |     rsvd      |            rxTimestampSec             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          egressPort           |          ingressPort          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         rxTimeStampNs                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         residenceTime                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            txBytes                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     rsvd      |                 Queue Length                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             rsvd                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 1: Example IFA header

   Figure 1 shows the packet format of the INT metadata after the UDP
   and IFA metadata headers.  The field lns is the local namespace and
   defines the format of the metadata.  The field deviceID is a 20-bit
   field that uniquely identifies the device in the network.  The
   Speed field is an encoded field with the following encoding for
   port speed: 0 - 10G, 1 - 25G, 2 - 40G, 3 - 50G, 4 - 100G,
   5 - 200G, 6 - 400G.  The field cn is the congestion field and
   denotes whether the packet experienced congestion.

2.2.  Inband telemetry on IOAM

   IOAM is the technology adopted by the IETF for in-situ telemetry.
   For use with HPCC++, we discuss the IOAM trace option, which is
   part of the IOAM architecture.  IOAM trace supports both Pre-
   allocated and Incremental trace options, meaning that a node in the
   network may either write data into already-allocated space in the
   packet or append the data as an extension to the IOAM header,
   respectively.
   An IOAM data header has a modular design, where the data types
   written by a node are determined by the instruction list in the
   IOAM trace header.  For the full description of the IOAM header
   design, please refer to the IETF IOAM [RFC9179] specification.  In
   order to fulfill the requirements set by the HPCC++ architecture,
   we suggest using the following trace types:

   *  Hop_Lim and node_id Short

   *  Ingress_if_id and egress_if_id Short

   *  Queue Depth

   *  Timestamp Fraction: to be used as an egress timestamp rather
      than an ingress timestamp

   *  Transmitted Bytes

   Note that the Transmitted Bytes trace type is defined in
   [I-D.draft-gafni-ippm-ioam-additional-data-fields] as a suggested
   extension to [RFC9179].  When using the above trace types, the IOAM
   data header would be constructed as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Hop_Lim    |                    node_id                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         ingress_if_id         |         egress_if_id          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          queue depth                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      timestamp fraction                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           tx_bytes                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 Figure 2: Example of an IOAM data header

2.3.  Inband telemetry on P4.org INT

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Node ID (Nth hop)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Ingress Interface ID      |      Egress Interface ID      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Queue ID    |                Queue occupancy                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Egress timestamp                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Egress timestamp (cont'd)                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Egress interface Tx utilization                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Node ID (N-1th hop)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Node ID (1st hop)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Figure 3: Example P4.org INT v2.1 per-hop metadata header

   Figure 3 shows the per-hop metadata format of the P4.org INT-MD
   mode (following the INT v2.1 spec).  Each hop switch along the path
   adds its Node ID so that the sender can track the path and detect a
   path change event.  Upon detecting a path change, the sender throws
   away the existing status records of the flow and builds up new
   records.  Queue occupancy (24 bits) is the current buffer occupancy
   of the egress port and queue that the flow is going through.
   Egress timestamp (8 bytes) is used by the HPCC++ algorithm to
   compute interface utilization.  Since P4.org INT reports Egress TX
   utilization in-band, the Egress timestamp is optional rather than
   mandatory.  The HPCC++ algorithm today does not require the Ingress
   Interface ID; P4.org INT defines the Ingress and Egress Interface
   IDs as one metadata instruction.
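   As an illustration of how a sender consumes these fields, a link's
   transmit rate can be estimated from the deltas of txBytes and the
   egress timestamp carried in two consecutive per-hop records.  This
   is a hedged sketch: the record layout and field names (tx_bytes,
   egress_ts_ns) are assumptions of the example, not names from the
   INT specification.

```python
# Illustrative sketch: deriving an egress link's transmit rate from
# two consecutive per-hop telemetry records.  Field names are
# assumptions of this example.

def tx_rate_bps(prev, curr):
    """Estimate the egress link rate (bits/s) from the delta of the
    txBytes counter over the delta of the egress timestamps carried
    in back-to-back packets of the same flow."""
    dbytes = curr["tx_bytes"] - prev["tx_bytes"]
    dt_ns = curr["egress_ts_ns"] - prev["egress_ts_ns"]
    if dt_ns <= 0:
        return 0.0  # timestamp wrap or reordering: skip this sample
    return dbytes * 8 / (dt_ns * 1e-9)

prev = {"tx_bytes": 1_000_000, "egress_ts_ns": 100_000}
curr = {"tx_bytes": 1_250_000, "egress_ts_ns": 200_000}
# 250 KB transmitted in 100 us corresponds to 20 Gbps.
```

   Since both samples come from the same single-snapshot record at the
   switch, this estimate stays consistent with the queue length
   reported alongside it.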
   We keep the Ingress ID for future use.

3.  Inband telemetry on CSIG

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             TPID              |  T  |R|    S    |     LM      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   |0-15 |  TPID : IEEE allocated Tag Protocol ID for 4 Byte CSIG tag
   |16-18|  T    : Signal Type (0: min(ABW), 1: min(ABW/C), 2: max(PD))
   |19   |  R    : Reserved
   |20-24|  S    : Signal Value: bucketed (32 configurable buckets)
   |25-31|  LM   : Locator Metadata of bottleneck device / port

                  Figure 4: CSIG Compact Header Format

   CSIG (Congestion SIGnaling) [I-D.ietf-ravi-ippm-csig] is a compact
   inband telemetry solution that carries a fixed-size aggregate
   metric computed over the hop devices.  CSIG supports two formats:
   compact and expanded.  The expanded format is to be further defined
   in the IETF.  This section discusses how to support HPCC++ with the
   2B compact format shown in Figure 4.  Note: there is no fundamental
   difference between the 2B format and the extended 6B format, as
   both use a Type-Value format, carrying one signal in a given
   packet.

   The Locator Metadata (LM) field can encode the link capacity and/
   or the relative location of the bottleneck link, such as the Clos
   stage and link orientation (up or down), from which one may look up
   the speed of the link.

   The Type-Value format of the CSIG Signal allows the sender to
   choose one signal type of interest for each packet to collect over
   the hop devices.  The CSIG draft today suggests 3 signal types:

   *  Minimum Available Bandwidth - min(ABW)

   *  Maximum Link Utilization - min(ABW/B), where B is the link
      capacity

   *  Maximum Per-Hop Delay - max(PD)

3.1.  How to use CSIG for HPCC++

   In the HPCC++ algorithm, the per-hop normalized inflight bytes u'
   in line 5 is a sum of two terms u1, u2, as follows:

      u' = u1 + u2, where
        u1 = min(ack.L[i].qlen, L[i].qlen) / (ack.L[i].B * T)
        u2 = txRate / ack.L[i].B

                                Figure 5

   u1 is essentially qlen divided by the BDP, and u2 is the link
   utilization.  The min(current qlen, previous qlen) in u1 smooths
   out noise, and its benefit is marginal, as described in the HPCC++
   requirements section.

   Compared to INT/IFA/IOAM discussed above, which stack up metadata
   from all hop devices in each packet, CSIG carries only one signal
   (either u1 or u2) in a packet.  We argue that this is acceptable
   for lossy networks, as follows.  (The case for lossless networks
   will be discussed in a later revision.)

   In lossy networks not using PFC, queueing happens only when the
   link is 100% utilized.  It is easy to see that when u1 (qlen/BDP)
   is non-zero, u2 (link utilization) is 100%.  Likewise, when u2 is
   less than 100%, u1 is zero.  This mutual exclusiveness between u1
   and u2 by definition makes CSIG, which carries only one signal
   (either for u1 or u2) in a packet, a viable signaling mechanism for
   HPCC++.  The sender of HPCC++ with CSIG can decide which of the two
   signals to collect from each packet based on its protocol state.

   Note that because u1's qlen is an instantaneous value while u2's
   link utilization is calculated over a measurement time window, it
   is possible to momentarily have u1 > 0 and u2 < 100%.  In the case
   of HPCC++ with conventional INT/IFA/IOAM, the measurement window
   for link utilization is determined by the delta between the current
   timestamp and the previous timestamp observed by each connection,
   and the delta varies packet by packet.  HPCC++ sums up u1 and u2,
   ignoring such a temporary incongruity.  With CSIG, u2 is estimated
   by the switches, and the measurement window is a network-wide
   config parameter, e.g., a multiple of the network RTT.
   When a path changes, triggered either by routing changes or by end-
   host-driven path migration (such as PLB), two consecutive packets
   are assumed to go through the same path so that the algorithm
   receives its two inputs, u1 and u2, consistently.

   With CSIG, HPCC++ can obtain max(u') by collecting max(u1) or
   max(u2) per packet, as follows.

   max(u2): the for-loop from line 3 of HPCC++ looks for the maximum
   u' across hops, and if there is no queueing happening throughout
   the path, it returns max(u2).  max(u2) is equivalent to min(ABW/B)
   in CSIG; hence we can directly use min(ABW/B).

   max(u1): in computing max(u1), one straightforward way is to
   introduce a new signal type of max(qlen/B) in CSIG.  qlen/B can be
   interpreted as an 'expected' sojourn time for a packet at the tail
   of the queue.  HPCC++ assumes ~100% of the link capacity B is used
   for HPCC++ traffic, so max(qlen/B) is the maximum per-hop
   'expected' delay.  Note that T (the congestion-free RTT) in u1 is a
   network-wide config parameter in HPCC++, not specific to each hop.
   Once CSIG collects max(qlen/B), the end host can easily calculate
   u1 = max(qlen/B) / T for HPCC++.

   Another way is to use the max(PD) signal in CSIG to approximate
   max(qlen/B).  The reason this is an approximation is twofold:
   first, the per-hop delay in CSIG is the delay experienced by the
   signal-carrying packet itself, while qlen in HPCC++ is the length
   of the queue observed at the dequeue time of the CSIG-carrying
   packet; hence the effect of that queueing applies to the packet at
   the tail, not to the packet carrying the CSIG signal.  Second, B in
   qlen/B of HPCC++ is the whole link capacity, while the per-hop
   delay in CSIG reflects the per-queue drain rate, which is affected
   by other traffic classes depending on the queueing mechanism in use
   (strict priority or WDRR).

   The smoothing effect of min(current qlen, previous qlen) in u1 can
   be supported by CSIG as long as the bottleneck point does not
   change between the previous packet and the current packet.
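   Under the assumptions above (a lossy network, the hypothetical
   max(qlen/B) signal type, and the existing min(ABW/B) signal), the
   sender-side reconstruction of u' can be sketched as follows.  The
   function and argument names are assumptions of this sketch, not
   part of the CSIG or HPCC++ specifications.

```python
# Hedged sketch (not normative): reconstructing the HPCC++ input
# u' = u1 + u2 at the sender from CSIG-collected signals.

T = 10e-6  # congestion-free RTT, a network-wide HPCC++ config parameter

def u_prime(max_qlen_over_b, min_abw_over_b):
    """max_qlen_over_b: maximum per-hop 'expected' sojourn time qlen/B
    (seconds) collected over the path (hypothetical signal type);
    min_abw_over_b: minimum ABW/B over the path, i.e., one minus the
    maximum link utilization."""
    u1 = max_qlen_over_b / T    # queueing term, normalized by T
    u2 = 1.0 - min_abw_over_b   # utilization term
    return u1 + u2

# Per the mutual exclusiveness argued above, in lossy networks one of
# the two terms is usually ~0, so a single signal per packet suffices.
```

   A sender would typically alternate which signal type it requests
   based on its protocol state, as described above.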
4.  IANA Considerations

   This document makes no request of IANA.

5.  Acknowledgments

   The authors would like to thank RTGWG members for their valuable
   review comments and helpful input to this specification.

6.  Contributors

   The following individuals have contributed to the implementation
   and evaluation of the proposed scheme, and therefore have helped to
   validate and substantially improve this specification: Pedro Y.
   Segura, Roberto P. Cebrian, Robert Southworth, and Md Ashiqur
   Rahman.

7.  Security Considerations

   TBD

8.  Normative References

9.  Informative References

   [Zhu-SIGCOMM2015]
              Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn,
              M., Liron, Y., Padhye, J., Raindel, S., Yahia, M. H.,
              and M. Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM, London, United Kingdom,
              August 2015.

   [P4-INT]   "In-band Network Telemetry (INT) Dataplane
              Specification, v2.0", February 2020.

   [RFC9179]  "Data Fields for In Situ Operations, Administration, and
              Maintenance (IOAM)", May 2022.

   [I-D.draft-gafni-ippm-ioam-additional-data-fields]
              "Additional Data Fields for IOAM Trace Option Types",
              May 2021.

   [I-D.ietf-kumar-ippm-ifa]
              "Inband Flow Analyzer", February 2019.

   [draft-miao-ccwg-hpcc]
              Miao, R., "HPCC++: Enhanced High Precision Congestion
              Control", 2024.

   [I-D.ietf-ravi-ippm-csig]
              "Congestion Signaling (CSIG)", February 2024.

   [SIGCOMM-HPCC]
              Li, Y., Miao, R., Liu, H., Zhuang, Y., Feng, F., Tang,
              L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M., and M.
              Yu, "HPCC: High Precision Congestion Control", ACM
              SIGCOMM, Beijing, China, August 2019.

Authors' Addresses

   Rui Miao
   Meta
   1 Hacker Way
   Menlo Park, CA 94025
   United States of America
   Email: rmiao@meta.com

   Surendra Anubolu
   Broadcom, Inc.
   1320 Ridder Park
   San Jose, CA 95131
   United States of America
   Email: surendra.anubolu@broadcom.com

   Rong Pan
   AMD
   2485 Augustine Dr.
   Santa Clara, CA 95054
   United States of America
   Email: Rong.Pan@amd.com

   Jeongkeun Lee
   Google Headquarters
   1600 Amphitheatre Parkway
   Mountain View, CA 95043
   United States of America
   Email: leejk@google.com

   Barak Gafni
   NVIDIA
   350 Oakmead Parkway, Suite 100
   Sunnyvale, CA 94085
   United States of America
   Email: gbarak@NVIDIA.com

   Jeff Tantsura
   NVIDIA
   United States of America
   Email: jefftant.ietf@gmail.com

   Allister Alemania
   Intel
   2200 Mission College Blvd
   Santa Clara, 95952
   United States of America
   Email: allister.alemania@intel.com

   Yuval Shpigelman
   NVIDIA
   Haim Hazaz 3A
   Netanya 4247417
   Israel
   Email: yuvals@nvidia.com