Internet-Draft CATS Metric July 2024
Yao, et al. Expires 9 January 2025
Workgroup:
Computing-Aware Traffic Steering
Internet-Draft:
draft-ysl-cats-metric-definition-00
Published:
Intended Status:
Informational
Expires:
9 January 2025
Authors:
K. Yao
China Mobile
H. Shi, Ed.
Huawei Technologies
C. Li, Ed.
Huawei Technologies

CATS Metric Definition

Abstract

This document defines the computing metrics used in Computing-Aware Traffic Steering.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 9 January 2025.

1. Introduction

Many modern computing services are deployed in a distributed way. In this deployment mode, multiple service instances are deployed at multiple sites to provide equivalent functions to end users. To provide better service to end users, a framework called Computing-Aware Traffic Steering (CATS) [I-D.ietf-cats-framework] has been proposed.

CATS [I-D.ietf-cats-framework] is a traffic engineering approach that takes into account the dynamic nature of computing resources and network state to optimize service-specific traffic forwarding towards a given service contact instance. Various relevant metrics may be used to enforce such computing-aware traffic steering policies.

To effectively steer traffic to the appropriate service instance, network devices need a model of the service instance's computing status. A common definition of computing metrics is essential for effective coordination between network devices and computing systems. Without standardized computing metrics, devices on the network may interpret and respond to traffic conditions and computing load differently, leading to inefficiencies and potential conflicts. A standardized metric allows both network devices and computing systems to evaluate load consistently, enabling precise traffic steering decisions that optimize resource utilization and improve overall system performance.

[I-D.du-cats-computing-modeling-description] proposes various considerations that are useful when defining computing metrics.

Based on the considerations defined in [I-D.du-cats-computing-modeling-description], this document defines relevant computing metrics for CATS by categorizing the metrics into three levels based on their complexity and richness.

2. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

This document uses terms defined in [I-D.ietf-cats-framework]. We list them below for clarification.

3. Definition of Metrics

Many metrics are being discussed or have been defined in the routing and computing areas. The definition and usage of specific metrics depend heavily on the use case, especially in IT use cases. However, when compute metrics are distributed to network devices, appropriate categorization and abstraction are required to avoid introducing extra complexity into the network.

Based on the abstraction level of the metrics, this document defines three levels of metrics to meet the requirements of different use cases:

3.1. Level 0: Raw Metrics

Level 0 metrics are metrics without any abstraction. They encompass detailed, raw metrics, including but not limited to:

  • CPU: Base Frequency, Number of Cores, Boosted Frequency, Memory Bandwidth, Memory Size, Memory Utilization Ratio, Core Utilization Ratio, Power Consumption.

  • GPU: Frequency, Number of Render Units, Memory Bandwidth, Memory Size, Memory Utilization Ratio, Core Utilization Ratio, Power Consumption.

  • NPU: Computing Power, Utilization Ratio, Power Consumption.

  • Network: Bandwidth, TXBytes, RXBytes, HostBusUtilization.

  • Storage: Available Space, Read Speed, Write Speed.

  • Delay: Time taken to process a request.

In L0, the detailed information of each metric can be encoded into the protocol, and each service has its own metrics with different information elements. Metrics of this kind are widely used in IT systems.

Regarding network-related raw metrics, the IPPM WG has defined many types of metrics in [performance-metrics]. [RFC9439] also defines many packet performance and throughput/bandwidth metrics. Regarding computing metrics, [I-D.rcr-opsawg-operational-compute-metrics] defines a set of cloud resource metrics.
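As an informal illustration of why L0 metrics carry many information elements, the following Python sketch models a raw report for one service instance. The class names, field names, units, and values are hypothetical and are not defined by this document or by any CATS protocol; only a subset of the raw metrics listed above is shown.

```python
from dataclasses import dataclass, asdict


@dataclass
class CpuRawMetrics:
    # Hypothetical field names and units, chosen for illustration only.
    base_frequency_ghz: float
    num_cores: int
    core_utilization: float      # ratio in [0, 1]
    memory_size_gib: float
    memory_utilization: float    # ratio in [0, 1]


@dataclass
class StorageRawMetrics:
    available_space_gib: float
    read_speed_mbps: float
    write_speed_mbps: float


@dataclass
class Level0Report:
    instance_id: str
    cpu: CpuRawMetrics
    storage: StorageRawMetrics
    delay_ms: float              # time taken to process a request


def flatten(report: Level0Report) -> dict:
    """Flatten a nested L0 report into 'category.name' -> value pairs."""
    flat = {}
    for key, value in asdict(report).items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


report = Level0Report(
    instance_id="instance-1",
    cpu=CpuRawMetrics(2.4, 16, 0.35, 64.0, 0.50),
    storage=StorageRawMetrics(500.0, 3000.0, 1500.0),
    delay_ms=12.0,
)
elements = flatten(report)
print(len(elements), "information elements in one L0 report")
```

Even this reduced report carries a separate information element for every raw value, and each new service or hardware type would add further elements.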

3.2. Level 1: Categorized Metrics

In Level 1, metrics are grouped into categories, and an appropriate abstraction is applied to each category. The Level 0 raw metrics can be grouped into multiple categories, such as computing, networking, storage, and delay. Within each category, the metrics are normalized into a single value that represents the state of the resource; this value is a Level 1 metric. Potential categories are shown below:

  • Computing: A normalized value generated from the computing-related L0 metrics, such as the CPU/GPU/NPU L0 metrics.

  • Networking: A normalized value generated from the network-related L0 metrics.

  • Storage: A normalized value generated from the storage L0 metrics.

  • Delay: A normalized value generated from computing/networking/storage metrics, reflecting the processing delay of a request.

Editor note: detailed categories can be updated according to the CATS WG discussion.

The L0 metrics, such as those defined in [performance-metrics], [RFC9439], and [I-D.rcr-opsawg-operational-compute-metrics], can be grouped into the above categories. Each category uses its own method (a weighted sum, for example) to generate the normalized value. In this way, the protocol only needs to carry the metric categories and their normalized values, and avoids processing the detailed metrics.
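The categorization step can be sketched as follows. This is a minimal example that assumes min-max scaling of each raw value followed by a weighted sum within a category. The function names, value ranges, weights, and the convention that 1 is the best state are illustrative assumptions, since this document does not specify a normalization method.

```python
def normalize(value: float, worst: float, best: float) -> float:
    """Scale a raw L0 value to [0, 1], where 1 is the best state (assumed)."""
    score = (value - worst) / (best - worst)
    return max(0.0, min(1.0, score))  # clamp values outside the range


def category_score(scores: dict, weights: dict) -> float:
    """Weighted sum of normalized scores, divided by the total weight."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical L0 inputs for one service instance.
computing_scores = {
    "core_utilization": normalize(0.25, worst=1.0, best=0.0),    # lower is better
    "memory_utilization": normalize(0.50, worst=1.0, best=0.0),  # lower is better
}
storage_scores = {
    "available_space_gib": normalize(500.0, worst=0.0, best=1000.0),
}

# One normalized value per category is all the protocol would need to carry.
level1 = {
    "computing": category_score(
        computing_scores, {"core_utilization": 1.0, "memory_utilization": 1.0}
    ),
    "storage": category_score(storage_scores, {"available_space_gib": 1.0}),
}
print(level1)
```

The choice of ranges and weights stays local to each category, so the detailed L0 metrics never need to be visible to the protocol.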

3.3. Level 2: Fully Normalized Metric

An L2 metric is a one-dimensional value derived from a weighted sum of L1 metrics or directly from L0 metrics. Each service has its own normalization method, which might use different metrics with different weights. The ingress CATS router can compare metric values to make the traffic steering decision (e.g., a larger value has higher priority). In some cases, an implementation may allow the ingress CATS router to be configured with the metric normalization method so that it can infer the effect of the underlying L1 or L0 metrics.

This method reduces the complexity of transmitting and managing multiple metrics by consolidating them into a single, unified measure.
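The consolidation into a single value and the resulting comparison at the ingress CATS router can be sketched as below. The weights, site names, and L1 values are hypothetical, and the sketch assumes that a larger L2 value means a more preferable instance, following the example given above; a real service would define its own normalization method.

```python
def level2_metric(level1: dict, weights: dict) -> float:
    """Collapse L1 category values into one value via a weighted sum."""
    total_weight = sum(weights.values())
    return sum(weights[c] * level1[c] for c in weights) / total_weight


def select_instance(l2_by_instance: dict) -> str:
    """Pick the instance with the largest L2 value (larger is assumed better)."""
    return max(l2_by_instance, key=l2_by_instance.get)


# Hypothetical per-service weights for the four L1 categories.
WEIGHTS = {"computing": 4, "networking": 2, "storage": 1, "delay": 1}

# Hypothetical L1 reports, each category already normalized to [0, 1].
LEVEL1_REPORTS = {
    "site-a": {"computing": 0.75, "networking": 0.5, "storage": 0.5, "delay": 0.5},
    "site-b": {"computing": 0.25, "networking": 0.75, "storage": 0.5, "delay": 0.75},
}

l2_values = {
    name: level2_metric(l1, WEIGHTS) for name, l1 in LEVEL1_REPORTS.items()
}
chosen = select_instance(l2_values)
print(l2_values, chosen)
```

With these assumed inputs the router only compares two numbers, one per site, and never needs to parse the individual categories.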

4. Comparison of the Three Metric Levels

From L0 to L1 to L2, the computing metrics are progressively consolidated. Different levels of abstraction can meet the requirements of different services. Table 1 compares the metric levels.

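Table 1: Comparison among Metrics Levels

+---------+---------------------+---------------+-----------+----------+
| Level   | Encoding Complexity | Extensibility | Stability | Accuracy |
+---------+---------------------+---------------+-----------+----------+
| Level 0 | Complicated         | Bad           | Bad       | Good     |
| Level 1 | Medium              | Medium        | Medium    | Medium   |
| Level 2 | Simple              | Good          | Good      | Medium   |
+---------+---------------------+---------------+-----------+----------+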

Since Level 0 metrics are raw metrics, different services may have their own metrics, resulting in hundreds or thousands of metrics in total. This brings huge complexity to protocol encoding and standardization. Therefore, metrics of this kind are typically used in customized IT systems on a case-by-case basis. Level 1 metrics are grouped into several categories, and each category is normalized into a single value, so they can be encoded into the protocol and standardized. With Level 2, all metrics are normalized into one single metric, which is even easier to encode in the protocol and to standardize. Therefore, from the encoding complexity aspect, Level 2 and Level 1 metrics are suggested.

Similarly, regarding extensibility, new services can define their own new L0 metrics, which requires the protocol to be extended as needed. Too many metric types create significant overhead for the protocol, resulting in poor extensibility. Level 1 introduces only a few metric categories, which is acceptable for protocol extension. Level 2 needs only one single metric, so it places the least burden on the protocol. Therefore, from the extensibility aspect, Level 2 and Level 1 metrics are suggested.

Regarding stability, new Level 0 raw metrics may require new protocol extensions, which makes the protocol format unstable. Therefore, this document does not recommend standardizing Level 0 metrics in the protocol. Level 1 metrics require only a few categories, and the Level 2 metric introduces only one metric to the protocol, so they are preferred from the stability aspect.

In conclusion, for computing-aware traffic steering, the L2 metric is recommended due to its simplicity. If advanced scheduling is needed, L1 metrics can be used. L0 metrics are the most comprehensive and dynamic; transferring them to network devices is therefore discouraged due to their high overhead.

Editor note: this draft can be updated according to the discussion of metric definition in the CATS WG.

5. Security Considerations

TBD

6. IANA Considerations

TBD

7. References

7.1. Normative References

[I-D.ietf-cats-framework]
Li, C., Du, Z., Boucadair, M., Contreras, L. M., and J. Drake, "A Framework for Computing-Aware Traffic Steering (CATS)", Work in Progress, Internet-Draft, draft-ietf-cats-framework-02, <https://datatracker.ietf.org/doc/html/draft-ietf-cats-framework-02>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/rfc/rfc8174>.

7.2. Informative References

[I-D.du-cats-computing-modeling-description]
Du, Z., Yao, K., Li, C., Huang, D., and Z. Fu, "Computing Information Description in Computing-Aware Traffic Steering", Work in Progress, Internet-Draft, draft-du-cats-computing-modeling-description-03, <https://datatracker.ietf.org/doc/html/draft-du-cats-computing-modeling-description-03>.
[I-D.rcr-opsawg-operational-compute-metrics]
Randriamasy, S., Contreras, L. M., Ros-Giralt, J., and R. Schott, "Joint Exposure of Network and Compute Information for Infrastructure-Aware Service Deployment", Work in Progress, Internet-Draft, draft-rcr-opsawg-operational-compute-metrics-06, <https://datatracker.ietf.org/doc/html/draft-rcr-opsawg-operational-compute-metrics-06>.
[performance-metrics]
"performance-metrics", n.d., <https://www.iana.org/assignments/performance-metrics/performance-metrics.xhtml>.
[RFC9439]
Wu, Q., Yang, Y., Lee, Y., Dhody, D., Randriamasy, S., and L. Contreras, "Application-Layer Traffic Optimization (ALTO) Performance Cost Metrics", RFC 9439, DOI 10.17487/RFC9439, <https://www.rfc-editor.org/rfc/rfc9439>.

Authors' Addresses

Kehan Yao
China Mobile
China
Hang Shi (editor)
Huawei Technologies
China
Cheng Li (editor)
Huawei Technologies
China