Internet Research Task Force                                       M. Li
Internet-Draft                                                   C. Zhou
Intended status: Informational                                   D. Chen
Expires: 8 January 2025                                     China Mobile
                                                             7 July 2024


 Data Generation and Optimization for Digital Twin Network Performance
                                Modeling
           draft-li-nmrg-dtn-data-generation-optimization-02

Abstract

   A Digital Twin Network (DTN) can serve as a secure and cost-effective
   environment in which network operators evaluate network performance
   under various what-if scenarios.  Recently, AI models, especially
   neural networks, have been applied to DTN performance modeling.  The
   quality of such models depends mainly on two aspects: model
   architecture and data.  This memo focuses on how to improve the model
   from the data perspective.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 8 January 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.


Li, et al.               Expires 8 January 2025                 [Page 1]

Internet-Draft  Data Generation and Optimization for DTN       July 2024


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Acronyms & Abbreviations  . . . . . . . . . . . . . . . . . .   3
   3.  Requirements  . . . . . . . . . . . . . . . . . . . . . . . .   3
   4.  Framework of Data Generation and Optimization . . . . . . . .   4
     4.1.  Data Generation Stage . . . . . . . . . . . . . . . . . .   5
     4.2.  Data Optimization Stage . . . . . . . . . . . . . . . . .   6
   5.  Data Generation . . . . . . . . . . . . . . . . . . . . . . .   6
     5.1.  Network Topology  . . . . . . . . . . . . . . . . . . . .   6
     5.2.  Routing Policy  . . . . . . . . . . . . . . . . . . . . .   7
     5.3.  Traffic Matrix  . . . . . . . . . . . . . . . . . . . . .   7
   6.  Data Optimization . . . . . . . . . . . . . . . . . . . . . .   8
     6.1.  Seed Sample Selection Phase . . . . . . . . . . . . . . .   8
     6.2.  Incremental Optimization Phase  . . . . . . . . . . . . .   9
   7.  Discussion  . . . . . . . . . . . . . . . . . . . . . . . . .   9
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .  10
   9.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  10
   10. References  . . . . . . . . . . . . . . . . . . . . . . . . .  10
     10.1.  Informative References . . . . . . . . . . . . . . . . .  10
     10.2.  Normative References . . . . . . . . . . . . . . . . . .  10
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  11

1.  Introduction

   A digital twin is a virtual instance of a physical system (twin) that
   is continually updated with the latter's performance, maintenance,
   and health status data throughout the physical system's life cycle.
   A Digital Twin Network (DTN) is a digital twin used in the context of
   networking [I-D.irtf-nmrg-network-digital-twin-arch].  A DTN can
   serve as a secure and cost-effective environment in which network
   operators evaluate network performance under various what-if
   scenarios.  Recently, AI models, especially neural networks, have
   been applied to DTN performance modeling.

   The quality of AI models depends mainly on two aspects: model
   architecture and data.  This memo focuses on the data: the quality of
   the training data directly affects the accuracy and generalization
   ability of the model.  The memo describes how to design data
   generation and optimization methods for DTN performance modeling.
   These methods generate simulated network data to compensate for the
   shortage of practical data, and they select high-quality data from
   the various data sources for training.

2.  Acronyms & Abbreviations

   DTN:  Digital Twin Network

   AI:  Artificial Intelligence

   AIGC:  AI-Generated Content

   ToS:  Type of Service

   OOD:  Out-of-Distribution

   FIFO:  First In First Out

   SP:  Strict Priority

   WFQ:  Weighted Fair Queuing

   DRR:  Deficit Round Robin

   BFS:  Breadth-First Search

   CBR:  Constant Bit Rate

3.  Requirements

   Performance modeling is vital to DTN because it is involved in
   typical network management scenarios such as planning, operation,
   optimization, and upgrade.  Recently, some studies have applied AI
   models to DTN performance modeling, such as RouteNet [RouteNet] and
   MimicNet [MimicNet].  AI is a data-driven technology whose
   performance depends heavily on data quality.

   Network data sources are diverse and of varying quality, which makes
   it difficult to use their data directly as training data for DTN
   performance models:

   *  Practical data from production networks: Data from production
      networks usually have high value, but the quantity, type, and
      accuracy are limited.  Moreover, it is not practical in production
      networks to collect data under various configurations;


   *  Network simulators: Network simulators (e.g., NS-3 and OMNeT++)
      can be used to generate simulated network data, which can solve
      the problems of quantity, diversity, and accuracy to a certain
      extent.  However, simulation is usually time-consuming.  In
      addition, there are usually differences between simulated data and
      practical data from production networks, which hinders the
      application of trained models to production networks;

   *  Generative AI models: With the development of AI-Generated Content
      (AIGC) technology, generative AI models (e.g., GPT and LLaMA) can
      be used to generate simulated network data, which can solve the
      problems of quantity and diversity to a certain extent.  However,
      the accuracy of data generated by generative AI models is limited,
      and such data often deviates from practical data from production
      networks.

   Therefore, data generation and optimization methods for DTN
   performance modeling are needed.  Such methods generate simulated
   network data to compensate for the shortage of practical data and
   select high-quality data from multi-source data.  High-quality data
   is accurate, diverse, and consistent with practical data from
   production networks.  Training with high-quality data can improve the
   accuracy and generalization of DTN performance models.

4.  Framework of Data Generation and Optimization

   The framework of data generation and optimization for DTN performance
   modeling is shown in Figure 1, which includes two stages: the data
   generation stage and the data optimization stage.


          Data generation                   Data optimization
   +---------------------------+ +-------------------------------------+
   |                           | |                                     |
   | +---------+               | |              +---------+            |
   | |         |               | | +----------+ |         |            |
   | | Network |               | | | Practical| | Easy    |            |
   | | topology| +-----------+ | | | data     | | samples |            |
   | |         | |           | | | +-----+----+ |         |            |
   | |         | | Network   | | |       |      |         | +--------+ |
   | |         | | simulator | | | +-----v----+ |         | |        | |
   | | Routing | |           | | | |          | | Hard    | | High   | |
   | | policy  +->           +-+-+-> Candidate+-> samples +-> quality| |
   | |         | |           | | | | data     | |         | | data   | |
   | |         | | Generative| | | |          | |         | |        | |
   | |         | | AI model  | | | +----------+ |         | +--------+ |
   | | Traffic | |           | | |              | OOD     |            |
   | | matrix  | +-----------+ | |              | samples |            |
   | |         | Data generator| |              | (remove)|            |
   | +---------+               | |              |         |            |
   |  Network                  | |              +---------+            |
   |  configuration            | |             Data selection          |
   |                           | |                                     |
   +---------------------------+ +-------------------------------------+

      Figure 1: Framework of Data Generation and Optimization for DTN
                            Performance Modeling

4.1.  Data Generation Stage

   The data generation stage aims to generate candidate data (simulated
   network data) to solve the problem of the shortage of practical data
   from production networks.  This stage first generates network
   configurations and then imports them into data generators to generate
   the candidate data.

   *  Network configurations: Network configurations typically include
      network topology, routing policy, and traffic matrix.  These
      configurations need to be diverse to cover as many scenarios as
      possible.  Topology configurations include the number and
      structure of nodes and edges, node buffers' size and scheduling
      strategy, link capacity, etc.  Routing policy determines the path
      of a packet from the source to the destination.  The traffic
      matrix describes the traffic entering/leaving the network, which
      includes the traffic's source, destination, time and packet size
      distribution, Type of Service (ToS), etc.


   *  Data generators: Data generators can be network simulators (e.g.,
      NS-3 and OMNeT++) and/or generative AI models (e.g., GPT and
      LLaMA).  Network configurations are imported into data generators
      to generate candidate data.
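
   As an illustration only, the three parts of a network configuration
   could be held in a structure such as the following Python sketch.
   The class names, field names, and units are assumptions made for
   this example and are not defined by this memo.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    buffer_size: int   # node buffer size (unit is an assumption: packets)
    scheduling: str    # e.g. "FIFO", "SP", "WFQ", or "DRR"

@dataclass
class Flow:
    src: int
    dst: int
    avg_bandwidth: float   # average bandwidth (assumed unit: bit/s)
    time_dist: str         # e.g. "poisson", "cbr", or "on-off"
    pkt_sizes: list        # candidate packet sizes in bytes
    pkt_size_probs: list   # probability of each packet size
    tos: int = 0           # Type of Service

@dataclass
class NetworkConfig:
    nodes: dict                                   # node id -> Node
    link_capacity: dict                           # (u, v) -> capacity
    routing: dict = field(default_factory=dict)   # (src, dst) -> node path
    traffic: list = field(default_factory=list)   # list of Flow

    def validate(self):
        """Check that topology, routing, and traffic agree with each other."""
        for u, v in self.link_capacity:
            if u not in self.nodes or v not in self.nodes:
                raise ValueError(f"link ({u}, {v}) has an unknown endpoint")
        for flow in self.traffic:
            if abs(sum(flow.pkt_size_probs) - 1.0) > 1e-9:
                raise ValueError("packet size probabilities must sum to 1")
            path = self.routing.get((flow.src, flow.dst))
            if not path or path[0] != flow.src or path[-1] != flow.dst:
                raise ValueError("no valid path for flow")
            for hop in zip(path, path[1:]):
                if hop not in self.link_capacity:
                    raise ValueError(f"path uses missing link {hop}")
        return True
```

   A consistency check of this kind can catch malformed configurations
   before they are handed to a data generator.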

4.2.  Data Optimization Stage

   The data optimization stage aims to optimize the candidate data from
   various sources to select high-quality data.

   *  Candidate data: Candidate data includes simulated network data
      generated in the data generation stage and the practical data from
      production networks.

   *  Data selection: The data selection module examines the candidate
      data and classifies each sample as easy, hard, or
      Out-of-Distribution (OOD).  Hard samples are samples that are
      difficult for the model to predict accurately.  Exposing the model
      to more hard samples during training enables it to perform better
      on such samples later on.  The easy samples and hard samples are
      considered valid and are added to the training data.  OOD samples
      are considered invalid and are removed.

   *  High-quality data: High-quality data needs to be accurate, diverse,
      and consistent with practical data.  These properties can be
      verified with expert knowledge (such as the expected ranges of
      delay, queue utilization, link utilization, and average port
      occupancy).

5.  Data Generation

   This section describes how to generate network configurations,
   including network topology, routing policy, and traffic matrix.
   These configurations are then imported into data generators to
   generate the candidate data.

5.1.  Network Topology

   Network topologies are generated using the Power-Law Out-Degree
   algorithm, where parameters are set according to real-world
   topologies in the Internet Topology Zoo.
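
   The following Python sketch illustrates topology generation in the
   style of the Power-Law Out-Degree approach: each node draws a degree
   credit from a power law, and random node pairs that still have
   credit are then connected.  The function name, the parameters alpha
   and beta, and their default values are assumptions made for this
   example.  Fitting the parameters to Internet Topology Zoo topologies
   is not shown, and the sketch does not guarantee a connected graph.

```python
import random

def plod_topology(num_nodes, num_edges, alpha=0.8, beta=10.0,
                  seed=None, max_tries=100000):
    """Return a sorted list of undirected edges (u, v) with u < v."""
    rng = random.Random(seed)
    # Degree credit per node, drawn from a power law: beta * x^(-alpha),
    # where x is a random integer in [1, num_nodes].
    credits = [max(1, round(beta * rng.randint(1, num_nodes) ** -alpha))
               for _ in range(num_nodes)]
    edges = set()
    tries = 0
    # Bound the number of attempts so the loop ends even when the
    # remaining credits cannot supply num_edges edges.
    while len(edges) < num_edges and tries < max_tries:
        tries += 1
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        edge = (min(u, v), max(u, v))
        if u != v and credits[u] > 0 and credits[v] > 0 and edge not in edges:
            edges.add(edge)
            credits[u] -= 1
            credits[v] -= 1
    return sorted(edges)
```

   Because the attempt count is bounded, the function may return fewer
   than num_edges edges when the degree credits run out.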


   When the flow rate exceeds the link bandwidth or the bandwidth set
   for the flow, packets are temporarily stored in the node buffer.  A
   larger node buffer size means a larger delay and possibly a lower
   packet loss rate.  The node scheduling policy determines the time and
   order of packet transmission.  It is randomly selected from policies
   such as First In First Out (FIFO), Strict Priority (SP), Weighted
   Fair Queuing (WFQ), and Deficit Round Robin (DRR).

   A larger link capacity means a smaller delay and less congestion.  To
   cover a diverse range of link loads, we set the capacity of each link
   to be proportional to the total average bandwidth of the flows
   passing through the link.
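
   The node and link settings described in this section can be sketched
   in Python as follows.  This is a minimal illustration: the function
   names, the candidate buffer sizes, and the range of the
   proportionality factor are assumptions made for this example and are
   not values taken from this memo.

```python
import random

SCHEDULING_POLICIES = ("FIFO", "SP", "WFQ", "DRR")

def sample_node_configs(nodes, buffer_sizes=(8, 16, 32, 64), seed=None):
    """Pick a buffer size and a scheduling policy at random per node."""
    rng = random.Random(seed)
    return {n: {"buffer_size": rng.choice(buffer_sizes),
                "scheduling": rng.choice(SCHEDULING_POLICIES)}
            for n in nodes}

def link_capacities(paths, flow_bw, factor_range=(1.0, 4.0), seed=None):
    """paths: (src, dst) -> list of node ids on the path.
    flow_bw: (src, dst) -> average bandwidth of that flow."""
    rng = random.Random(seed)
    load = {}
    for pair, path in paths.items():
        for link in zip(path, path[1:]):
            load[link] = load.get(link, 0.0) + flow_bw[pair]
    # Capacity is proportional to the carried load.  Drawing the factor
    # at random per link varies the resulting link utilization.
    return {link: rng.uniform(*factor_range) * bw
            for link, bw in load.items()}
```

   A factor close to 1 yields a heavily loaded link, while a larger
   factor yields a lightly loaded one.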

5.2.  Routing Policy

   The routing policy determines the path of a packet from the source to
   the destination.

   *  Default: We set the weight of all links in the topology to be the
      same, that is, equal to 1.  Then we use the Dijkstra algorithm to
      generate the shortest path configuration.  The Dijkstra algorithm
      finds the single-source shortest paths in a digraph with
      non-negative link weights.  With equal link weights, these are the
      minimum-hop paths that Breadth-First Search (BFS) would also find.

   *  Variants: We randomly select some links (the same link can be
      chosen more than once) and add a small weight to them.  Then we
      use the Dijkstra algorithm to generate a series of variants of the
      default shortest path configuration based on the weighted graph.
      These variants can add some randomness to the routing
      configuration to cover longer paths and larger delays.
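
   The default and variant routing configurations can be sketched in
   Python as follows.  The function names and the size of the added
   weight (extra) are assumptions made for this example.

```python
import heapq
import random

def dijkstra_paths(weights, source):
    """weights: (u, v) -> non-negative weight of the directed link u->v.
    Returns node -> shortest path from source, as a list of nodes."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    paths = {}
    for node in dist:
        path = [node]
        while path[-1] != source:
            path.append(prev[path[-1]])
        paths[node] = path[::-1]
    return paths

def routing_variant(links, num_picks, extra=0.1, seed=None):
    """Start from unit weights and add a small weight to random links."""
    rng = random.Random(seed)
    weights = {link: 1.0 for link in links}
    for _ in range(num_picks):
        weights[rng.choice(links)] += extra  # a link may be picked again
    return weights
```

   Calling dijkstra_paths() on unit weights gives the default
   configuration, and calling it on the output of routing_variant()
   gives one variant.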

5.3.  Traffic Matrix

   The traffic matrix is an essential input to network performance
   modeling.  It can be regarded as a map of the traffic entering and
   leaving the network, including the source, destination, and
   distribution of the traffic.

   We generate traffic matrix configurations with variable traffic
   intensity to cover low to high loads.

   Packet sizes, packet size probabilities, and ToS values are generated
   so that their distributions are similar to those found by analyzing
   the validation dataset.
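
   As a minimal Python sketch of the two preceding paragraphs, the
   first function below scales per-pair rates by a traffic intensity,
   and the other two draw values (for example, packet sizes or ToS)
   with the same relative frequencies as a set of observed values.  The
   function names and the 0.1 to 1.0 range of the per-pair rate are
   assumptions made for this example.

```python
import random
from collections import Counter

def traffic_matrix(num_nodes, intensity, seed=None):
    """Return (src, dst) -> average rate; intensity scales the load."""
    rng = random.Random(seed)
    return {(s, d): intensity * rng.uniform(0.1, 1.0)
            for s in range(num_nodes)
            for d in range(num_nodes) if s != d}

def empirical_distribution(observed):
    """Return the distinct observed values and their relative frequencies."""
    counts = Counter(observed)
    values = sorted(counts)
    return values, [counts[v] / len(observed) for v in values]

def sample_like(observed, n, seed=None):
    """Draw n values that follow the empirical distribution of observed."""
    values, probs = empirical_distribution(observed)
    return random.Random(seed).choices(values, weights=probs, k=n)
```

   Generating one matrix per intensity value, from small to large, is
   one way to cover low to high loads.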

   The arrival of packets for each source-destination pair is modeled
   using one of the time distributions such as Poisson, Constant Bit
   Rate (CBR), and ON-OFF.
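
   The three arrival processes can be sketched in Python as follows.
   The function name and parameters are assumptions made for this
   example.  The ON-OFF model is simplified here to fixed ON and OFF
   periods with constant-rate sending during ON periods; other ON-OFF
   models draw the period lengths at random.

```python
import random

def arrival_times(dist, rate, duration, seed=None,
                  on_time=1.0, off_time=1.0):
    """Packet arrival timestamps in [0, duration) for one
    source-destination pair.  rate is in packets per second while the
    source is sending."""
    rng = random.Random(seed)
    times, t = [], 0.0
    while True:
        if dist == "poisson":
            t += rng.expovariate(rate)   # exponential inter-arrival gap
        elif dist in ("cbr", "on-off"):
            t += 1.0 / rate              # fixed inter-arrival gap
        else:
            raise ValueError(f"unknown time distribution: {dist}")
        if t >= duration:
            return times
        if dist == "on-off" and t % (on_time + off_time) >= on_time:
            continue                     # OFF period: nothing is sent
        times.append(t)
```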


6.  Data Optimization

   This section describes how to optimize the data from various sources
   to select high-quality data.  The process consists of the seed sample
   selection phase and the incremental optimization phase.

   Candidate data includes simulated network data generated in the data
   generation stage and real data from production networks.  Data
   optimization supports a variety of selection strategies, including
   high fidelity, high coverage, etc.  High fidelity means that the
   selected data can fit the real data (e.g., having similar topologies,
   routing policies, traffic models, etc.), and high coverage means that
   the selected data can cover as many scenarios as possible.

6.1.  Seed Sample Selection Phase

   In the seed sample selection phase, high-quality seed samples are
   selected through the following steps to provide high-quality initial
   samples for the incremental optimization phase.

   STEP 1: Training feature extraction model and feature extraction.

   (1.1) The training data D' is selected from the candidate data D
   according to the selection strategy.  For the high fidelity strategy,
   the real data is used as the training data D'; for the high coverage
   strategy, the real data and simulated data are used together as the
   training data D'.

   (1.2) Feature extraction model E is trained using the training data
   D'.  Feature extraction model E is a network performance evaluation
   model (such as RouteNet) that can be used to evaluate performance
   indicators such as delay, jitter, and packet loss.

   (1.3) Use the feature extraction model E obtained in STEP (1.2) to
   extract the features of the training data D' obtained in STEP (1.1).
   A network can be defined as a set of flows F, queues Q, and links L.
   The flow state SF (such as delay, throughput, and packet loss), queue
   state SQ (such as port occupation), and link state SL (such as link
   utilization) are taken as features.  Each sample in the training data
   D' is converted to a feature vector [SF,SQ,SL].
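
   STEP (1.1) and STEP (1.3) can be sketched in Python as follows.  The
   training of model E in STEP (1.2) is not shown.  The function names,
   the "source" key, and the choice of summarizing each group of states
   by its mean and maximum (to obtain a vector of fixed length) are
   assumptions made for this example.

```python
def select_training_data(candidates, strategy):
    """candidates: list of dicts with a "source" key that is either
    "real" or "simulated"."""
    if strategy == "high_fidelity":
        return [s for s in candidates if s["source"] == "real"]
    if strategy == "high_coverage":
        return list(candidates)   # real and simulated data together
    raise ValueError(f"unknown selection strategy: {strategy}")

def feature_vector(flow_states, queue_states, link_states):
    """Build [SF, SQ, SL] for one sample.  Each argument is a non-empty
    list of numbers, e.g. per-flow delays, per-queue occupations, and
    per-link utilizations produced with model E."""
    def summary(values):
        # Summarize a variable-length list by its mean and maximum.
        return [sum(values) / len(values), max(values)]
    return summary(flow_states) + summary(queue_states) + summary(link_states)
```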

   STEP 2: Clustering.

   Cluster the training data D' after feature extraction.  Clustering
   (such as K-means and DBSCAN) is an unsupervised machine learning
   technique that automatically discovers the natural groups in the
   data.  It divides the data into multiple clusters such that the
   samples in the same cluster are similar to each other.
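
   As one possible instance of this step, the following Python sketch
   implements plain K-means on feature vectors.  The function name and
   parameters are assumptions made for this example, and any other
   clustering algorithm could be used instead.

```python
import random

def kmeans(points, k, iters=100, seed=None):
    """Plain K-means; returns (labels, centers) for a list of vectors."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: index of the nearest center for every point
        # (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels, centers
```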


   Repeat STEP 3 and STEP 4 until all clusters have been traversed.

   STEP 3: Calculating cluster centers and nearest neighbors.

   (3.1) Calculate cluster centers.  The method of calculating cluster
   centers depends on the clustering algorithm used in STEP 2.  For
   example, with the K-means clustering algorithm, the cluster center is
   the average of all data points in the cluster.  These cluster centers
   are added to the seed dataset DS.

   (3.2) Calculate k nearest neighbors of each cluster center and add
   them to the seed dataset DS.  Suitable nearest neighbor calculation
   methods can be used, such as Euclidean distance, cosine distance,
   etc.
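
   STEP 3 can be sketched in Python as follows for a single cluster.
   The function names are assumptions made for this example, and the
   center is computed as the mean, as in K-means.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # Not defined for zero vectors (division by zero).
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return 1.0 - dot / norm

def cluster_center(members):
    """Mean of all feature vectors in one cluster."""
    return [sum(col) / len(members) for col in zip(*members)]

def k_nearest(center, members, k, distance=euclidean):
    """The k members of the cluster that are closest to its center."""
    return sorted(members, key=lambda p: distance(center, p))[:k]
```

   The distance argument lets the same code run with either distance
   measure.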

   STEP 4: Expert knowledge verification.

   (4.1) Expert knowledge can be used to verify the validity of samples
   through the ranges of indicators such as delay, queue occupation, and
   link utilization.  If the verification passes, return to STEP 3 for
   the next cluster.  Otherwise, go to STEP (4.2).

   (4.2) Randomly select m samples from the seed dataset DS and remove
   them.  Calculate the nearest neighbors of the removed m samples, add
   them to the seed dataset DS, and go to STEP (4.1).
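
   The loop between STEP (4.1) and STEP (4.2) can be sketched in Python
   as follows.  The function names, the representation of a sample as a
   dictionary of indicators, the pool of replacement candidates, and
   the bound on the number of rounds are assumptions made for this
   example.

```python
import random

def passes_expert_check(sample, ranges):
    """ranges: indicator name -> (low, high) allowed by expert knowledge."""
    return all(lo <= sample[name] <= hi
               for name, (lo, hi) in ranges.items())

def verify_seed_set(seeds, pool, ranges, m, distance,
                    seed=None, max_rounds=100):
    """Repeat STEP (4.1) and STEP (4.2) until every seed sample passes.
    pool holds the other samples of the cluster; distance(a, b) is any
    distance between two samples."""
    rng = random.Random(seed)
    seeds, pool = list(seeds), list(pool)
    for _ in range(max_rounds):
        if all(passes_expert_check(s, ranges) for s in seeds):
            return seeds                      # STEP (4.1) passed
        # STEP (4.2): remove m random samples, then add the nearest
        # remaining neighbor of each removed sample.
        removed = rng.sample(seeds, min(m, len(seeds)))
        seeds = [s for s in seeds if s not in removed]
        for r in removed:
            if pool:
                nearest = min(pool, key=lambda p: distance(r, p))
                pool.remove(nearest)
                seeds.append(nearest)
    return seeds
```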

6.2.  Incremental Optimization Phase

   The seed samples are taken as the initial training dataset.  The
   filter model examines the remaining candidate samples and classifies
   them as easy, hard, or OOD samples.  The easy samples and hard
   samples are then added to the training dataset.  These steps are
   repeated to iteratively optimize the filter model and the training
   data until the high-quality data meets the constraints.
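
   The iteration can be sketched in Python as follows.  This memo does
   not define how the filter model separates easy, hard, and OOD
   samples, so the two thresholds, the error and novelty scores, the
   batch size, and the target size used as the stopping constraint are
   all assumptions made for this example.

```python
def classify_sample(error, novelty, hard_error=0.1, ood_novelty=3.0):
    """error: prediction error of the filter model on the sample.
    novelty: how far the sample lies from the current training data."""
    if novelty > ood_novelty:
        return "ood"
    return "hard" if error > hard_error else "easy"

def incremental_optimization(seeds, candidates, train, score,
                             batch_size, target_size):
    """train(data) -> filter model; score(model, sample) returns the
    pair (error, novelty).  Both callables are supplied by the caller."""
    training, remaining = list(seeds), list(candidates)
    while remaining and len(training) < target_size:
        model = train(training)           # refit the filter model
        batch = remaining[:batch_size]
        remaining = remaining[batch_size:]
        for sample in batch:
            error, novelty = score(model, sample)
            if classify_sample(error, novelty) != "ood":
                training.append(sample)   # easy and hard samples are kept
    return training
```

   Processing the candidates in batches lets the filter model be
   retrained on the growing training dataset between rounds.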

7.  Discussion

   Several topics related to data generation and optimization for DTN
   performance modeling require further discussion.

   *  Data generation methods: 1) Generate configurations that cover
      enough scenarios and scale from small to large networks. 2) Choose
      data generators that consider accuracy, speed, fidelity, etc. 3)
      Use data augmentation technology to expand the training data by
      using a small amount of practical data to generate similar data
      through prior knowledge.


   *  Data optimization methods: 1) Select data from multi-source
      candidate data, including hard sample mining, OOD detection, etc.
      2) Verify whether the data quality meets the requirements.

   *  Deployment: 1) Time/space complexity and explainability of the
      data generation and optimization methods. 2) Provide feedback for
      data collection to form a closed loop.

8.  Security Considerations

   TBD

9.  IANA Considerations

   This document has no requests to IANA.

10.  References

10.1.  Informative References

   [I-D.irtf-nmrg-network-digital-twin-arch]
              Zhou, C., Yang, H., Duan, X., Lopez, D., Pastor, A., Wu,
              Q., Boucadair, M., and C. Jacquenet, "Network Digital
              Twin: Concepts and Reference Architecture", Work in
              Progress, Internet-Draft, draft-irtf-nmrg-network-digital-
              twin-arch-05, 4 March 2024,
              <https://datatracker.ietf.org/doc/html/draft-irtf-nmrg-
              network-digital-twin-arch-05>.

   [MimicNet] Zhang, Q., Ng, K.K.W., Kazer, C.W., Yan, S., Sedoc, J.,
              and V. Liu, "MimicNet: Fast Performance Estimates for Data
              Center Networks with Machine Learning", ACM SIGCOMM 2021
              Conference (SIGCOMM ’21), August 2021.

   [RouteNet] Rusek, K., Suárez-Varela, J., Almasan, P., Barlet-Ros, P.,
              and A. Cabellos-Aparicio, "RouteNet: Leveraging Graph
              Neural Networks for network modeling and optimization in
              SDN", IEEE Journal on Selected Areas in Communications
              (JSAC), vol. 38, no. 10, October 2020.

10.2.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.


Authors' Addresses

   Mei Li
   China Mobile
   Beijing
   100053
   China
   Email: limeiyjy@chinamobile.com


   Cheng Zhou
   China Mobile
   Beijing
   100053
   China
   Email: zhouchengyjy@chinamobile.com


   Danyang Chen
   China Mobile
   Beijing
   100053
   China
   Email: chendanyang@chinamobile.com

