<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- You want a table of contents -->
<!-- Use symbolic labels for references -->
<!-- This sorts the references -->
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-liu-nmrg-ai-llm-inference-requirements-02"
     ipr="trust200902">
  <front>
    <title abbrev="Network Management">Requirements Analysis of System and
    Network for Large Language Model Inference Service</title>

    <author fullname="Chang Liu" initials="C." surname="Liu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>liuchangjc@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Chuyi Guo" initials="C." surname="Guo">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>guochuyi@chinamobile.com</email>
      </address>
    </author>


    <date day="3" month="November" year="2025"/>

    <area>IRTF</area>

    <workgroup>Network Management</workgroup>

    <keyword>LLM inference</keyword>

    <keyword>PD Fusion</keyword>

    <keyword>PD Disaggregation</keyword>

    <keyword>KV Cache</keyword>

    <keyword>Latency</keyword>

    <keyword>Throughput</keyword>

    <abstract>
      <t>With the rise of ChatGPT, DeepSeek, and other Large Language Models
      (LLMs), as well as the proliferation of inference applications,
      inference serving oriented to large-scale users has become increasingly
      critical. However, because inference places extreme demands on
      computing power and communication, the large-scale service deployment
      of LLMs poses significant challenges. To address these challenges,
      different vendors have adopted diverse inference service architectures,
      such as vLLM, SGLang, and Mooncake. This document surveys mainstream
      inference frameworks, summarizes their core design principles and
      research questions, and analyzes the challenges and requirements they
      impose on network management. The goal is to lay a foundation for
      defining a unified LLM inference architecture in the future.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>Since the launch of ChatGPT in late 2022, more and more
      product-level LLMs have emerged, with GPT-4o, Claude-Sonnet-3.5,
      Gemini, Kimi, and others leading the charge. In early 2025, DeepSeek-R1
      reignited the LLM frenzy, and xAI unveiled the powerful Grok 3. It is
      evident that LLMs will continue to reach new heights.</t>

      <t>Major vendors, including OpenAI, Anthropic, DeepSeek, and Google,
      have deployed their LLM applications across mobile and web platforms.
      As the field grows, daily active users (DAUs) for these applications
      are expected to surge, potentially reaching hundreds of millions during
      peak periods. This presents significant challenges for large-scale
      inference services. For instance, as of this writing, DeepSeek still
      struggles with persistent "Service Busy" issues.</t>

      <t>Existing large-scale inference service architectures primarily adopt
      two technical approaches: Prefill-Decoding (PD) Fusion and
      Prefill-Decoding (PD) Disaggregation. This split is derived from the
      distinct computational characteristics of the Prefill
      (compute-intensive) and Decoding (memory-intensive) phases. Efficient
      network management and hardware coordination are essential to maximize
      system throughput and minimize user-perceived latency.</t>

      <t>This document first introduces mainstream inference frameworks, then
      optimization metrics, and finally elaborates on the network and system
      requirements for deploying large-scale LLM inference services.</t>
    </section>

    <section title="Service-Oriented Inference Frameworks">
      <t>At present, mainstream LLM service systems follow two main technical
      routes, namely PD Fusion and PD Disaggregation. Prefill processes all
      tokens of a user request (also known as the prompt) in parallel; it is
      compute-intensive and compute-bound, with extremely high computing
      power requirements. Decoding generates the user-requested content based
      on the KV Cache and the first token produced by the Prefill phase.
      Because it reuses the KV Cache of all tokens preceding the current
      token, Decoding is memory-intensive and memory-bound, with higher
      memory requirements. A complete LLM inference procedure is shown in
      Figure 1. Depending on whether these two stages, with their clearly
      different computing requirements, are decoupled, two technical routes
      for LLM inference serving systems emerge: PD Fusion and PD
      Disaggregation. The rest of this section describes the two
      architectures in detail.</t>

      <figure align="center" title="LLM Inference Process">
        <artwork align="center" type="ascii-art"> +-------------+  +-------------+ +-------------+ +-------------+      
 |     LLM     |  |     LLM     | |     LLM     | |     LLM     |      
 | Iteration 1 +-+| Iteration 2 ++| Iteration 3 ++| Iteration 4 ++     
 +-----^-------+ |+---^------^--+|+---^-------^-+|+---^-------^-+|     
       |         |    |      |   |    |       |  |    |       |  |     
       |         | +--+--+   |   | +--+--+    |  | +--+--+    |  |     
&lt;Prompt:Is apple | | KV  | &lt;Yes&gt; | | KV  |  &lt;It&gt; | | KV  |  &lt;Is&gt; |&lt;EOS&gt;
        a fruit?&gt;| |Cache|   ^   | |Cache|    ^  | |Cache|    ^  |  ^  
                 | +--^--+   |   | +--^--+    |  | +--^--+    |  |  |  
                 |    |      |   |    |       |  |    |       |  |  |  
                 +----+------+   +----+-------+  +----+-------+  +--+  
                                                                       
+-----Prefill----+--------------------Decoding----------------------+  </artwork>
      </figure>

      <t>Prefill: Processes all tokens in user prompts (Parallelizable,
      compute-bound, requiring high computing power).</t>

      <t>Decoding: Generates output tokens sequentially based on the KV Cache
      from Prefill (Memory-bound, requiring high GPU memory).</t>

      <section title="PD Fusion Architecture">
        <t>In PD Fusion, LLM instances are deployed within a single cluster,
        managed by a global scheduler responsible for load balancing, KV Cache
        management, and resource allocation. Most frameworks adopt vLLM<xref
        target="vLLM"/>'s paged KV Cache mechanism, inspired by OS virtual
        memory management. This approach stores KV Cache into non-contiguous
        physical blocks across nodes and uses a scheduler to map logical
        blocks to physical memory. Additionally, prefix-sharing strategies are
        employed to reuse KV Cache for prompts with identical prefixes,
        reducing redundant computations. Remote KV Cache replication across
        nodes is also required to avoid recomputing the KV Cache of
        identical tokens. The architecture is shown in Figure 2.</t>

        <figure align="center" title="PD Fusion Architecture">
          <artwork align="center" type="ascii-art">                      Request1/Prompt1                        
                      Request2/Prompt2                        
                              |                               
                              |                               
                 +------------v------------+                  
                 |                         |                  
                 |  Scheduler/Controller   |                  
       Request1  |                         | Request2         
     +-----------+  *********************  +----------+       
     |           |  *KV Cache Management*  |          |       
     |           |  *  Load Balancing   *  |          |       
     |           |  *     ... ...       *  |          |       
     |           |  *********************  |          |       
     |           +-------------------------+          |       
     |                                                |       
     |                                                |       
     |                                                |       
+----v-----+  Remote    +----------+   Remote    +----v-----+ 
|  Model   |KVCache copy|  Model   | KVCache copy|  Model   | 
|Instance 1&lt;-----------&gt;|Instance 2|&lt;------------&gt;Instance 3| 
+----------+            +----------+             +----------+ </artwork>
        </figure>
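        <t>As a minimal sketch of the paged mechanism, assuming an
        illustrative block size of 16 tokens and hypothetical class names
        (this is not vLLM's actual API), the logical-to-physical block
        mapping works as follows:</t>

```python
# Illustrative sketch of a paged KV Cache block table, loosely modeled
# on the PagedAttention idea. Block size and class names are assumptions.

BLOCK_SIZE = 16  # tokens per KV Cache block (illustrative)

class BlockAllocator:
    """Hands out non-contiguous physical block ids from a free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV Cache blocks")
        return self.free.pop()

class SequenceKVCache:
    """Maps a sequence's logical blocks to physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index, physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so memory grows in block-sized steps with no large pre-reservation.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_location(self, token_idx):
        # Translate a token position to (physical block, offset in block).
        return (self.block_table[token_idx // BLOCK_SIZE],
                token_idx % BLOCK_SIZE)
```

        <t>Because physical blocks need not be contiguous, free memory can be
        shared across sequences, which is what enables the prefix-sharing
        strategies described above.</t>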
      </section>

      <section title="PD Disaggregation Architecture">
        <t>In PD Disaggregation, Prefill and Decoding are decoupled into
        separate instances to optimize hardware utilization. After Prefill
        computes the full KV Cache for a prompt, the data is transferred to
        Decoding instances for text generation. This architecture demands
        efficient coordination between Prefill and Decoding instances, as well
        as reliable high-speed data transmission. The workflow is illustrated
        in Figure 3.</t>

        <figure align="center" title="PD Disaggregation Architecture">
          <artwork align="center" type="ascii-art">                      Request1/Prompt1                              
                      Request2/Prompt2                              
                              |                                     
                              |                                     
                 +------------v------------+                        
                 |                         |                        
                 |  Scheduler/Controller   |                        
       Request1  |                         | Request2               
     +-----------+  *********************  +----------+             
     |           |  *KV Cache Management*  |          |             
     |           |  *  Load Balancing   *  |          |             
     |           |  *     ... ...       *  |          |             
     |           |  *********************  |          |             
     |           +-------------------------+          |             
     |                                                |             
     |                                                |             
     |                                                |             
+----v-----+                                     +----v-----+       
|  Model   |                                     |  Model   |       
|          |       Remote KVCache copy           |          |       
| Prefill  &lt;-------------------------------------&gt; Prefill  |       
|Instance 1|                                     |Instance 2|       
+----+-----+                                     +----+-----+       
     |KV Cache                                KV Cache|             
     |Transfer                                Transfer|             
     |                                                |             
+----v-----+                                     +----v-----+       
|  Model   |                                     |  Model   |       
|          |                                     |          |       
|Decoding  |                                     |Decoding  |       
|Instance 1|                                     |Instance 2|       
+----------+                                     +----------+</artwork>
        </figure>
      </section>
    </section>

    <section title="Inference-related Metrics">
      <t>The ultimate goals of an inference system are to maximize system
      goodput, which reflects the volume of user requests served, and to
      minimize user-perceived latency at low cost. For both the PD Fusion
      and PD Disaggregation architectures, three kinds of key metrics are
      defined as follows:</t>

      <t>Latency-related Metrics:<list>
          <t>TTFT (Time to First Token): The time from request arrival until
          the first output token is generated, dominated by the Prefill
          phase.</t>

          <t>TBT (Time Between Tokens): The interval between consecutive token
          generations in the Decoding phase.</t>
        </list></t>

      <t>Throughput-related Metrics:<list>
          <t>TPS (Tokens Per Second): The number of tokens generated per
          second by the inference system.</t>

          <t>RPS (Requests Per Second): The number of requests processed per
          second by the inference system.</t>
        </list></t>

      <t>Cost-related Metrics: <list>
          <t>Cost Per Token: The monetary cost incurred by the inference
          system to generate each token.</t>
        </list></t>
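
      <t>Given per-token timestamps for a single request, the latency- and
      throughput-related metrics above can be computed as in the following
      sketch; the function name and the numbers used are illustrative
      assumptions, not part of any framework:</t>

```python
# Illustrative computation of TTFT, TBT, and TPS for one request from
# per-token wall-clock timestamps.

def latency_metrics(request_arrival, token_times):
    """token_times: time at which each output token was generated."""
    ttft = token_times[0] - request_arrival
    gaps = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0  # mean time between tokens
    tps = len(token_times) / (token_times[-1] - request_arrival)
    return ttft, tbt, tps
```

      <t>For example, a request arriving at t=0 whose four tokens appear at
      0.5 s, 0.6 s, 0.7 s, and 0.8 s has a TTFT of 0.5 s, a mean TBT of
      0.1 s, and a TPS of 5 for that request.</t>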
    </section>

    <section title="Research Question for Service-Oriented Inference Frameworks">
      <t>From the user's perspective, the end-to-end latency of a complete
      inference process is a key metric directly impacting quality of
      service and user experience. From the perspective of LLM deployment,
      the most critical challenge is how to deploy LLM services at the
      lowest possible cost while meeting basic user requirements for
      inference latency, minimizing idle rates of computational resources,
      and ensuring systems operate at near-saturation levels. Industry
      commonly adopts decoupled architectures (e.g., PD Disaggregation) to
      deploy and independently optimize distinct model components on
      heterogeneous clusters. However, maximizing system throughput and
      reducing inference deployment costs inevitably sacrifices user
      end-to-end inference latency. In other words, latency, throughput,
      and cost form an impossible trinity. Synthesizing the needs and
      objectives of both users and service providers, the primary challenge
      for inference systems is: how to maximize system throughput and
      minimize deployment costs while adhering to Service Level Objective
      (SLO) constraints for inference services?</t>
    </section>

    <section title="Challenges for Service-Oriented Inference Frameworks">
      <t>To address the above key issue, current industry practices (e.g.,
      Mooncake<xref target="Mooncake"/>, DeepSeek, vLLM<xref target="vLLM"/>,
      SGLang<xref target="SGLang"/>) employ methods such as KV Cache prefix
      matching, PD Disaggregation deployment optimization, efficient KV Cache
      memory management, and flexible load balancing scheduling to enhance
      resource utilization. These approaches also aim to leverage fragmented,
      low-cost and idle resources, thereby increasing user throughput and
      reducing inference service deployment costs. However, these approaches
      face several significant challenges:</t>

      <section title="Challenge 1">
        <t>Whether using PD Disaggregation or PD Fusion deployment, the
        industry widely adopts KV Cache prefix matching for large-scale
        inference deployment optimization. This technique aims to reduce
        redundant computation and storage of prefix KV Cache across different
        inference requests, thereby saving computational resources and
        ultimately improving system throughput while lowering deployment
        costs. How to systematically manage these KV Cache prefixes within the
        inference network&mdash;including storage mechanisms, placement
        strategies, scheduling schemes, and replacement policies&mdash;remains
        a critical and urgent challenge.</t>
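
        <t>A minimal sketch of prefix matching over token ids is given
        below, using a trie; production systems (e.g., vLLM, SGLang)
        typically hash fixed-size block prefixes instead, so this structure
        is only illustrative:</t>

```python
# Illustrative KV Cache prefix matching: a trie over token ids reports
# how many leading tokens of a new prompt already have cached KV entries.
# Real systems usually hash block-sized prefixes; this is a simplification.

class PrefixTrie:
    def __init__(self):
        self.children = {}

    def insert(self, token_ids):
        # Record a served prompt so later requests can reuse its prefix.
        node = self
        for tok in token_ids:
            node = node.children.setdefault(tok, PrefixTrie())

    def match_length(self, token_ids):
        """Number of leading tokens whose KV Cache can be reused."""
        node, matched = self, 0
        for tok in token_ids:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched
```

        <t>The matched prefix length directly determines how much Prefill
        computation can be skipped for a new request, which is why placement
        and replacement policies for these prefixes matter.</t>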
      </section>

      <section title="Challenge 2">
        <t>To optimize the trade-offs between inference deployment cost,
        service performance, and system throughput, implementing more
        efficient resource- and KV Cache-aware mechanisms within the inference
        network's management and control plane, alongside hardware- and
        network-aware load balancing scheduling, presents an extremely crucial
        challenge.</t>
      </section>

      <section title="Challenge 3">
        <t>If a PD Disaggregation deployment approach is adopted, ensuring
        efficient transmission of KV Cache between the separated,
        heterogeneous Prefill and Decoding clusters becomes a vital
        challenge. This transmission must be designed to avoid impacting
        end-to-end inference latency entirely, or at least to minimize its
        impact as much as possible.</t>
      </section>
    </section>

    <section title="Network Management Requirements for Service-Oriented Inference Frameworks">
      <t>To achieve large-scale LLM service deployment, frameworks MUST meet
      the following requirements in both control plane and data plane.</t>

      <section title="Efficient Load Balancing">
        <t>In large-scale inference systems, the efficient scheduling and
        routing of inference requests is a critical requirement. It primarily
        focuses on two aspects: first, the request scheduling must be KV
        Cache-aware, striving to match cached KV data to minimize redundant
        computations and storage; second, the request scheduling must account
        for hardware load and network conditions, ensuring sufficient
        resources to process inference requests as efficiently as
        possible.</t>
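
        <t>A routing decision that weighs both aspects can be sketched as a
        simple scoring function; the weights and instance fields below are
        assumptions for illustration, not part of any specific
        framework:</t>

```python
# Illustrative KV Cache-aware, load-aware request router. Each candidate
# instance reports how many leading prompt tokens it already has cached
# and its current load in [0, 1]; the scoring weights are assumptions.

def pick_instance(instances, prompt_len, cache_weight=1.0, load_weight=1.0):
    """Return the instance with the best trade-off of cache reuse vs. load."""
    def score(inst):
        reuse = inst["cached_prefix_len"] / max(prompt_len, 1)
        # Reward cache hits, penalize busy instances.
        return cache_weight * reuse - load_weight * inst["load"]
    return max(instances, key=score)
```

        <t>Tuning the two weights shifts the scheduler between minimizing
        redundant computation and equalizing hardware load.</t>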
      </section>

      <section title="KV Cache Management">
        <t>In inference architecture systems, storage capacity is typically
        directly correlated with service concurrency. In large-scale inference
        service deployment scenarios, distributed storage systems are often
        employed to manage KV Cache. Additionally, multi-level storage
        structures within single nodes, such as HBM-CPU DRAM-SSD, are commonly
        used to cache KV data. Beyond the placement mechanisms of KV Cache in
        hierarchical distributed storage systems, the disparity in
        communication bandwidth across cards/nodes and different storage tiers
        has become a bottleneck restricting the efficient utilization of KV
        Cache. How to balance these communication bandwidths as much as
        possible constitutes another critical issue that needs to be
        addressed.</t>
      </section>

      <section title="KV Cache Transmission">
        <t>In large-scale inference service deployment scenarios, the
        separation of the Prefill and Decoding (PD) phases is an effective
        way to enhance system concurrency and reduce deployment costs. PD
        separation involves substantial KV Cache data transmission between
        Prefill instances and Decoding instances, while request servicing
        imposes extremely demanding real-time requirements. Ensuring
        real-time KV Cache transmission between PD instances is therefore
        another critically important area that requires focused
        attention.</t>
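
        <t>The scale of this transmission can be estimated with simple
        arithmetic; the model shape and link speed below are illustrative
        assumptions, roughly corresponding to a 7B-parameter dense model at
        fp16:</t>

```python
# Back-of-the-envelope estimate of the KV Cache volume that PD
# Disaggregation must move per request, and the resulting transfer time.
# The model shape is illustrative, not a measurement of any deployment.

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    # Factor 2 for separate Key and Value tensors; fp16 is 2 bytes/element.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

def transfer_seconds(num_bytes, link_gbps):
    # link_gbps is the nominal line rate in gigabits per second.
    return num_bytes * 8 / (link_gbps * 1e9)

# For a 4096-token prompt with 32 layers, 32 KV heads, head_dim 128 at
# fp16, the KV Cache is about 2 GiB per request; at 100 Gb/s line rate
# the transfer alone takes on the order of 170 ms, which is why
# RDMA-class interconnects are typically used between PD instances.
size = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=32,
                      head_dim=128)
```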
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <reference anchor="vLLM">
        <front>
          <title>Efficient Memory Management for Large Language Model Serving
          with PagedAttention</title>

          <author fullname="Woosuk Kwon" surname="Kwon">
            <organization>UC Berkeley</organization>
          </author>

          <date year="2023"/>
        </front>
      </reference>

      <reference anchor="SGLang">
        <front>
          <title>SGLang: Efficient Execution of Structured Language Model
          Programs</title>

          <author fullname="Lianmin Zheng" surname="Zheng">
            <organization>UC Berkeley</organization>
          </author>

          <date day="6" month="June" year="2024"/>
        </front>
      </reference>

      <reference anchor="Mooncake">
        <front>
          <title>Mooncake: A KVCache-centric Disaggregated Architecture for
          LLM Serving</title>

          <author fullname="Ruoyu Qin" surname="Qin">
            <organization>Moonshot AI, Tsinghua University</organization>
          </author>

          <date day="9" month="July" year="2024"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
