<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- You want a table of contents -->
<!-- Use symbolic labels for references -->
<!-- This sorts the references -->
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-liu-nmrg-ai-llm-inference-requirements-02"
     ipr="trust200902">
  <front>
    <title abbrev="Network Management">Requirements Analysis of System and
    Network for Large Language Model Inference Service</title>

    <author fullname="Chang Liu" initials="C." surname="Liu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>liuchangjc@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Chuyi Guo" initials="C." surname="Guo">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>guochuyi@chinamobile.com</email>
      </address>
    </author>


    <date day="3" month="November" year="2025"/>

    <area>IRTF</area>

    <workgroup>Network Management</workgroup>

    <keyword>LLM inference</keyword>

    <keyword>PD Fusion</keyword>

    <keyword>PD Disaggregation</keyword>

    <keyword>KV Cache</keyword>

    <keyword>Latency</keyword>

    <keyword>Throughput</keyword>

    <abstract>
      <t>With the rise of ChatGPT, DeepSeek, and other Large Language Models
      (LLMs), as well as the proliferation of inference applications,
      inference serving oriented to large-scale users has become increasingly
      critical. However, because inference places extreme demands on
      computing power and communication, the large-scale service deployment
      of LLMs poses significant challenges. To address these challenges,
      different vendors have adopted diverse inference service architectures,
      such as vLLM, SGLang, and Mooncake. This document surveys mainstream
      inference frameworks, summarizes their core design principles and
      research questions, and analyzes the challenges and requirements they
      impose on network management. The goal is to lay a foundation for
      defining a unified LLM inference architecture in the future.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>Since the launch of ChatGPT in late 2022, more and more
      product-level LLMs have emerged, with GPT-4o, Claude-Sonnet-3.5,
      Gemini, Kimi, and others leading the charge. In early 2025, DeepSeek-R1
      reignited the LLM frenzy, and xAI unveiled the powerful Grok 3. It is
      evident that LLMs will continue to reach new heights.</t>

      <t>Major vendors, including OpenAI, Anthropic, DeepSeek, and Google,
      have deployed their LLM applications across mobile and web platforms.
      As the field grows, daily active users (DAUs) for these applications
      are expected to surge, potentially reaching hundreds of millions during
      peak periods. This presents significant challenges for large-scale
      inference services. For instance, as of this writing, DeepSeek still
      struggles with persistent "Service Busy" issues.</t>

      <t>Existing large-scale inference service architectures primarily adopt
      two technical approaches: Prefill-Decoding (PD) Fusion and
      Prefill-Decoding (PD) Disaggregation. This split is derived from the
      distinct computational characteristics of the Prefill
      (compute-intensive) and Decoding (memory-intensive) phases. Efficient
      network management and hardware coordination are essential to maximize
      system throughput and minimize user-perceived latency.</t>

      <t>This document first introduces mainstream inference frameworks, then
      optimization metrics, and finally elaborates on the network and system
      requirements for deploying large-scale LLM inference services.</t>
    </section>

    <section title="Service-Oriented Inference Frameworks">
      <t>At present, mainstream LLM service systems follow two main technical
      routes, namely PD Fusion and PD Disaggregation. Prefill processes all
      tokens of a user request (also known as the prompt) in parallel; it is
      compute-intensive and compute-bound, with extremely high computing
      power requirements. Decoding generates the user-requested content based
      on the KV Cache and the first token produced by the Prefill phase.
      Because it reuses the KV Cache of all tokens preceding the current
      token, Decoding is memory-intensive and memory-bound, with higher
      memory requirements. A complete LLM inference procedure is shown in
      Figure 1. Depending on whether these two stages, with their clearly
      different computing requirements, are decoupled, two technical routes
      for LLM inference serving systems emerge: PD Fusion and PD
      Disaggregation. The rest of this section describes the two
      architectures in detail.</t>

      <figure align="center" title="LLM Inference Process">
        <artwork align="center" type="ascii-art"> +-------------+  +-------------+ +-------------+ +-------------+      
 |     LLM     |  |     LLM     | |     LLM     | |     LLM     |      
 | Iteration 1 +-+| Iteration 2 ++| Iteration 3 ++| Iteration 4 ++     
 +-----^-------+ |+---^------^--+|+---^-------^-+|+---^-------^-+|     
       |         |    |      |   |    |       |  |    |       |  |     
       |         | +--+--+   |   | +--+--+    |  | +--+--+    |  |     
&lt;Prompt:Is apple | | KV  | &lt;Yes&gt; | | KV  |  &lt;It&gt; | | KV  |  &lt;Is&gt; |&lt;EOS&gt;
        a fruit?&gt;| |Cache|   ^   | |Cache|    ^  | |Cache|    ^  |  ^  
                 | +--^--+   |   | +--^--+    |  | +--^--+    |  |  |  
                 |    |      |   |    |       |  |    |       |  |  |  
                 +----+------+   +----+-------+  +----+-------+  +--+  
                                                                       
+-----Prefill----+--------------------Decoding----------------------+  </artwork>
      </figure>

      <t>Prefill: Processes all tokens in user prompts (Parallelizable,
      compute-bound, requiring high computing power).</t>

      <t>Decoding: Generates output tokens sequentially based on the KV Cache
      from Prefill (Memory-bound, requiring high GPU memory).</t>

      <section title="PD Fusion Architecture">
        <t>In PD Fusion, LLM instances are deployed within a single cluster,
        managed by a global scheduler responsible for load balancing, KV Cache
        management, and resource allocation. Most frameworks adopt vLLM<xref
        target="vLLM"/>'s paged KV Cache mechanism, inspired by OS virtual
        memory management. This approach stores KV Cache into non-contiguous
        physical blocks across nodes and uses a scheduler to map logical
        blocks to physical memory. Additionally, prefix-sharing strategies are
        employed to reuse KV Cache for prompts with identical prefixes,
        reducing redundant computations. Remote KV Cache replication across
        nodes is also required to avoid recomputing the KV Cache of
        identical tokens. The architecture is shown in Figure 2.</t>

        <figure align="center" title="PD Fusion Architecture">
          <artwork align="center" type="ascii-art">                      Request1/Prompt1                        
                      Request2/Prompt2                        
                              |                               
                              |                               
                 +------------v------------+                  
                 |                         |                  
                 |  Scheduler/Controller   |                  
       Request1  |                         | Request2         
     +-----------+  *********************  +----------+       
     |           |  *KV Cache Management*  |          |       
     |           |  *  Load Balancing   *  |          |       
     |           |  *     ... ...       *  |          |       
     |           |  *********************  |          |       
     |           +-------------------------+          |       
     |                                                |       
     |                                                |       
     |                                                |       
+----v-----+  Remote    +----------+   Remote    +----v-----+ 
|  Model   |KVCache copy|  Model   | KVCache copy|  Model   | 
|Instance 1&lt;-----------&gt;|Instance 2|&lt;------------&gt;Instance 3| 
+----------+            +----------+             +----------+ </artwork>
        </figure>
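        <t>As a minimal sketch of the paged mechanism, assuming an
        illustrative block size of 16 tokens and hypothetical class names
        (this is not vLLM's actual API), the logical-to-physical block
        mapping works as follows:</t>

```python
# Illustrative sketch of a paged KV Cache block table, loosely modeled
# on the PagedAttention idea. Block size and class names are assumptions.

BLOCK_SIZE = 16  # tokens per KV Cache block (illustrative)

class BlockAllocator:
    """Hands out non-contiguous physical block ids from a free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV Cache blocks")
        return self.free.pop()

class SequenceKVCache:
    """Maps a sequence's logical blocks to physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index, physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so memory grows in block-sized steps with no large pre-reservation.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_location(self, token_idx):
        # Translate a token position to (physical block, offset in block).
        return (self.block_table[token_idx // BLOCK_SIZE],
                token_idx % BLOCK_SIZE)
```

        <t>Because physical blocks need not be contiguous, free memory can be
        shared across sequences, which is what enables the prefix-sharing
        strategies described above.</t>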
      </section>

      <section title="PD Disaggregation Architecture">
        <t>In PD Disaggregation, Prefill and Decoding are decoupled into
        separate instances to optimize hardware utilization. After Prefill
        computes the full KV Cache for a prompt, the data is transferred to
        Decoding instances for text generation. This architecture demands
        efficient coordination between Prefill and Decoding instances, as well
        as reliable high-speed data transmission. The workflow is illustrated
        in Figure 3.</t>

        <figure align="center" title="PD Disaggregation Architecture">
          <artwork align="center" type="ascii-art">                      Request1/Prompt1                              
                      Request2/Prompt2                              
                              |                                     
                              |                                     
                 +------------v------------+                        
                 |                         |                        
                 |  Scheduler/Controller   |                        
       Request1  |                         | Request2               
     +-----------+  *********************  +----------+             
     |           |  *KV Cache Management*  |          |             
     |           |  *  Load Balancing   *  |          |             
     |           |  *     ... ...       *  |          |             
     |           |  *********************  |          |             
     |           +-------------------------+          |             
     |                                                |             
     |                                                |             
     |                                                |             
+----v-----+                                     +----v-----+       
|  Model   |                                     |  Model   |       
|          |       Remote KVCache copy           |          |       
| Prefill  &lt;-------------------------------------&gt; Prefill  |       
|Instance 1|                                     |Instance 2|       
+----+-----+                                     +----+-----+       
     |KV Cache                                KV Cache|             
     |Transfer                                Transfer|             
     |                                                |             
+----v-----+                                     +----v-----+       
|  Model   |                                     |  Model   |       
|          |                                     |          |       
|Decoding  |                                     |Decoding  |       
|Instance 1|                                     |Instance 2|       
+----------+                                     +----------+</artwork>
        </figure>
      </section>
    </section>

    <section title="Inference-related Metrics">
      <t>The ultimate goals of an inference system are to maximize system
      goodput, which reflects the volume of user requests served, and to
      minimize user-perceived latency at low cost. For both the PD Fusion
      and PD Disaggregation architectures, three kinds of key metrics are
      defined as follows:</t>

      <t>Latency-related Metrics:<list>
          <t>TTFT (Time to First Token): The time from request arrival until
          the first output token is generated, dominated by the Prefill
          phase.</t>

          <t>TBT (Time Between Tokens): The interval between consecutive token
          generations in the Decoding phase.</t>
        </list></t>

      <t>Throughput-related Metrics:<list>
          <t>TPS (Tokens Per Second): The number of tokens generated per
          second by the inference system.</t>

          <t>RPS (Requests Per Second): The number of requests processed per
          second by the inference system.</t>
        </list></t>

      <t>Cost-related Metrics: <list>
          <t>Cost Per Token: The monetary cost incurred by the inference
          system to generate each token.</t>
        </list></t>
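
      <t>Given per-token timestamps for a single request, the latency- and
      throughput-related metrics above can be computed as in the following
      sketch; the function name and the numbers used are illustrative
      assumptions, not part of any framework:</t>

```python
# Illustrative computation of TTFT, TBT, and TPS for one request from
# per-token wall-clock timestamps.

def latency_metrics(request_arrival, token_times):
    """token_times: time at which each output token was generated."""
    ttft = token_times[0] - request_arrival
    gaps = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0  # mean time between tokens
    tps = len(token_times) / (token_times[-1] - request_arrival)
    return ttft, tbt, tps
```

      <t>For example, a request arriving at t=0 whose four tokens appear at
      0.5 s, 0.6 s, 0.7 s, and 0.8 s has a TTFT of 0.5 s, a mean TBT of
      0.1 s, and a TPS of 5 for that request.</t>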
    </section>

    <section title="Research Question for Service-Oriented Inference Frameworks">
      <t>From the user's perspective, the end-to-end latency of a complete
      inference process is a key metric directly impacting quality of
      service and user experience. From the perspective of LLM deployment,
      the most critical challenge is how to deploy LLM services at the
      lowest possible cost while meeting basic user requirements for
      inference latency, minimizing idle rates of computational resources,
      and ensuring systems operate at near-saturation levels. Industry
      commonly adopts decoupled architectures (e.g., PD Disaggregation) to
      deploy and independently optimize distinct model components on
      heterogeneous clusters. However, maximizing system throughput and
      reducing inference deployment costs inevitably sacrifices user
      end-to-end inference latency. In other words, latency, throughput,
      and cost form an impossible trinity. Synthesizing the needs and
      objectives of both users and service providers, the primary challenge
      for inference systems is: how to maximize system throughput and
      minimize deployment costs while adhering to Service Level Objective
      (SLO) constraints for inference services?</t>
    </section>

    <section title="Challenges for Service-Oriented Inference Frameworks">
      <t>To address the above key issue, current industry practices (e.g.,
      Mooncake<xref target="Mooncake"/>, DeepSeek, vLLM<xref target="vLLM"/>,
      SGLang<xref target="SGLang"/>) employ methods such as KV Cache prefix
      matching, PD Disaggregation deployment optimization, efficient KV Cache
      memory management, and flexible load balancing scheduling to enhance
      resource utilization. These approaches also aim to leverage fragmented,
      low-cost and idle resources, thereby increasing user throughput and
      reducing inference service deployment costs. However, these approaches
      face several significant challenges:</t>

      <section title="Challenge 1">
        <t>Whether using PD Disaggregation or PD Fusion deployment, the
        industry widely adopts KV Cache prefix matching for large-scale
        inference deployment optimization. This technique aims to reduce
        redundant computation and storage of prefix KV Cache across different
        inference requests, thereby saving computational resources and
        ultimately improving system throughput while lowering deployment
        costs. How to systematically manage these KV Cache prefixes within the
        inference network&mdash;including storage mechanisms, placement
        strategies, scheduling schemes, and replacement policies&mdash;remains
        a critical and urgent challenge.</t>
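
        <t>A minimal sketch of prefix matching over token ids is given
        below, using a trie; production systems (e.g., vLLM, SGLang)
        typically hash fixed-size block prefixes instead, so this structure
        is only illustrative:</t>

```python
# Illustrative KV Cache prefix matching: a trie over token ids reports
# how many leading tokens of a new prompt already have cached KV entries.
# Real systems usually hash block-sized prefixes; this is a simplification.

class PrefixTrie:
    def __init__(self):
        self.children = {}

    def insert(self, token_ids):
        # Record a served prompt so later requests can reuse its prefix.
        node = self
        for tok in token_ids:
            node = node.children.setdefault(tok, PrefixTrie())

    def match_length(self, token_ids):
        """Number of leading tokens whose KV Cache can be reused."""
        node, matched = self, 0
        for tok in token_ids:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched
```

        <t>The matched prefix length directly determines how much Prefill
        computation can be skipped for a new request, which is why placement
        and replacement policies for these prefixes matter.</t>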
      </section>

      <section title="Challenge 2">
        <t>To optimize the trade-offs between inference deployment cost,
        service performance, and system throughput, implementing more
        efficient resource- and KV Cache-aware mechanisms within the inference
        network's management and control plane, alongside hardware- and
        network-aware load balancing scheduling, presents an extremely crucial
        challenge.</t>
      </section>

      <section title="Challenge 3">
        <t>If a PD Disaggregation deployment approach is adopted, ensuring
        efficient transmission of KV Cache between the separated,
        heterogeneous Prefill and Decoding clusters becomes a vital
        challenge. This transmission must be designed to avoid impacting
        end-to-end inference latency entirely, or at least to minimize its
        impact as much as possible.</t>
      </section>
    </section>

    <section title="Network Management Requirements for Service-Oriented Inference Frameworks">
      <t>To achieve large-scale LLM service deployment, frameworks MUST meet
      the following requirements in both control plane and data plane.</t>

      <section title="Efficient Load Balancing">
        <t>In large-scale inference systems, the efficient scheduling and
        routing of inference requests is a critical requirement. It primarily
        focuses on two aspects: first, the request scheduling must be KV
        Cache-aware, striving to match cached KV data to minimize redundant
        computations and storage; second, the request scheduling must account
        for hardware load and network conditions, ensuring sufficient
        resources to process inference requests as efficiently as
        possible.</t>
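
        <t>A routing decision that weighs both aspects can be sketched as a
        simple scoring function; the weights and instance fields below are
        assumptions for illustration, not part of any specific
        framework:</t>

```python
# Illustrative KV Cache-aware, load-aware request router. Each candidate
# instance reports how many leading prompt tokens it already has cached
# and its current load in [0, 1]; the scoring weights are assumptions.

def pick_instance(instances, prompt_len, cache_weight=1.0, load_weight=1.0):
    """Return the instance with the best trade-off of cache reuse vs. load."""
    def score(inst):
        reuse = inst["cached_prefix_len"] / max(prompt_len, 1)
        # Reward cache hits, penalize busy instances.
        return cache_weight * reuse - load_weight * inst["load"]
    return max(instances, key=score)
```

        <t>Tuning the two weights shifts the scheduler between minimizing
        redundant computation and equalizing hardware load.</t>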
      </section>

      <section title="KV Cache Management">
        <t>In inference architecture systems, storage capacity is typically
        directly correlated with service concurrency. In large-scale inference
        service deployment scenarios, distributed storage systems are often
        employed to manage KV Cache. Additionally, multi-level storage
        structures within single nodes, such as HBM-CPU DRAM-SSD, are commonly
        used to cache KV data. Beyond the placement mechanisms of KV Cache in
        hierarchical distributed storage systems, the disparity in
        communication bandwidth across cards/nodes and different storage tiers
        has become a bottleneck restricting the efficient utilization of KV
        Cache. How to balance these communication bandwidths as much as
        possible constitutes another critical issue that needs to be
        addressed.</t>
      </section>

      <section title="KV Cache Transmission">
        <t>In large-scale inference service deployment scenarios, the
        separation of the Prefill and Decoding (PD) phases is an effective
        way to enhance system concurrency and reduce deployment costs. PD
        separation involves substantial KV Cache data transmission between
        Prefill instances and Decoding instances, while request servicing
        imposes extremely demanding real-time requirements. Ensuring
        real-time KV Cache transmission between PD instances is therefore
        another critically important area that requires focused
        attention.</t>
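
        <t>The scale of this transmission can be estimated with simple
        arithmetic; the model shape and link speed below are illustrative
        assumptions, roughly corresponding to a 7B-parameter dense model at
        fp16:</t>

```python
# Back-of-the-envelope estimate of the KV Cache volume that PD
# Disaggregation must move per request, and the resulting transfer time.
# The model shape is illustrative, not a measurement of any deployment.

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    # Factor 2 for separate Key and Value tensors; fp16 is 2 bytes/element.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

def transfer_seconds(num_bytes, link_gbps):
    # link_gbps is the nominal line rate in gigabits per second.
    return num_bytes * 8 / (link_gbps * 1e9)

# For a 4096-token prompt with 32 layers, 32 KV heads, head_dim 128 at
# fp16, the KV Cache is about 2 GiB per request; at 100 Gb/s line rate
# the transfer alone takes on the order of 170 ms, which is why
# RDMA-class interconnects are typically used between PD instances.
size = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=32,
                      head_dim=128)
```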
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <reference anchor="vLLM">
        <front>
          <title>Efficient Memory Management for Large Language Model Serving
          with PagedAttention</title>

          <author fullname="Woosuk Kwon" surname="Kwon">
            <organization>UC Berkeley</organization>
          </author>

          <date year="2023"/>
        </front>
      </reference>

      <reference anchor="SGLang">
        <front>
          <title>SGLang: Efficient Execution of Structured Language Model
          Programs</title>

          <author fullname="Lianmin Zheng" surname="Zheng">
            <organization>UC Berkeley</organization>
          </author>

          <date day="6" month="June" year="2024"/>
        </front>
      </reference>

      <reference anchor="Mooncake">
        <front>
          <title>Mooncake: A KVCache-centric Disaggregated Architecture for
          LLM Serving</title>

          <author fullname="Ruoyu Qin" surname="Qin">
            <organization>Moonshot AI, Tsinghua University</organization>
          </author>

          <date day="9" month="July" year="2024"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
