Network Working Group                                          M. Mondal
Internet-Draft                                                M. Gaikwad
Intended status: Informational                               Independent
Expires: 12 October 2026                                   10 April 2026

    Benchmarking Workload Profiles for Large Language Model Serving
             draft-mondal-llm-serving-workload-profiles-00

Abstract

This document defines standard workload profiles for benchmarking Large Language Model (LLM) inference serving systems. Each profile represents one class of production workload. A profile is defined by its input and output token distribution, output structure, latency sensitivity, concurrency pattern, and caching behavior. Profiles are organized into six groups: Non-Generative, Minimal Output, Interactive Streaming, Prefill-Heavy Generative, Decode-Heavy Generative, and Multi-Step Chained.

This document works with "Benchmarking Methodology for Large Language Model Serving" [LLM-METHOD]. It specifies the workloads that the methodology's tests SHOULD be run against.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 12 October 2026.

Copyright Notice

Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.
Mondal & Gaikwad Expires 12 October 2026 [Page 1] Internet-Draft LLM Benchmarking Profiles April 2026 This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/ license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Relationship to Companion Documents . . . . . . . . . . . 4 1.2. Document Structure . . . . . . . . . . . . . . . . . . . 4 1.3. Out of Scope . . . . . . . . . . . . . . . . . . . . . . 4 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 5 3. Profile Classification Framework . . . . . . . . . . . . . . 5 3.1. Input/Output Ratio Class . . . . . . . . . . . . . . . . 5 3.2. Output Constraint Level . . . . . . . . . . . . . . . . . 6 3.3. Latency Sensitivity Class . . . . . . . . . . . . . . . . 6 3.4. Concurrency Pattern . . . . . . . . . . . . . . . . . . . 7 3.5. Prefix Sharing Pattern . . . . . . . . . . . . . . . . . 7 4. Profile Groups and Definitions . . . . . . . . . . . . . . . 8 4.1. Common Benchmark Run Parameters . . . . . . . . . . . . . 9 4.2. Group A: Non-Generative Profiles . . . . . . . . . . . . 10 4.2.1. EMBED: Embedding Generation . . . . . . . . . . . . . 11 4.2.2. XRANK: Cross-Encoder Reranking . . . . . . . . . . . 12 4.2.3. LOGPR: Logprob / Perplexity Scoring . . . . . . . . . 13 4.3. Group B: Minimal Output Profiles . . . . . . . . . . . . 15 4.3.1. CLAS: Classification and Labeling . . . . . . . . . . 15 4.3.2. SFQA: Short-Form QA / Factoid . . . . . . . . . . . . 16 4.3.3. RANK: Generative Ranking / Scoring . . . . . . . . . 18 4.3.4. NER: Entity Resolution and Recognition . . . . . . . 19 4.3.5. FUNC: Function Calling / Tool Selection . . . . . . . 20 4.3.6. SQLN: SQL / Query Generation . . . . . . . . . . . . 21 4.4. 
Group C: Interactive Streaming Profiles . . . . . . . . . 22 4.4.1. CHAT: Conversational Chat . . . . . . . . . . . . . . 22 4.4.2. COMP: Code Completion / Autocomplete . . . . . . . . 24 4.4.3. PERS: Dialogue / Roleplay / Persona . . . . . . . . . 25 4.5. Group D: Prefill-Heavy Generative Profiles . . . . . . . 26 4.5.1. SUMM: Summarization . . . . . . . . . . . . . . . . . 26 4.5.2. LDQA: Long Document Question Answering . . . . . . . 28 4.5.3. MDOC: Multi-Document Synthesis . . . . . . . . . . . 29 4.5.4. EXTR: Structured Extraction . . . . . . . . . . . . . 30 4.5.5. JUDG: LLM-as-Judge / Evaluation . . . . . . . . . . . 31 4.5.6. RAGN: Retrieval-Augmented Generation . . . . . . . . 33 4.6. Group E: Decode-Heavy Generative Profiles . . . . . . . . 34 4.6.1. CGEN: Content Generation . . . . . . . . . . . . . . 34 4.6.2. REAS: Reasoning / Chain-of-Thought . . . . . . . . . 35 4.6.3. DGEN: Data Generation / Synthetic Data . . . . . . . 37 4.6.4. TRNS: Translation . . . . . . . . . . . . . . . . . . 38 Mondal & Gaikwad Expires 12 October 2026 [Page 2] Internet-Draft LLM Benchmarking Profiles April 2026 4.6.5. EDIT: Rewriting / Editing . . . . . . . . . . . . . . 39 4.7. Group F: Multi-Step Chained Profiles . . . . . . . . . . 40 4.7.1. AGNT: Agentic / Tool Use . . . . . . . . . . . . . . 41 4.7.2. PLAN: Planning / Task Decomposition . . . . . . . . . 42 5. Profile-to-Metric Applicability Matrix . . . . . . . . . . . 43 6. Profile Composition for Mixed Workloads . . . . . . . . . . . 47 6.1. Weighted Profile Combinations . . . . . . . . . . . . . . 47 6.2. Reporting Requirements for Mixed Workloads . . . . . . . 47 7. Cross-Cutting Dimensions . . . . . . . . . . . . . . . . . . 48 7.1. Batch vs. Online . . . . . . . . . . . . . . . . . . . . 48 7.2. Streaming vs. Non-Streaming . . . . . . . . . . . . . . . 48 7.3. Fill-in-the-Middle . . . . . . . . . . . . . . . . . . . 49 8. Security Considerations . . . . . . . . . . . . . . . . . . . 49 8.1. 
Workload-Specific Attack Surfaces . . . . . . . . . . . . 49 8.2. Profile Gaming . . . . . . . . . . . . . . . . . . . . . 49 8.3. Data Privacy in Reference Datasets . . . . . . . . . . . 49 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 50 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 50 10.1. Normative References . . . . . . . . . . . . . . . . . . 50 10.2. Informative References . . . . . . . . . . . . . . . . . 50 Appendix A. Reference Datasets for Each Profile . . . . . . . . 50 Appendix B. Profile Parameter Quick Reference Table . . . . . . 53 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 56 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 56 1. Introduction The companion document [LLM-METHOD] defines test procedures, measurement specifications, and reporting formats for LLM inference systems. A second companion [LLM-TERMS] defines the metrics vocabulary. Together these two documents specify how to measure. This document specifies what to measure against. Production LLM deployments serve many different task types. These tasks differ in fundamental inference properties. Some workloads have no decode phase at all. Embedding workloads are one example. Some workloads produce free-form text. Others produce output constrained to a schema or a label set. Some users wait interactively for streaming tokens. Others consume results in batch. Some requests share a large input prefix with other requests. Others arrive independently. Mondal & Gaikwad Expires 12 October 2026 [Page 3] Internet-Draft LLM Benchmarking Profiles April 2026 Some workloads are a single inference call. Others are a chain of dependent calls. These properties determine which metrics from [LLM-TERMS] are meaningful. They determine which tests from [LLM-METHOD] apply. They determine what performance characteristics matter in production. Without standard profiles, benchmark results are hard to compare. 
A system tuned for chat may perform poorly on summarization. A system benchmarked only on synthetic uniform workloads may not match any real production scenario. This document addresses that gap by defining named, reusable workload profiles.

1.1. Relationship to Companion Documents

This document is the third in a series. [LLM-TERMS] defines the metrics vocabulary. Examples are Time to First Token, Inter-Token Latency, and Output Token Throughput. [LLM-METHOD] defines test procedures. Examples are how to measure TTFT, how to set up load generators, and how to format reports. This document defines workload profiles. It specifies what representative tasks to benchmark against. It maps each profile to the applicable metrics and tests.

Implementers SHOULD select profiles that match their production workload mix. They SHOULD apply the tests from [LLM-METHOD] to those profiles. They SHOULD report results per profile.

1.2. Document Structure

Section 3 defines the classification framework. It uses five dimensions to describe each profile. Section 4 defines the common benchmark run parameters that MUST be reported for all profiles (Section 4.1). Sections 4.2 through 4.7 define the profiles. Section 5 maps profiles to applicable metrics. Section 6 covers mixed-workload composition. Section 7 covers dimensions that apply to any profile. Section 8 covers security considerations. Section 9 covers IANA considerations.

1.3. Out of Scope

The following topics are outside the scope of this document.

Model training is out of scope. This document covers inference serving only.

Mondal & Gaikwad Expires 12 October 2026 [Page 4]

Internet-Draft LLM Benchmarking Profiles April 2026

Model quality is out of scope. Output correctness and accuracy are not performance metrics here. The one exception is output validity rates for structured output profiles, which MAY be reported as a secondary metric.

Network and client-side latency are out of scope. The exception is when the system under test (SUT) boundary is defined to include network transport per [LLM-METHOD].
Multimodal workloads are out of scope. This document covers text-only LLM serving. Multimodal input and output such as images, audio, and video are left for future work.

2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Profile Classification Framework

Each profile is described along five dimensions. These dimensions allow systematic comparison across profiles. They help implementers understand the inference properties of each workload.

3.1. Input/Output Ratio Class

This ratio is computed as input tokens divided by output tokens. It determines the balance between the prefill phase and the decode phase. It is often the main factor in whether a workload is compute-bound or memory-bandwidth-bound. The exact boundary depends on model architecture, batch size, quantization, and attention implementation.

For profiles that produce hidden thinking tokens (see Section 4.6.2), those tokens count as output tokens for the ratio. The exception is when the serving system hides them entirely from the caller. In that case, the serving system MUST document whether max_output_tokens applies to visible tokens only, to total tokens, or to both.

Prefill-Only (PO): No token generation. The output is a vector, a score, or a probability distribution. Examples are embedding, cross-encoder reranking, and perplexity scoring.

Prefill-Dominant (PD): Input is much larger than output. The ratio is greater than 10:1. Examples are summarization, classification, and extraction.

Mondal & Gaikwad Expires 12 October 2026 [Page 5]

Internet-Draft LLM Benchmarking Profiles April 2026

Balanced (BA): Input and output are comparable in length. The ratio is between 1:3 and 3:1. Examples are translation, conversational chat, and rewriting.
Decode-Dominant (DD): Output is much larger than input. The ratio is less than 1:10. Examples are content generation, creative writing, and data generation.

3.2. Output Constraint Level

This dimension describes how much the output format is constrained. It affects decode efficiency, guided generation overhead, and output validation requirements.

Non-Token (NT): The output is not a token sequence. Examples are vectors, floating-point scores, and probability distributions.

Minimal (MI): The output is a single token, a short label, or a numeric score. Examples are classification and yes/no decisions.

Structured (ST): The output MUST conform to a defined schema. Examples are JSON objects, SQL queries, and function call arguments.

Semi-Structured (SS): The output follows a general format but has variable content. Examples are Markdown, numbered lists, and tables.

Free-Form (FF): There are no structural constraints. Examples are natural language text, creative writing, and conversational responses.

3.3. Latency Sensitivity Class

This dimension describes how much latency the consumer can tolerate. It determines whether TTFT, ITL, or throughput is the primary optimization target.

The latency targets listed below are informative. They reflect targets commonly used in production. They are not normative benchmark pass/fail thresholds. Actual targets depend on model size, hardware, network configuration, and operator SLOs.

Interactive (IN): The user watches tokens arrive in real time. Interactive deployments often target sub-500ms TTFT as a user-experience goal. ITL consistency is critical. Examples are chat and code completion.

Mondal & Gaikwad Expires 12 October 2026 [Page 6]

Internet-Draft LLM Benchmarking Profiles April 2026

Near-Real-Time (NR): The response is needed within seconds. It is usually not streamed token by token. Production deployments commonly target total latency below 5 seconds. Examples are search ranking, extraction, and function calling.
Throughput-Oriented (TO): Latency is secondary. The goal is to maximize aggregate throughput. Examples are batch classification, bulk summarization, and data generation.

3.4. Concurrency Pattern

This dimension describes the typical request arrival and batching pattern. It affects scheduler behavior and resource utilization.

Single-Stream (SI): Requests arrive individually. They follow a Poisson or similar arrival process. Examples are general API usage and chat.

Burst-Batch (BB): Groups of related requests arrive together. Examples are ranking a set of candidates and classifying a document batch.

Sustained-High (SH): A continuous high-volume stream of requests arrives at a stable rate. Examples are production pipelines and data processing.

Chained-Sequential (CS): Each request depends on the output of the previous request. Examples are agentic workflows and planning pipelines.

3.5. Prefix Sharing Pattern

This dimension describes how much requests share common input prefixes. It affects KV cache utilization and memory efficiency.

In this document, prefix sharing means cross-request shared prefill reuse. A serving system retains and reuses the computed key-value (KV) state for a shared input prefix across multiple distinct requests. Prefix sharing is distinct from two other mechanisms that are sometimes called caching.

The first is intra-request KV cache. This is the normal decode-phase cache retained within a single request. It is present in all generative profiles. It is not what this dimension measures.

The second is external result caching. This is deduplication of identical complete requests at an API gateway or proxy layer. This is out of scope for this document.

Mondal & Gaikwad Expires 12 October 2026 [Page 7]

Internet-Draft LLM Benchmarking Profiles April 2026

When a profile has a non-None prefix sharing value, benchmarks MUST report prefix cache hit rates.
A cache hit means the serving system determined that the leading N tokens of an incoming request match a cached prefix exactly. The match is made token by token or via an equivalent hash. The cached KV state is then reused without recomputation. Serving systems MUST document the cache eviction policy and the cache type as part of the run parameters defined in Section 4.1.

None (N): Each request has a unique prefix. An example is independent short queries.

System-Prompt (S): A shared system prompt appears across requests. It is typically 100 to 2000 tokens. Most API deployments follow this pattern.

Document-Shared (D): Multiple requests share a large document context. It is typically thousands to tens of thousands of tokens. Examples are ranking candidates against a shared passage and multi-question QA over a document.

Turn-Accumulated (T): The prefix grows across conversation turns. Examples are multi-turn chat and agentic tool loops.

Schema-Shared (H): Requests share a tool or function schema as a prefix. Examples are function calling and SQL generation with a shared database schema.

4. Profile Groups and Definitions

This section defines 25 workload profiles in six groups. Section 4.1 specifies common benchmark run parameters that apply to all profiles. Sections 4.2 through 4.7 define the profiles.

Each profile includes the following.

Description: What the task is and why it is a distinct benchmarking profile.

Classification Vector: The five-dimensional classification from Section 3.

Typical Token Distribution: Representative input and output token length ranges.

Key Performance Indicators: The most important metrics from [LLM-TERMS] for this profile.

Mondal & Gaikwad Expires 12 October 2026 [Page 8]

Internet-Draft LLM Benchmarking Profiles April 2026

Benchmarking Considerations: Profile-specific setup, measurement, or reporting guidance.

4.1. Common Benchmark Run Parameters

Benchmark results are only comparable when the configuration is fully specified.
The following parameters MUST be reported for every benchmark run. They apply in addition to any profile-specific requirements in Sections 4.2 through 4.7.

+===================+============================================+
| Parameter         | Requirement                                |
+===================+============================================+
| Model identifier  | MUST report exact model name and version   |
| and version       | string                                     |
+-------------------+--------------------------------------------+
| Tokenizer name    | MUST report; tokenizer affects all token   |
| and version       | count measurements                         |
+-------------------+--------------------------------------------+
| Decoding strategy | MUST report: greedy, sampling, or beam     |
|                   | search                                     |
+-------------------+--------------------------------------------+
| Temperature       | MUST report if sampling; N/A for greedy    |
+-------------------+--------------------------------------------+
| top_p             | MUST report if sampling                    |
+-------------------+--------------------------------------------+
| top_k             | MUST report if sampling                    |
+-------------------+--------------------------------------------+
| max_output_tokens | MUST report; specify whether limit applies |
|                   | to visible tokens, total tokens, or both   |
+-------------------+--------------------------------------------+
| Stop sequences    | MUST report all stop conditions used       |
+-------------------+--------------------------------------------+
| Streaming enabled | MUST report: enabled or disabled           |
+-------------------+--------------------------------------------+
| Server token      | MUST report if streaming enabled (e.g.,    |
| flush policy      | per-token, batched-N, time-based)          |
+-------------------+--------------------------------------------+
| Guided decoding   | MUST report: enabled or disabled           |
| enabled           |                                            |
+-------------------+--------------------------------------------+
| Guided decoding   | MUST report if enabled (e.g., grammar-     |
| method            | based, JSON schema, logit masking)         |
+-------------------+--------------------------------------------+
| Guided decoding   | MUST report if enabled: schema token count |
| schema size       | and nesting depth                          |
+-------------------+--------------------------------------------+
| Prefix cache      | MUST report: enabled or disabled           |
| enabled           |                                            |
+-------------------+--------------------------------------------+

Mondal & Gaikwad Expires 12 October 2026 [Page 9]

Internet-Draft LLM Benchmarking Profiles April 2026

| Prefix cache type | MUST report if enabled (e.g., block-level  |
|                   | KV cache, full-prefix cache)               |
+-------------------+--------------------------------------------+
| Prefix cache      | MUST report if enabled (e.g., LRU, TTL-    |
| eviction policy   | based, capacity limit)                     |
+-------------------+--------------------------------------------+
| Hardware          | MUST report: accelerator type, count, and  |
| configuration     | memory capacity                            |
+-------------------+--------------------------------------------+
| Serving framework | MUST report                                |
| and version       |                                            |
+-------------------+--------------------------------------------+
| Quantization      | MUST report: precision (e.g., fp16, int8,  |
|                   | int4) and method if applicable             |
+-------------------+--------------------------------------------+

Table 1

Implementers SHOULD also report the version of any load generation tool used. They SHOULD report the network topology between the load generator and the SUT.

4.2. Group A: Non-Generative Profiles

Non-generative profiles produce no output token sequence. The inference pipeline runs only the prefill pass. It returns a non-token result. This result is a vector embedding, a scalar score, or a per-token probability distribution. There is no decode phase.

The following metrics from [LLM-TERMS] do not apply to Group A profiles.

* Time to First Token (TTFT)
* Inter-Token Latency (ITL)
* Output Token Throughput
* Time Per Output Token (TPOT)

Group A profiles MUST instead be measured using the following.
Request Latency: Total time from request submission to result receipt.

Sequences Per Second: Number of complete inference requests processed per second under the specified concurrency. For Group A profiles, one sequence equals one request, so Seq/s and Req/s are identical. When a single API call submits multiple sequences in a batch, throughput MUST be reported as sequences per second, not API calls per second.

Mondal & Gaikwad Expires 12 October 2026 [Page 10]

Internet-Draft LLM Benchmarking Profiles April 2026

Prefill Throughput: Input tokens processed per second.

4.2.1. EMBED: Embedding Generation

The model encodes input text into a fixed-size dense vector. This vector is used for semantic search, retrieval, clustering, and similarity computation. This is one of the highest-volume LLM inference workloads in production. Every RAG pipeline, recommendation system, and semantic search index depends on embedding generation.

Classification Vector:

+=====================+==========================+
| Dimension           | Value                    |
+=====================+==========================+
| I/O Ratio           | Prefill-Only (PO)        |
+---------------------+--------------------------+
| Output Constraint   | Non-Token (NT)           |
+---------------------+--------------------------+
| Latency Sensitivity | Near-Real-Time (NR) to   |
|                     | Throughput-Oriented (TO) |
+---------------------+--------------------------+
| Concurrency Pattern | Sustained-High (SH)      |
+---------------------+--------------------------+
| Prefix Sharing      | None (N)                 |
+---------------------+--------------------------+

Table 2

Typical Token Distribution:

+==============+=================================+
| Parameter    | Range                           |
+==============+=================================+
| Input Length | 16 to 8192 tokens               |
+--------------+---------------------------------+
| Output       | Fixed-size vector (e.g., 768,   |
|              | 1024, 1536, or 3072 dimensions) |
+--------------+---------------------------------+

Table 3

Mondal & Gaikwad Expires 12 October 2026 [Page 11]

Internet-Draft LLM Benchmarking Profiles April 2026

Key Performance Indicators:

* Sequences per second at target batch size
* P50 and P99 request latency under load
* Prefill throughput in input tokens per second
* Throughput scaling with batch size

Embedding workloads batch well. Benchmarks SHOULD measure throughput across batch sizes of 1, 8, 32, 64, 128, and 256 to show batching efficiency.

Input length SHOULD come from a representative corpus. Real embedding workloads have high length variance because short queries and long passages are both common. Benchmarks SHOULD report latency and throughput separately for query-length inputs (fewer than 64 tokens) and passage-length inputs (256 to 8192 tokens). These two cases often serve different stages of a retrieval pipeline.

4.2.2. XRANK: Cross-Encoder Reranking

The model receives a query and a document. It produces a relevance score without generating tokens. Cross-encoder reranking is used in search pipelines to refine initial retrieval results. The model processes the concatenated query and document pair. It outputs a scalar relevance score.
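The full-candidate-set latency requirement for this profile can be sketched as follows. This is an illustrative harness, not a real cross-encoder: `score_pair` is a hypothetical stand-in for a prefill-only forward pass over the concatenated pair, and its token-overlap scoring is a placeholder.

```python
def score_pair(query: str, document: str) -> float:
    # Placeholder for a cross-encoder: a real system would run a
    # prefill-only forward pass over the concatenated query+document
    # and read a scalar from a regression head.
    overlap = len(set(query.split()) & set(document.split()))
    return overlap / max(len(document.split()), 1)

def rerank(query: str, candidates: list[str]) -> list[tuple[float, str]]:
    # Score every candidate and sort best-first. A benchmark reports
    # the latency of this whole call (the full candidate set), not the
    # latency of a single pair.
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    return sorted(scored, key=lambda s: s[0], reverse=True)

docs = ["token latency metrics", "unrelated text", "latency under load"]
ranked = rerank("latency metrics", docs)
```

Timing the single `rerank` call, rather than each `score_pair` call, matches how a search pipeline actually consumes the result.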
Classification Vector:

+=====================+=====================+
| Dimension           | Value               |
+=====================+=====================+
| I/O Ratio           | Prefill-Only (PO)   |
+---------------------+---------------------+
| Output Constraint   | Non-Token (NT)      |
+---------------------+---------------------+
| Latency Sensitivity | Near-Real-Time (NR) |
+---------------------+---------------------+
| Concurrency Pattern | Burst-Batch (BB)    |
+---------------------+---------------------+
| Prefix Sharing      | Document-Shared (D) |
+---------------------+---------------------+

Table 4

Typical Token Distribution:

Mondal & Gaikwad Expires 12 October 2026 [Page 12]

Internet-Draft LLM Benchmarking Profiles April 2026

+==============+==========================================+
| Parameter    | Range                                    |
+==============+==========================================+
| Input Length | 128 to 1024 tokens (query plus document) |
+--------------+------------------------------------------+
| Output       | Single scalar score                      |
+--------------+------------------------------------------+

Table 5

Key Performance Indicators:

* Pairs scored per second at target batch size
* P50 and P99 request latency per scoring batch
* Throughput scaling as candidate set size grows

Cross-encoder reranking typically processes a batch of candidates, commonly 10 to 100 query-document pairs. The query is shared across all pairs. Benchmarks SHOULD measure performance with the shared query prefix to capture KV cache benefits. Candidate set sizes of 10, 25, 50, and 100 SHOULD be tested.

Latency SHOULD be reported for the full candidate set. This is the time to score all candidates, not just one pair. This matches real search pipeline requirements.

4.2.3. LOGPR: Logprob / Perplexity Scoring

The model processes input text. It returns per-token log probabilities or a sequence-level perplexity score. It does not generate new tokens.
This is used for AI-generated content detection, model evaluation, data quality filtering, watermark detection, and calibration. The model runs a forward pass over the input. It returns the probability distribution at each token position.

Classification Vector:

Mondal & Gaikwad Expires 12 October 2026 [Page 13]

Internet-Draft LLM Benchmarking Profiles April 2026

+=====================+==========================+
| Dimension           | Value                    |
+=====================+==========================+
| I/O Ratio           | Prefill-Only (PO)        |
+---------------------+--------------------------+
| Output Constraint   | Non-Token (NT)           |
+---------------------+--------------------------+
| Latency Sensitivity | Throughput-Oriented (TO) |
+---------------------+--------------------------+
| Concurrency Pattern | Sustained-High (SH)      |
+---------------------+--------------------------+
| Prefix Sharing      | None (N)                 |
+---------------------+--------------------------+

Table 6

Typical Token Distribution:

+==============+==================================================+
| Parameter    | Range                                            |
+==============+==================================================+
| Input Length | 64 to 4096 tokens                                |
+--------------+--------------------------------------------------+
| Output       | Per-token logprobs or aggregate perplexity score |
+--------------+--------------------------------------------------+

Table 7

Key Performance Indicators:

* Sequences scored per second
* Input tokens processed per second
* P50 and P99 request latency by input length bucket

Logprob scoring is computationally the same as the prefill phase of generative inference. The difference is that it returns probability data instead of starting a decode phase. It differs from embedding in one key way. The full vocabulary probability distribution may be returned at each position. This creates large output data volume even though no tokens are generated. Benchmarks SHOULD specify whether full vocabulary logprobs or only top-k logprobs are returned.
This choice significantly affects data transfer overhead. Input length SHOULD span the full range to show prefill scaling. Mondal & Gaikwad Expires 12 October 2026 [Page 14] Internet-Draft LLM Benchmarking Profiles April 2026 4.3. Group B: Minimal Output Profiles Minimal output profiles generate a small number of tokens. The output is a label, a score, a short answer, or a small structured object. The decode phase is short relative to prefill. These workloads are typically prefill-dominant and work well with batching. For Group B profiles, the metric priorities differ from streaming workloads. Total Request Latency is the primary latency metric. TTFT and ITL are not primary here. Request Throughput in requests per second is more useful than output token throughput. TTFT is still measurable. It is approximately equal to total latency because the decode phase is very short. 4.3.1. CLAS: Classification and Labeling The model reads input text. It produces a categorical label or set of labels from a predefined taxonomy. Examples are sentiment analysis, content moderation, topic categorization, intent detection, and spam filtering. 
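The effect of restricting output to a predefined taxonomy can be sketched at the selection level. This toy example only emulates what guided decoding enforces at the logit level; the label names and scores are invented for illustration:

```python
# Illustrative taxonomy for a sentiment-style task.
LABELS = {"positive", "negative", "neutral"}

def pick_label(label_scores: dict[str, float]) -> str:
    # Guided decoding masks out any continuation that cannot form a
    # valid label; here we emulate that by dropping out-of-taxonomy
    # candidates before taking the argmax.
    valid = {k: v for k, v in label_scores.items() if k in LABELS}
    return max(valid, key=valid.get)

# "sarcastic" is outside the taxonomy and is ignored even though it
# has the highest raw score.
label = pick_label({"positive": 0.2, "negative": 0.7, "sarcastic": 0.9})
```

In a real serving system the constraint is applied during decoding, so the masking step itself is part of the measured per-request cost.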
Classification Vector:

+=====================+==========================+
| Dimension           | Value                    |
+=====================+==========================+
| I/O Ratio           | Prefill-Dominant (PD)    |
+---------------------+--------------------------+
| Output Constraint   | Minimal (MI)             |
+---------------------+--------------------------+
| Latency Sensitivity | Near-Real-Time (NR) to   |
|                     | Throughput-Oriented (TO) |
+---------------------+--------------------------+
| Concurrency Pattern | Sustained-High (SH)      |
+---------------------+--------------------------+
| Prefix Sharing      | System-Prompt (S)        |
+---------------------+--------------------------+

Table 8

Typical Token Distribution:

Mondal & Gaikwad Expires 12 October 2026 [Page 15]

Internet-Draft LLM Benchmarking Profiles April 2026

+===============+===================+
| Parameter     | Range             |
+===============+===================+
| Input Length  | 32 to 2048 tokens |
+---------------+-------------------+
| Output Length | 1 to 10 tokens    |
+---------------+-------------------+

Table 9

Key Performance Indicators:

* Requests classified per second
* P50 and P99 total request latency
* Throughput scaling with batch size

Classification workloads often use constrained decoding or guided generation to restrict output to valid labels. Benchmarks SHOULD report whether guided generation is enabled. They SHOULD report which method is used, such as grammar-based or logit masking.

System prompts typically include the label taxonomy. They SHOULD be included in the benchmark setup. The label set size SHOULD be specified. Binary, multi-class, and multi-label variants have different guided generation overhead.

4.3.2. SFQA: Short-Form QA / Factoid

The model receives a short question. It may also receive brief context. It produces a concise factual answer. Examples are customer support FAQ, knowledge base queries, and simple information retrieval. It differs from long-document QA (LDQA) in that the input context is short.
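Because short-answer workloads like this are measured primarily by total-latency percentiles, deriving P50 and P99 from a run's recorded per-request latencies can be sketched as follows; the latency values below are synthetic:

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    # statistics.quantiles with n=100 returns the 99 percentile cut
    # points; index 49 is P50 and index 98 is P99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return cuts[49], cuts[98]

# Synthetic run: 200 requests with latencies 100..299 ms.
p50, p99 = latency_percentiles([100.0 + i for i in range(200)])
```

Percentiles computed this way interpolate between observed samples, so a run needs enough requests (well over 100) for P99 to be meaningful.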
Classification Vector:

Mondal & Gaikwad Expires 12 October 2026 [Page 16]

Internet-Draft LLM Benchmarking Profiles April 2026

+=====================+=========================================+
| Dimension           | Value                                   |
+=====================+=========================================+
| I/O Ratio           | Balanced (BA) to Prefill-Dominant (PD)  |
+---------------------+-----------------------------------------+
| Output Constraint   | Free-Form (FF)                          |
+---------------------+-----------------------------------------+
| Latency Sensitivity | Interactive (IN) to Near-Real-Time (NR) |
+---------------------+-----------------------------------------+
| Concurrency Pattern | Single-Stream (SI)                      |
+---------------------+-----------------------------------------+
| Prefix Sharing      | System-Prompt (S)                       |
+---------------------+-----------------------------------------+

Table 10

Typical Token Distribution:

+===============+==================+
| Parameter     | Range            |
+===============+==================+
| Input Length  | 16 to 512 tokens |
+---------------+------------------+
| Output Length | 5 to 100 tokens  |
+---------------+------------------+

Table 11

Key Performance Indicators:

* TTFT if streamed
* Total request latency
* Requests per second under concurrency

This profile represents the simple API call pattern. It is the baseline for many LLM applications. It SHOULD be used as a reference workload to establish baseline system performance before testing more complex profiles. Input length SHOULD include very short queries (fewer than 32 tokens) and context-augmented queries (256 to 512 tokens).

Mondal & Gaikwad Expires 12 October 2026 [Page 17]

Internet-Draft LLM Benchmarking Profiles April 2026

4.3.3. RANK: Generative Ranking / Scoring

The model evaluates one or more candidates. It produces a relevance score, a preference ordering, or a comparative judgment as generated tokens.
   This profile differs from cross-encoder reranking (XRANK) in that
   it generates token output such as scores, explanations, or ordinal
   labels, whereas XRANK produces a raw scalar from the model's hidden
   state.

   Classification Vector:

         +=====================+=================================+
         | Dimension           | Value                           |
         +=====================+=================================+
         | I/O Ratio           | Prefill-Dominant (PD)           |
         +---------------------+---------------------------------+
         | Output Constraint   | Structured (ST) to Minimal (MI) |
         +---------------------+---------------------------------+
         | Latency Sensitivity | Near-Real-Time (NR)             |
         +---------------------+---------------------------------+
         | Concurrency Pattern | Burst-Batch (BB)                |
         +---------------------+---------------------------------+
         | Prefix Sharing      | Document-Shared (D)             |
         +---------------------+---------------------------------+

                                 Table 12

   Typical Token Distribution:

    +===============+=============================================+
    | Parameter     | Range                                       |
    +===============+=============================================+
    | Input Length  | 256 to 4096 tokens (prompt plus candidates) |
    +---------------+---------------------------------------------+
    | Output Length | 5 to 50 tokens (scores or ranking)          |
    +---------------+---------------------------------------------+

                                 Table 13

   Key Performance Indicators:

   *  Time to score the full candidate set

   *  Requests per second

   *  Prefix cache hit rate across candidates sharing context

   Ranking workloads have strong prefix sharing.  When ranking N
   candidates against a shared query or document, the shared prefix
   may be 80 to 90 percent of total input tokens.  Benchmarks MUST
   report prefix cache hit rates and SHOULD compare performance with
   and without prefix caching.  Candidate set sizes of 5, 10, 20, and
   50 SHOULD be tested.

4.3.4.  NER: Entity Resolution and Recognition

   The model reads input text.
   It identifies named entities and produces a structured list of
   entity mentions with their types and positions.  This profile is
   used for information extraction from documents, log parsing, and
   data enrichment pipelines.

   Classification Vector:

        +=====================+==========================+
        | Dimension           | Value                    |
        +=====================+==========================+
        | I/O Ratio           | Prefill-Dominant (PD)    |
        +---------------------+--------------------------+
        | Output Constraint   | Structured (ST)          |
        +---------------------+--------------------------+
        | Latency Sensitivity | Throughput-Oriented (TO) |
        +---------------------+--------------------------+
        | Concurrency Pattern | Sustained-High (SH)      |
        +---------------------+--------------------------+
        | Prefix Sharing      | System-Prompt (S)        |
        +---------------------+--------------------------+

                                 Table 14

   Typical Token Distribution:

     +===============+===========================================+
     | Parameter     | Range                                     |
     +===============+===========================================+
     | Input Length  | 64 to 2048 tokens                         |
     +---------------+-------------------------------------------+
     | Output Length | 20 to 200 tokens (structured entity list) |
     +---------------+-------------------------------------------+

                                 Table 15

   Key Performance Indicators:

   *  Documents processed per second

   *  Total request latency by input length

   *  Throughput under batch processing

   NER output length varies with input content density.  Benchmarks
   SHOULD use representative documents with known entity density so
   that output length distributions are realistic.  Guided generation
   to a JSON schema is common; whether it is enabled SHOULD be
   reported.

4.3.5.  FUNC: Function Calling / Tool Selection

   The model receives a user request and a set of tool or function
   definitions.  It selects the appropriate function and produces
   well-formed arguments.
   The tool schema definitions form a large and highly cacheable
   prefix.

   Classification Vector:

           +=====================+=======================+
           | Dimension           | Value                 |
           +=====================+=======================+
           | I/O Ratio           | Prefill-Dominant (PD) |
           +---------------------+-----------------------+
           | Output Constraint   | Structured (ST)       |
           +---------------------+-----------------------+
           | Latency Sensitivity | Near-Real-Time (NR)   |
           +---------------------+-----------------------+
           | Concurrency Pattern | Single-Stream (SI)    |
           +---------------------+-----------------------+
           | Prefix Sharing      | Schema-Shared (H)     |
           +---------------------+-----------------------+

                                 Table 16

   Typical Token Distribution:

    +===============+=============================================+
    | Parameter     | Range                                       |
    +===============+=============================================+
    | Input Length  | 512 to 4096 tokens (schema plus user query) |
    +---------------+---------------------------------------------+
    | Output Length | 20 to 200 tokens (function name plus JSON   |
    |               | arguments)                                  |
    +---------------+---------------------------------------------+

                                 Table 17

   Key Performance Indicators:

   *  Total request latency

   *  TTFT as an indicator of schema processing time

   *  Prefix cache hit rate for shared tool schemas

   Tool schemas vary widely in size.  A small schema with 5 tools may
   be a few hundred tokens, while a large schema with 50 tools and
   complex parameters may be several thousand tokens.  Benchmarks
   SHOULD test with schema sizes of approximately 500, 2000, and 4000
   tokens.  Tool schemas are highly shareable as prefixes, so prefix
   caching effectiveness is a critical measurement.  Constrained
   decoding to valid JSON is standard and SHOULD be enabled.

4.3.6.  SQLN: SQL / Query Generation

   The model receives a natural language question and a database
   schema definition.  It produces a syntactically valid SQL query.
   The schema prefix is large and shared across queries against the
   same database.

   Classification Vector:

      +=====================+========================================+
      | Dimension           | Value                                  |
      +=====================+========================================+
      | I/O Ratio           | Prefill-Dominant (PD)                  |
      +---------------------+----------------------------------------+
      | Output Constraint   | Structured (ST)                        |
      +---------------------+----------------------------------------+
      | Latency Sensitivity | Near-Real-Time (NR)                    |
      +---------------------+----------------------------------------+
      | Concurrency Pattern | Single-Stream (SI) to Burst-Batch (BB) |
      +---------------------+----------------------------------------+
      | Prefix Sharing      | Schema-Shared (H)                      |
      +---------------------+----------------------------------------+

                                 Table 18

   Typical Token Distribution:

     +===============+===========================================+
     | Parameter     | Range                                     |
     +===============+===========================================+
     | Input Length  | 256 to 8192 tokens (schema plus question) |
     +---------------+-------------------------------------------+
     | Output Length | 20 to 300 tokens (SQL query)              |
     +---------------+-------------------------------------------+

                                 Table 19

   Key Performance Indicators:

   *  Total request latency

   *  Prefix cache effectiveness for shared schema

   *  Requests per second under concurrent users on the same schema

   Database schemas range from simple to complex.  A simple schema may
   have 5 tables and around 500 tokens; a complex schema may have 100
   or more tables and 8000 or more tokens.  Benchmarks SHOULD test
   with at least two schema sizes.  Multiple queries against the same
   schema are the dominant production pattern, so prefix caching
   measurements are critical for this profile.

4.4.  Group C: Interactive Streaming Profiles

   Interactive streaming profiles serve users who watch tokens arrive
   in real time.
   These profiles share four key properties: TTFT is critical for
   perceived responsiveness; ITL consistency determines perceived
   fluency; multi-turn context accumulation is common; and prefix
   caching across conversation turns adds value.

   For Group C profiles, all streaming metrics from [LLM-TERMS] apply:
   TTFT, ITL, TPOT, and output token throughput.  The Throughput-
   Latency Tradeoff test from [LLM-METHOD] is especially important for
   this group.

4.4.1.  CHAT: Conversational Chat

   A user and the model have a multi-turn conversation.  Each turn
   adds to the conversation history, creating a growing input context.
   This is the most common interactive LLM workload and the default
   profile for general-purpose chat applications.

   Classification Vector:

            +=====================+======================+
            | Dimension           | Value                |
            +=====================+======================+
            | I/O Ratio           | Balanced (BA)        |
            +---------------------+----------------------+
            | Output Constraint   | Free-Form (FF)       |
            +---------------------+----------------------+
            | Latency Sensitivity | Interactive (IN)     |
            +---------------------+----------------------+
            | Concurrency Pattern | Single-Stream (SI)   |
            +---------------------+----------------------+
            | Prefix Sharing      | Turn-Accumulated (T) |
            +---------------------+----------------------+

                                 Table 20

   Typical Token Distribution:

   +========================+=======================================+
   | Parameter              | Range                                 |
   +========================+=======================================+
   | Input Length           | 64 to 16384 tokens (grows with turns) |
   +------------------------+---------------------------------------+
   | Output Length          | 50 to 1000 tokens                     |
   +------------------------+---------------------------------------+
   | Turns per conversation | 3 to 20                               |
   +------------------------+---------------------------------------+

                                 Table 21

   Key Performance Indicators:

   *  TTFT at various context lengths

   *  P50 and P99 ITL

   *  TTFT degradation as conversation grows

   *  Prefix cache hit rate across turns

   Chat benchmarks MUST test across multiple conversation depths to
   show how TTFT scales as context grows.  Tests at turns 1, 5, 10,
   and 20 SHOULD be included at minimum.  Input data SHOULD come from
   real multi-turn conversation datasets such as ShareGPT [SHAREGPT]
   or LMSYS-Chat [LMSYS] to capture realistic length distributions.
   The time between turns, while the user reads and types, affects
   cache eviction pressure and SHOULD be a configurable benchmark
   parameter.

4.4.2.  COMP: Code Completion / Autocomplete

   The model receives a code context, including the file content,
   cursor position, and surrounding code, and generates a short
   completion.  This profile has very tight latency requirements, very
   short output, and high request frequency.  Prefix sharing is strong
   because the same file context appears in many consecutive requests.
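   As a non-normative illustration, the shared-prefix fraction between
   consecutive completion requests can be computed as follows; the
   token lists are synthetic stand-ins for tokenized file context:

```python
# Sketch of measuring prefix overlap between consecutive
# code-completion requests.  Real benchmarks would compare the actual
# tokenized contexts; the integer token lists below are illustrative.

def shared_prefix_fraction(prev_tokens, curr_tokens):
    """Fraction of the current request covered by the longest common
    prefix with the previous request."""
    n = 0
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        n += 1
    return n / len(curr_tokens) if curr_tokens else 0.0

# Consecutive edits: 95 tokens unchanged, 5 changed at the end.
prev = list(range(100))
curr = list(range(95)) + [900, 901, 902, 903, 904]
frac = shared_prefix_fraction(prev, curr)   # 0.95
```

   Averaging this fraction over a request trace gives an upper bound
   on the prefix cache hit rate the serving system could achieve.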
   Classification Vector:

      +=====================+========================================+
      | Dimension           | Value                                  |
      +=====================+========================================+
      | I/O Ratio           | Prefill-Dominant (PD) to Balanced (BA) |
      +---------------------+----------------------------------------+
      | Output Constraint   | Semi-Structured (SS)                   |
      +---------------------+----------------------------------------+
      | Latency Sensitivity | Interactive (IN)                       |
      +---------------------+----------------------------------------+
      | Concurrency Pattern | Single-Stream (SI)                     |
      +---------------------+----------------------------------------+
      | Prefix Sharing      | Document-Shared (D)                    |
      +---------------------+----------------------------------------+

                                 Table 22

   Typical Token Distribution:

         +===============+===================================+
         | Parameter     | Range                             |
         +===============+===================================+
         | Input Length  | 256 to 8192 tokens (file context) |
         +---------------+-----------------------------------+
         | Output Length | 5 to 200 tokens (completion)      |
         +---------------+-----------------------------------+

                                 Table 23

   Key Performance Indicators:

   *  TTFT (primary).  Interactive deployments often target sub-200ms
      TTFT.  Benchmarks SHOULD report TTFT at P50, P95, and P99.

   *  Total completion latency

   *  Prefix cache hit rate (consecutive edits share most context)

   Code completion has the tightest TTFT requirements of any profile.
   Benchmarks SHOULD simulate realistic editing patterns, in which
   consecutive requests share 95 percent or more of input tokens with
   small modifications.  Fill-in-the-middle (FIM) variants SHOULD be
   tested if the serving system supports FIM.  In FIM, the model
   generates tokens to fill a gap within existing code rather than
   appending to the end.  Both single-line and multi-line completions
   SHOULD be measured.

4.4.3.  PERS: Dialogue / Roleplay / Persona

   The model maintains a persistent character or persona defined by a
   large system prompt.  This differs from generic chat: system
   prompts may be 2000 to 10000 tokens or more, and they include
   character descriptions, world-building context, and behavioral
   guidelines.  System prompt prefix caching is important when many
   users share the same persona.

   Classification Vector:

   +=====================+=============================================+
   | Dimension           | Value                                       |
   +=====================+=============================================+
   | I/O Ratio           | Balanced (BA)                               |
   +---------------------+---------------------------------------------+
   | Output Constraint   | Free-Form (FF)                              |
   +---------------------+---------------------------------------------+
   | Latency Sensitivity | Interactive (IN)                            |
   +---------------------+---------------------------------------------+
   | Concurrency Pattern | Single-Stream (SI)                          |
   +---------------------+---------------------------------------------+
   | Prefix Sharing      | System-Prompt (S) plus                      |
   |                     | Turn-Accumulated (T)                        |
   +---------------------+---------------------------------------------+

                                 Table 24

   Typical Token Distribution:

              +=================+======================+
              | Parameter       | Range                |
              +=================+======================+
              | System Prompt   | 2000 to 10000 tokens |
              +-----------------+----------------------+
              | User Turn Input | 16 to 256 tokens     |
              +-----------------+----------------------+
              | Output Length   | 100 to 2000 tokens   |
              +-----------------+----------------------+

                                 Table 25

   Key Performance Indicators:

   *  TTFT with large system prompts

   *  System prompt prefix cache hit rate across concurrent users

   *  ITL consistency during long responses

   The key benchmark for this profile is performance under large
   shared system prompts with many concurrent users.
   When 1000 users share the same persona, the system prompt KV cache
   should be computed once.  Benchmarks SHOULD measure TTFT and
   throughput as a function of the number of concurrent users sharing
   the same system prompt.  Results with and without prefix caching
   SHOULD both be reported.

4.5.  Group D: Prefill-Heavy Generative Profiles

   Prefill-heavy profiles process long input contexts and generate
   moderate-length output.  The prefill phase dominates compute and
   memory requirements.  These workloads stress KV cache capacity,
   long-context attention efficiency, and memory management.

   The most important metrics for Group D profiles are TTFT (heavily
   influenced by prefill duration), prefill throughput in input tokens
   per second, memory utilization at peak context length, and total
   request latency.

4.5.1.  SUMM: Summarization

   The model reads a document or set of passages and produces a
   condensed summary.  This is one of the most common production LLM
   tasks.  The input is often 10 to 100 times longer than the output,
   and the prefill phase dominates total latency.
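   As a non-normative illustration of this latency split, the sketch
   below estimates prefill and decode time under assumed constant
   throughputs.  The throughput figures are assumptions chosen for
   illustration, not measurements of any particular system:

```python
# Back-of-envelope latency split for a prefill-dominant request.
# Assumes prefill processes input tokens in parallel at a high rate
# while decode emits output tokens sequentially; real systems vary.

def latency_split(input_tokens, output_tokens,
                  prefill_tps=5000.0, decode_tps=80.0):
    """Return (prefill_seconds, decode_seconds) under constant-rate
    assumptions."""
    return input_tokens / prefill_tps, output_tokens / decode_tps

# A 32K-token document summarized into 256 tokens:
prefill_s, decode_s = latency_split(32768, 256)
prefill_share = prefill_s / (prefill_s + decode_s)
```

   With these assumed rates the prefill phase accounts for roughly two
   thirds of total latency, which is why TTFT and prefill throughput
   are the primary metrics for this group.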
   Classification Vector:

      +=====================+========================================+
      | Dimension           | Value                                  |
      +=====================+========================================+
      | I/O Ratio           | Prefill-Dominant (PD)                  |
      +---------------------+----------------------------------------+
      | Output Constraint   | Free-Form (FF) to Semi-Structured (SS) |
      +---------------------+----------------------------------------+
      | Latency Sensitivity | Near-Real-Time (NR) to Throughput-     |
      |                     | Oriented (TO)                          |
      +---------------------+----------------------------------------+
      | Concurrency Pattern | Single-Stream (SI) to Sustained-High   |
      |                     | (SH)                                   |
      +---------------------+----------------------------------------+
      | Prefix Sharing      | None (N)                               |
      +---------------------+----------------------------------------+

                                 Table 26

   Typical Token Distribution:

              +===============+======================+
              | Parameter     | Range                |
              +===============+======================+
              | Input Length  | 1024 to 32768 tokens |
              +---------------+----------------------+
              | Output Length | 64 to 1024 tokens    |
              +---------------+----------------------+

                                 Table 27

   Key Performance Indicators:

   *  TTFT by input length bucket

   *  Total request latency

   *  Prefill throughput in input tokens per second

   *  Documents summarized per second

   Summarization performance depends strongly on input length.
   Benchmarks MUST report results in input length buckets.  Example
   buckets are 1K to 4K, 4K to 8K, 8K to 16K, and 16K to 32K tokens.
   Input data SHOULD come from document corpora with realistic length
   distributions rather than synthetic text.  Both extractive
   summarization (shorter output) and abstractive summarization
   (longer output) SHOULD be tested.

4.5.2.  LDQA: Long Document Question Answering

   The model receives a long document and a question.
   It produces a concise answer from the document content.  This
   profile differs from summarization in that the output is shorter
   and more targeted, and from short-form QA in that it has a large
   context document.

   Classification Vector:

      +=====================+========================================+
      | Dimension           | Value                                  |
      +=====================+========================================+
      | I/O Ratio           | Prefill-Dominant (PD)                  |
      +---------------------+----------------------------------------+
      | Output Constraint   | Free-Form (FF)                         |
      +---------------------+----------------------------------------+
      | Latency Sensitivity | Near-Real-Time (NR)                    |
      +---------------------+----------------------------------------+
      | Concurrency Pattern | Single-Stream (SI) to Burst-Batch (BB) |
      +---------------------+----------------------------------------+
      | Prefix Sharing      | Document-Shared (D)                    |
      +---------------------+----------------------------------------+

                                 Table 28

   Typical Token Distribution:

             +===============+=======================+
             | Parameter     | Range                 |
             +===============+=======================+
             | Input Length  | 4096 to 131072 tokens |
             +---------------+-----------------------+
             | Output Length | 16 to 512 tokens      |
             +---------------+-----------------------+

                                 Table 29

   Key Performance Indicators:

   *  TTFT at various context lengths

   *  TTFT scaling as context length doubles

   *  Total request latency

   *  Prefix cache effectiveness across multiple questions

   Long document QA tests the interaction between context length and
   latency.  Benchmarks SHOULD test with multiple questions over the
   same document to measure prefix caching effectiveness.  Context
   lengths SHOULD span a wide range, for example 4K, 16K, 64K, and
   128K tokens.  The position of the answer within the document can
   affect performance on some systems and MAY be reported.

4.5.3.  MDOC: Multi-Document Synthesis

   The model receives multiple documents and produces one output that
   synthesizes information across all of them.  Examples are
   literature review, competitive analysis, and multi-source report
   generation.  This profile differs from single-document
   summarization in that it requires reasoning across documents, and
   from RAG in that there is no retrieval step: all documents are
   provided in full.

   Classification Vector:

        +=====================+==========================+
        | Dimension           | Value                    |
        +=====================+==========================+
        | I/O Ratio           | Prefill-Dominant (PD)    |
        +---------------------+--------------------------+
        | Output Constraint   | Semi-Structured (SS)     |
        +---------------------+--------------------------+
        | Latency Sensitivity | Throughput-Oriented (TO) |
        +---------------------+--------------------------+
        | Concurrency Pattern | Single-Stream (SI)       |
        +---------------------+--------------------------+
        | Prefix Sharing      | None (N)                 |
        +---------------------+--------------------------+

                                 Table 30

   Typical Token Distribution:

    +===============+============================================+
    | Parameter     | Range                                      |
    +===============+============================================+
    | Input Length  | 8192 to 131072 tokens (multiple documents) |
    +---------------+--------------------------------------------+
    | Output Length | 256 to 4096 tokens                         |
    +---------------+--------------------------------------------+

                                 Table 31

   Key Performance Indicators:

   *  TTFT at extreme context lengths

   *  Total request latency

   *  Memory utilization at peak context

   *  Output throughput during the decode phase

   Multi-document synthesis pushes context length to its maximum.
   Benchmarks SHOULD report the number of documents and the per-
   document length separately, to show whether performance depends on
   total token count alone or also on document structure.
   This profile is a stress test for memory management and
   long-context attention implementations.

4.5.4.  EXTR: Structured Extraction

   The model reads an input document and extracts specific information
   into a defined output schema.  Examples are invoice parsing, resume
   extraction, form processing, and log analysis.  The output MUST
   conform to a specified JSON schema, XML structure, or tabular
   format.

   Classification Vector:

        +=====================+==========================+
        | Dimension           | Value                    |
        +=====================+==========================+
        | I/O Ratio           | Prefill-Dominant (PD)    |
        +---------------------+--------------------------+
        | Output Constraint   | Structured (ST)          |
        +---------------------+--------------------------+
        | Latency Sensitivity | Near-Real-Time (NR) to   |
        |                     | Throughput-Oriented (TO) |
        +---------------------+--------------------------+
        | Concurrency Pattern | Sustained-High (SH)      |
        +---------------------+--------------------------+
        | Prefix Sharing      | System-Prompt (S)        |
        +---------------------+--------------------------+

                                 Table 32

   Typical Token Distribution:

     +===============+===========================================+
     | Parameter     | Range                                     |
     +===============+===========================================+
     | Input Length  | 256 to 16384 tokens                       |
     +---------------+-------------------------------------------+
     | Output Length | 50 to 500 tokens (structured JSON or XML) |
     +---------------+-------------------------------------------+

                                 Table 33

   Key Performance Indicators:

   *  Documents processed per second

   *  Total request latency

   *  Guided generation overhead vs. an unconstrained baseline

   Extraction workloads typically use constrained or guided decoding
   to ensure the output conforms to the schema.  Benchmarks MUST
   report whether guided generation is enabled, and MUST report the
   schema complexity, including the number of fields and nesting
   depth.
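   As a non-normative illustration, field count and nesting depth can
   be derived from an object schema with a short recursive walk.  The
   invoice schema below is an illustrative JSON-Schema-like structure,
   not a required format:

```python
# Sketch of reporting schema complexity (field count, nesting depth)
# for a structured-extraction benchmark.  Handles object properties
# only; real schemas may also nest through arrays or definitions.

def schema_complexity(schema, depth=1):
    """Return (field_count, max_nesting_depth) for an object schema."""
    fields, max_depth = 0, depth
    for prop in schema.get("properties", {}).values():
        fields += 1
        if prop.get("type") == "object":
            f, d = schema_complexity(prop, depth + 1)
            fields += f
            max_depth = max(max_depth, d)
    return fields, max_depth

invoice = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "total": {"type": "number"},
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
            },
        },
    },
}
fields, depth = schema_complexity(invoice)   # (5, 2)
```

   Reporting these two numbers alongside throughput lets readers
   compare guided-generation overhead across schemas of different
   complexity.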
   Benchmarks SHOULD compare throughput with and without guided
   generation to show the overhead.  Output validation rates, meaning
   the percentage of outputs that conform to the schema, SHOULD be
   reported alongside performance metrics.

4.5.5.  JUDG: LLM-as-Judge / Evaluation

   The model evaluates another model's output against specified
   criteria and produces a score and optionally a justification.  This
   is used in production for quality assurance, A/B testing, content
   moderation review, and RLHF reward modeling.  The input includes
   the original prompt, the response to evaluate, and the scoring
   criteria.

   Classification Vector:

     +=====================+=========================================+
     | Dimension           | Value                                   |
     +=====================+=========================================+
     | I/O Ratio           | Prefill-Dominant (PD)                   |
     +---------------------+-----------------------------------------+
     | Output Constraint   | Structured (ST) to Semi-Structured (SS) |
     +---------------------+-----------------------------------------+
     | Latency Sensitivity | Throughput-Oriented (TO)                |
     +---------------------+-----------------------------------------+
     | Concurrency Pattern | Sustained-High (SH)                     |
     +---------------------+-----------------------------------------+
     | Prefix Sharing      | System-Prompt (S)                       |
     +---------------------+-----------------------------------------+

                                 Table 34

   Typical Token Distribution:

             +==============+==============================+
             | Parameter    | Range                        |
             +==============+==============================+
             | Input Length | 512 to 8192 tokens (criteria |
             |              | plus prompt plus response)   |
             +--------------+------------------------------+
             | Output       | 10 to 300 tokens (score plus |
             | Length       | reasoning)                   |
             +--------------+------------------------------+

                                 Table 35

   Key Performance Indicators:

   *  Evaluations per second

   *  Total request latency

   *  Throughput under sustained batch load

   LLM-as-Judge is a high-volume pipeline workload.  Benchmarks SHOULD
   simulate evaluation of a corpus of responses and SHOULD measure
   sustained throughput over thousands of evaluations.  The scoring
   criteria form a cacheable system prompt prefix.  Both score-only
   mode (minimal output) and score-with-reasoning mode (longer output)
   SHOULD be tested.

4.5.6.  RAGN: Retrieval-Augmented Generation

   The model receives retrieved context passages and a user query and
   generates a grounded response.  This profile differs from other
   prefill-heavy profiles in that the context is made of multiple
   retrieved chunks assembled together, and from LDQA in that the
   input context is shorter and is made of separately retrieved
   passages rather than a single contiguous document.

   Classification Vector:

     +=====================+=========================================+
     | Dimension           | Value                                   |
     +=====================+=========================================+
     | I/O Ratio           | Prefill-Dominant (PD) to Balanced (BA)  |
     +---------------------+-----------------------------------------+
     | Output Constraint   | Free-Form (FF)                          |
     +---------------------+-----------------------------------------+
     | Latency Sensitivity | Interactive (IN) to Near-Real-Time (NR) |
     +---------------------+-----------------------------------------+
     | Concurrency Pattern | Single-Stream (SI)                      |
     +---------------------+-----------------------------------------+
     | Prefix Sharing      | System-Prompt (S)                       |
     +---------------------+-----------------------------------------+

                                 Table 36

   Typical Token Distribution:

       +===================+=====================================+
       | Parameter         | Range                               |
       +===================+=====================================+
       | Retrieved Context | 512 to 8192 tokens (3 to 20 chunks) |
       +-------------------+-------------------------------------+
       | User Query        | 16 to 256 tokens                    |
       +-------------------+-------------------------------------+
       | Total Input       | 528 to 8448 tokens                  |
       +-------------------+-------------------------------------+
       | Output Length     | 64 to 1024 tokens                   |
       +-------------------+-------------------------------------+

                                 Table 37

   Key Performance Indicators:

   *  TTFT (the user is typically waiting interactively)

   *  Total request latency including time to first useful content

   *  End-to-end latency including retrieval if measured at the
      compound system boundary

   RAG benchmarks SHOULD vary the number and size of retrieved chunks
   to show how performance scales as context grows.  Configurations of
   3, 5, 10, and 20 retrieved chunks SHOULD be tested.

   The SUT boundary from [LLM-METHOD] is especially important for RAG.
   Measuring at the model engine boundary excludes retrieval latency;
   measuring at the compound system boundary includes it.  Both SHOULD
   be reported when the retrieval system is part of the SUT.

4.6.  Group E: Decode-Heavy Generative Profiles

   Decode-heavy profiles generate much more output than input.  The
   decode phase dominates total latency and compute.  These workloads
   stress memory bandwidth, ITL consistency, and sustained output
   token throughput.

   For Group E profiles, output token throughput and ITL distribution
   are the primary metrics.  TTFT is less important relative to total
   generation time.

4.6.1.  CGEN: Content Generation

   The model receives a short prompt or specification and generates
   long-form content.  Examples are blog posts, articles, marketing
   copy, emails, reports, and creative writing.  Output is much longer
   than input and is free-form.
   Classification Vector:

     +=====================+=========================================+
     | Dimension           | Value                                   |
     +=====================+=========================================+
     | I/O Ratio           | Decode-Dominant (DD)                    |
     +---------------------+-----------------------------------------+
     | Output Constraint   | Free-Form (FF)                          |
     +---------------------+-----------------------------------------+
     | Latency Sensitivity | Interactive (IN) to Near-Real-Time (NR) |
     +---------------------+-----------------------------------------+
     | Concurrency Pattern | Single-Stream (SI)                      |
     +---------------------+-----------------------------------------+
     | Prefix Sharing      | System-Prompt (S)                       |
     +---------------------+-----------------------------------------+

                                 Table 38

   Typical Token Distribution:

              +===============+====================+
              | Parameter     | Range              |
              +===============+====================+
              | Input Length  | 32 to 1024 tokens  |
              +---------------+--------------------+
              | Output Length | 256 to 8192 tokens |
              +---------------+--------------------+

                                 Table 39

   Key Performance Indicators:

   *  Output token throughput in tokens per second

   *  P50 and P99 ITL

   *  ITL consistency over long generation (stutter detection)

   *  Total generation time

   Long-form generation is the main test of sustained decode
   performance.  Benchmarks SHOULD measure ITL distribution across the
   full generation; measuring only the initial tokens is not
   sufficient.  Decode performance can degrade as the KV cache grows,
   so plotting ITL against output position is useful for detecting
   decode slowdown.  Output lengths of 256, 1024, 2048, and 4096 or
   more tokens SHOULD be tested.

4.6.2.  REAS: Reasoning / Chain-of-Thought

   The model performs extended step-by-step reasoning before producing
   a final answer.  This profile has grown in importance with
   reasoning-optimized models that produce explicit or hidden thinking
   tokens.  The input is short.
   The output is very long, often 10 to 100 times the length of the
   final answer, and most of it is intermediate reasoning steps.

   Classification Vector:

      +=====================+========================================+
      | Dimension           | Value                                  |
      +=====================+========================================+
      | I/O Ratio           | Decode-Dominant (DD)                   |
      +---------------------+----------------------------------------+
      | Output Constraint   | Free-Form (FF) to Semi-Structured (SS) |
      +---------------------+----------------------------------------+
      | Latency Sensitivity | Near-Real-Time (NR)                    |
      +---------------------+----------------------------------------+
      | Concurrency Pattern | Single-Stream (SI)                     |
      +---------------------+----------------------------------------+
      | Prefix Sharing      | System-Prompt (S)                      |
      +---------------------+----------------------------------------+

                                 Table 40

   Typical Token Distribution:

     +==================+=======================================+
     | Parameter        | Range                                 |
     +==================+=======================================+
     | Input Length     | 64 to 2048 tokens                     |
     +------------------+---------------------------------------+
     | Reasoning Output | 512 to 32768 tokens (thinking or CoT) |
     +------------------+---------------------------------------+
     | Final Answer     | 32 to 512 tokens                      |
     +------------------+---------------------------------------+
     | Total Output     | 544 to 33280 tokens                   |
     +------------------+---------------------------------------+

                                 Table 41

   Key Performance Indicators:

   *  Total generation time (may be 30 to 120 seconds or more)

   *  Output token throughput sustained over long generation

   *  ITL consistency over extended decode

   *  Time to final answer if thinking tokens are hidden

   Reasoning workloads produce the longest output sequences of any
   profile, which creates unique stress on KV cache management during
   decode.
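   The memory pressure from such sequences can be estimated with a
   standard KV cache sizing formula.  The model dimensions below are
   assumptions chosen for illustration, not any specific model's
   configuration:

```python
# Rough per-sequence KV cache sizing for long reasoning outputs.
# Assumes grouped-query attention with the illustrative dimensions
# below and 2-byte (fp16/bf16) cache elements.

def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """KV cache size for one sequence: K and V, per layer, per head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

# A 2K-token input plus a 32K-token reasoning budget:
total_gib = kv_cache_bytes(2048 + 32768) / (1024 ** 3)   # 4.25 GiB
```

   Under these assumptions a single request holds over 4 GiB of KV
   cache at peak, which illustrates why concurrent reasoning requests
   quickly become memory-bound rather than compute-bound.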
Hidden thinking tokens are a serving-system- specific feature. Benchmarks MUST distinguish between visible tokens and hidden thinking tokens if the serving system supports this. Serving systems that hide thinking tokens MUST document whether Mondal & Gaikwad Expires 12 October 2026 [Page 36] Internet-Draft LLM Benchmarking Profiles April 2026 max_output_tokens applies to visible tokens only, to total tokens, or to both. This MUST be reported in the run parameters defined in Section 4.1. Time to first visible token, which comes after reasoning completes, is a meaningful metric. It differs from standard TTFT and SHOULD be reported when thinking tokens are hidden. Benchmarks SHOULD test with reasoning budgets of 1K, 4K, 16K, and 32K tokens. Memory pressure during extended reasoning is a critical measurement. 4.6.3. DGEN: Data Generation / Synthetic Data The model receives a schema or specification. It generates multiple structured records. This is used for test data generation, training data augmentation, simulation, and content pipelines. The output is repetitive, structured, and produced at high volume. 
Classification Vector: +=====================+==========================+ | Dimension | Value | +=====================+==========================+ | I/O Ratio | Decode-Dominant (DD) | +---------------------+--------------------------+ | Output Constraint | Structured (ST) | +---------------------+--------------------------+ | Latency Sensitivity | Throughput-Oriented (TO) | +---------------------+--------------------------+ | Concurrency Pattern | Sustained-High (SH) | +---------------------+--------------------------+ | Prefix Sharing | System-Prompt (S) | +---------------------+--------------------------+ Table 42 Typical Token Distribution: +===============+===============================================+ | Parameter | Range | +===============+===============================================+ | Input Length | 128 to 2048 tokens (schema plus instructions) | +---------------+-----------------------------------------------+ | Output Length | 512 to 16384 tokens (multiple records) | +---------------+-----------------------------------------------+ Table 43 Key Performance Indicators: Mondal & Gaikwad Expires 12 October 2026 [Page 37] Internet-Draft LLM Benchmarking Profiles April 2026 * Output tokens per second (sustained) * Records generated per second * Throughput under high concurrency Data generation is a throughput-oriented profile. TTFT and ITL are not primary metrics here. Benchmarks SHOULD measure sustained output throughput over large generation batches of 10K or more records. Request latency is still measurable and MAY be reported as a secondary metric. Guided generation to a JSON or CSV schema is standard and SHOULD be enabled. Schema validation rate SHOULD be reported. 4.6.4. TRNS: Translation The model translates text from one natural language to another. The input-to-output ratio is approximately 1:1, varying by language pair. The full input content is consumed in producing the output. 
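The language-pair-dependent input/output token ratio can be reported per segment as in the following non-normative sketch. The whitespace tokenizer is a placeholder only; a real benchmark substitutes the serving model's tokenizer, since the ratio depends on tokenizer coverage of each language.

```python
def io_token_ratio(src_text, tgt_text, tokenize=str.split):
    """Report input/output token counts and their ratio for one
    translation segment.  `tokenize` is a stand-in for the serving
    model's actual tokenizer, which should be reported alongside
    the language pair."""
    n_in = len(tokenize(src_text))
    n_out = len(tokenize(tgt_text))
    return {
        "input_tokens": n_in,
        "output_tokens": n_out,
        "ratio": n_out / n_in if n_in else float("inf"),
    }
```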
Classification Vector: +=====================+==========================+ | Dimension | Value | +=====================+==========================+ | I/O Ratio | Balanced (BA) | +---------------------+--------------------------+ | Output Constraint | Free-Form (FF) | +---------------------+--------------------------+ | Latency Sensitivity | Near-Real-Time (NR) to | | | Throughput-Oriented (TO) | +---------------------+--------------------------+ | Concurrency Pattern | Single-Stream (SI) to | | | Sustained-High (SH) | +---------------------+--------------------------+ | Prefix Sharing | System-Prompt (S) | +---------------------+--------------------------+ Table 44 Typical Token Distribution: Mondal & Gaikwad Expires 12 October 2026 [Page 38] Internet-Draft LLM Benchmarking Profiles April 2026 +===============+=========================================+ | Parameter | Range | +===============+=========================================+ | Input Length | 32 to 8192 tokens | +---------------+-----------------------------------------+ | Output Length | 32 to 10000 tokens | +---------------+-----------------------------------------+ | I/O Ratio | 0.5:1 to 2:1 depending on language pair | +---------------+-----------------------------------------+ Table 45 Key Performance Indicators: * Total request latency by segment length * Output tokens per second * Throughput in segments translated per second Token-to-token ratio varies by language pair because tokenizer coverage differs across languages. Benchmarks MUST specify the language pair and tokenizer used. CJK languages typically produce more tokens per semantic unit than European languages. Benchmarks SHOULD test with at least two language pairs. This shows tokenizer effects. Segment lengths SHOULD span sentence- level (32 to 64 tokens), paragraph-level (128 to 512 tokens), and document-level (2000 or more tokens). 4.6.5. EDIT: Rewriting / Editing The model receives input text. 
It produces a transformed version of the same text. Examples are grammar correction, tone adjustment, style transfer, simplification, and paraphrasing. It differs from translation in that it works within the same language. It differs from summarization in that the output length is similar to the input length. Classification Vector: Mondal & Gaikwad Expires 12 October 2026 [Page 39] Internet-Draft LLM Benchmarking Profiles April 2026 +=====================+=====================+ | Dimension | Value | +=====================+=====================+ | I/O Ratio | Balanced (BA) | +---------------------+---------------------+ | Output Constraint | Free-Form (FF) | +---------------------+---------------------+ | Latency Sensitivity | Near-Real-Time (NR) | +---------------------+---------------------+ | Concurrency Pattern | Single-Stream (SI) | +---------------------+---------------------+ | Prefix Sharing | System-Prompt (S) | +---------------------+---------------------+ Table 46 Typical Token Distribution: +===============+======================================+ | Parameter | Range | +===============+======================================+ | Input Length | 64 to 4096 tokens | +---------------+--------------------------------------+ | Output Length | 64 to 4096 tokens (similar to input) | +---------------+--------------------------------------+ Table 47 Key Performance Indicators: * Total request latency * Output token throughput * TTFT if streamed to the user for review Rewriting produces output structurally similar to input. Some serving systems may use this similarity to enable speculative decoding or draft-based approaches. Benchmarks SHOULD report whether such optimizations are enabled. 4.7. Group F: Multi-Step Chained Profiles Multi-step profiles involve sequences of dependent inference calls. Each call depends on the output of the previous call. Total user- perceived latency is the sum of all step latencies. 
These workloads test end-to-end pipeline performance rather than single-request optimization. Mondal & Gaikwad Expires 12 October 2026 [Page 40] Internet-Draft LLM Benchmarking Profiles April 2026 For Group F profiles, the primary metric is end-to-end pipeline latency. Individual step TTFT and total latency contribute to this but are not sufficient on their own. 4.7.1. AGNT: Agentic / Tool Use The model operates in a loop. It receives an observation, decides on an action, receives the tool result, and continues until the task is complete. Each iteration is a separate inference call. The context grows across iterations as prior observations and actions accumulate. Classification Vector: +===================+=================================+ | Dimension | Value | +===================+=================================+ | I/O Ratio | Balanced (BA) per step | +-------------------+---------------------------------+ | Output Constraint | Structured (ST) per tool call; | | | Free-Form (FF) for final answer | +-------------------+---------------------------------+ | Latency | Near-Real-Time (NR) to | | Sensitivity | Interactive (IN) | +-------------------+---------------------------------+ | Concurrency | Chained-Sequential (CS) | | Pattern | | +-------------------+---------------------------------+ | Prefix Sharing | Turn-Accumulated (T) | +-------------------+---------------------------------+ Table 48 Typical Token Distribution (per step): +================+========================================+ | Parameter | Range | +================+========================================+ | Input Length | 512 to 16384 tokens (grows with steps) | +----------------+----------------------------------------+ | Output Length | 20 to 500 tokens per step | +----------------+----------------------------------------+ | Steps per task | 3 to 20 | +----------------+----------------------------------------+ Table 49 Key Performance Indicators: Mondal & Gaikwad Expires 12 October 2026 [Page 
41] Internet-Draft LLM Benchmarking Profiles April 2026 * End-to-end task completion time * Per-step TTFT (compounds across steps) * Per-step total latency * TTFT scaling as context accumulates across steps * Prefix cache hit rate across steps Agentic workloads are especially sensitive to per-step TTFT. Latencies compound across steps. A 500ms TTFT per step over 10 steps adds 5 seconds of TTFT alone to total task completion time. Benchmarks MUST simulate realistic multi-step trajectories. They MUST include tool call outputs injected between steps. Context grows across steps, so prefix caching across turns is critical. Benchmarks SHOULD test with task lengths of 3, 5, 10, and 20 steps. Both successful paths and long paths SHOULD be tested. 4.7.2. PLAN: Planning / Task Decomposition The model receives a high-level goal. It produces a structured plan with ordered steps, dependencies, and resource requirements. This is often the first step of an agentic pipeline. It may involve iterative refinement where the model reviews and revises its own plan. 
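For chained (Group F) profiles such as AGNT and PLAN, the compounding of per-step latencies into user-perceived task time can be made concrete with a small non-normative accounting helper; the per-step timings in the usage note are illustrative values, not measurements.

```python
def pipeline_latency(steps):
    """Accumulate end-to-end latency for a chained (Group F) task.
    Each step is (ttft_s, decode_s, tool_s); user-perceived latency
    is the sum over all steps, so per-step TTFT compounds."""
    total = ttft_total = 0.0
    for ttft, decode, tool in steps:
        total += ttft + decode + tool
        ttft_total += ttft
    return {"end_to_end_s": total, "ttft_contribution_s": ttft_total}
```

With ten steps of 500 ms TTFT, 1.0 s decode, and 200 ms tool time each, TTFT alone contributes 5 seconds of a 17-second task, matching the arithmetic given for AGNT.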
Classification Vector: +=====================+=======================================+ | Dimension | Value | +=====================+=======================================+ | I/O Ratio | Balanced (BA) to Decode-Dominant (DD) | +---------------------+---------------------------------------+ | Output Constraint | Semi-Structured (SS) | +---------------------+---------------------------------------+ | Latency Sensitivity | Near-Real-Time (NR) | +---------------------+---------------------------------------+ | Concurrency Pattern | Chained-Sequential (CS) | +---------------------+---------------------------------------+ | Prefix Sharing | Turn-Accumulated (T) | +---------------------+---------------------------------------+ Table 50 Typical Token Distribution: Mondal & Gaikwad Expires 12 October 2026 [Page 42] Internet-Draft LLM Benchmarking Profiles April 2026 +===================+========================================+ | Parameter | Range | +===================+========================================+ | Input Length | 128 to 4096 tokens (goal plus context) | +-------------------+----------------------------------------+ | Output Length | 256 to 2048 tokens (structured plan) | +-------------------+----------------------------------------+ | Refinement Rounds | 1 to 3 | +-------------------+----------------------------------------+ Table 51 Key Performance Indicators: * Total planning time including refinement rounds * Per-round latency * Output token throughput Planning with iterative refinement requires the model to evaluate and modify its own prior output. Benchmarks SHOULD test both single-pass planning and multi-round refinement. Plan quality is outside the scope of performance benchmarking. The time to produce a plan of a given length and structure is measurable and SHOULD be reported. 5. Profile-to-Metric Applicability Matrix The matrix below maps each profile to the applicable metrics from [LLM-TERMS]. 
"Primary" means the most important metric for that profile. "Applicable" means it is meaningful but secondary. "-" means it does not apply. TTFT and ITL are streaming-only metrics. They apply only when streaming is enabled per Section 7.2. When streaming is disabled, only total request latency is measurable. For Group F profiles, "Request Latency" means end-to-end pipeline latency across all steps. "TTFT" means per-step TTFT, which compounds across the pipeline. Request latency is still measurable in batch mode but is secondary to throughput metrics. Sub-table 1 lists applicability for TTFT, ITL, TPOT, and output token throughput. Mondal & Gaikwad Expires 12 October 2026 [Page 43] Internet-Draft LLM Benchmarking Profiles April 2026 +=========+============+============+============+==============+ | Profile | TTFT | ITL | TPOT | Output Tok/s | +=========+============+============+============+==============+ | *Group A: Non-Generative* | +---------+------------+------------+------------+--------------+ | EMBED | - | - | - | - | +---------+------------+------------+------------+--------------+ | XRANK | - | - | - | - | +---------+------------+------------+------------+--------------+ | LOGPR | - | - | - | - | +---------+------------+------------+------------+--------------+ | *Group B: Minimal Output* | +---------+------------+------------+------------+--------------+ | CLAS | Applicable | - | - | - | +---------+------------+------------+------------+--------------+ | SFQA | Primary | Applicable | - | - | +---------+------------+------------+------------+--------------+ | RANK | Applicable | - | - | - | +---------+------------+------------+------------+--------------+ | NER | Applicable | - | Applicable | - | +---------+------------+------------+------------+--------------+ | FUNC | Primary | Applicable | - | - | +---------+------------+------------+------------+--------------+ | SQLN | Applicable | - | - | - | 
+---------+------------+------------+------------+--------------+ | *Group C: Interactive* | +---------+------------+------------+------------+--------------+ | CHAT | Primary | Primary | Applicable | Applicable | +---------+------------+------------+------------+--------------+ | COMP | Primary | Applicable | - | - | +---------+------------+------------+------------+--------------+ | PERS | Primary | Primary | Applicable | Applicable | +---------+------------+------------+------------+--------------+ | *Group D: Prefill-Heavy* | +---------+------------+------------+------------+--------------+ | SUMM | Primary | Applicable | Applicable | Applicable | +---------+------------+------------+------------+--------------+ | LDQA | Primary | Applicable | - | - | +---------+------------+------------+------------+--------------+ | MDOC | Primary | Applicable | Applicable | Applicable | +---------+------------+------------+------------+--------------+ | EXTR | Applicable | - | Applicable | Applicable | +---------+------------+------------+------------+--------------+ | JUDG | Applicable | - | Applicable | - | +---------+------------+------------+------------+--------------+ | RAGN | Primary | Primary | Applicable | Applicable | +---------+------------+------------+------------+--------------+ | *Group E: Decode-Heavy* | Mondal & Gaikwad Expires 12 October 2026 [Page 44] Internet-Draft LLM Benchmarking Profiles April 2026 +---------+------------+------------+------------+--------------+ | CGEN | Applicable | Primary | Primary | Primary | +---------+------------+------------+------------+--------------+ | REAS | Applicable | Primary | Primary | Primary | +---------+------------+------------+------------+--------------+ | DGEN | - | - | Applicable | Primary | +---------+------------+------------+------------+--------------+ | TRNS | Applicable | Applicable | Primary | Primary | +---------+------------+------------+------------+--------------+ | EDIT | Applicable | Applicable | 
Primary | Applicable | +---------+------------+------------+------------+--------------+ | *Group F: Multi-Step* | +---------+------------+------------+------------+--------------+ | AGNT | Primary | Applicable | - | - | +---------+------------+------------+------------+--------------+ | PLAN | Applicable | Applicable | Applicable | Applicable | +---------+------------+------------+------------+--------------+ Table 52 Sub-table 2 lists applicability for request latency, request throughput, sequence throughput, and prefill throughput. +=========+=============+============+=========+===============+ | Profile | Req Latency | Req/s | Seq/s | Prefill Tok/s | +=========+=============+============+=========+===============+ | *Group A: Non-Generative* | +---------+-------------+------------+---------+---------------+ | EMBED | Primary | Primary | Primary | Applicable | +---------+-------------+------------+---------+---------------+ | XRANK | Primary | Primary | Primary | Applicable | +---------+-------------+------------+---------+---------------+ | LOGPR | Primary | Applicable | Primary | Primary | +---------+-------------+------------+---------+---------------+ | *Group B: Minimal Output* | +---------+-------------+------------+---------+---------------+ | CLAS | Primary | Primary | - | Applicable | +---------+-------------+------------+---------+---------------+ | SFQA | Primary | Applicable | - | - | +---------+-------------+------------+---------+---------------+ | RANK | Primary | Primary | - | Applicable | +---------+-------------+------------+---------+---------------+ | NER | Primary | Primary | - | Applicable | +---------+-------------+------------+---------+---------------+ | FUNC | Primary | Applicable | - | Applicable | +---------+-------------+------------+---------+---------------+ | SQLN | Primary | Applicable | - | Applicable | +---------+-------------+------------+---------+---------------+ Mondal & Gaikwad Expires 12 October 2026 [Page 45] 
Internet-Draft LLM Benchmarking Profiles April 2026 | *Group C: Interactive* | +---------+-------------+------------+---------+---------------+ | CHAT | Applicable | - | - | - | +---------+-------------+------------+---------+---------------+ | COMP | Primary | - | - | Applicable | +---------+-------------+------------+---------+---------------+ | PERS | Applicable | - | - | Applicable | +---------+-------------+------------+---------+---------------+ | *Group D: Prefill-Heavy* | +---------+-------------+------------+---------+---------------+ | SUMM | Primary | Applicable | - | Primary | +---------+-------------+------------+---------+---------------+ | LDQA | Primary | Applicable | - | Primary | +---------+-------------+------------+---------+---------------+ | MDOC | Primary | - | - | Primary | +---------+-------------+------------+---------+---------------+ | EXTR | Primary | Primary | - | Primary | +---------+-------------+------------+---------+---------------+ | JUDG | Primary | Primary | - | Applicable | +---------+-------------+------------+---------+---------------+ | RAGN | Primary | - | - | Applicable | +---------+-------------+------------+---------+---------------+ | *Group E: Decode-Heavy* | +---------+-------------+------------+---------+---------------+ | CGEN | Applicable | - | - | - | +---------+-------------+------------+---------+---------------+ | REAS | Primary | - | - | - | +---------+-------------+------------+---------+---------------+ | DGEN | Applicable | Primary | - | - | +---------+-------------+------------+---------+---------------+ | TRNS | Primary | Applicable | - | - | +---------+-------------+------------+---------+---------------+ | EDIT | Primary | - | - | - | +---------+-------------+------------+---------+---------------+ | *Group F: Multi-Step* | +---------+-------------+------------+---------+---------------+ | AGNT | Primary | - | - | Applicable | +---------+-------------+------------+---------+---------------+ | PLAN | 
Primary | - | - | - | +---------+-------------+------------+---------+---------------+ Table 53 Mondal & Gaikwad Expires 12 October 2026 [Page 46] Internet-Draft LLM Benchmarking Profiles April 2026 6. Profile Composition for Mixed Workloads Production LLM deployments rarely serve a single profile. A typical deployment might serve 60 percent CHAT, 20 percent RAGN, 10 percent SUMM, and 10 percent FUNC traffic at the same time. Benchmarking with mixed profiles is needed to understand real- world performance. 6.1. Weighted Profile Combinations A mixed workload specification MUST include the following: the set of profiles included; the traffic weight for each profile as a percentage of total requests or total tokens; and whether weights are in request count or token volume. Example mixed workload specification: +=========+================+========================+ | Profile | Request Weight | Token Weight (approx.) | +=========+================+========================+ | CHAT | 60% | 45% | +---------+----------------+------------------------+ | RAGN | 20% | 30% | +---------+----------------+------------------------+ | SUMM | 10% | 20% | +---------+----------------+------------------------+ | FUNC | 10% | 5% | +---------+----------------+------------------------+ Table 54 Request weight and token weight differ because profiles have different average token counts per request. Both SHOULD be reported. 6.2. Reporting Requirements for Mixed Workloads When benchmarking mixed workloads, the following apply. Overall aggregate metrics including total throughput and aggregate latency distribution MUST be reported. Per-profile metrics MUST be reported separately. A mixed workload benchmark that reports only aggregate metrics hides profile-specific performance characteristics. The interaction between profiles MUST be described. For example, long-running REAS requests may increase queuing delay for latency- sensitive CHAT requests under continuous batching. 
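As a non-normative illustration of the relationship between request weight and token weight in Table 54, token weights can be derived from request weights and per-profile mean token counts. The mean token counts below are illustrative values chosen to reproduce the example split, not recommended parameters.

```python
def token_weights(request_weights, mean_tokens):
    """Convert per-profile request weights (fractions summing to 1)
    into token-volume weights, given mean tokens per request."""
    volume = {p: request_weights[p] * mean_tokens[p]
              for p in request_weights}
    total = sum(volume.values())
    return {p: v / total for p, v in volume.items()}
```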
Mondal & Gaikwad Expires 12 October 2026 [Page 47] Internet-Draft LLM Benchmarking Profiles April 2026 Scheduling fairness metrics from [LLM-METHOD] SHOULD be applied to mixed workloads. This shows whether any profile has worse latency than expected. The benchmark run parameters from Section 4.1 MUST be reported for the full mixed workload run. When parameters differ across profiles within the same run, the per-profile overrides MUST be reported. 7. Cross-Cutting Dimensions The following dimensions apply to any profile in Section 4. They describe deployment or configuration choices. They are not task characteristics. 7.1. Batch vs. Online Any profile can be run in batch mode or online mode. Batch mode processes a queue of requests with throughput as the priority. Online mode serves interactive requests with latency as the priority. The profile definition gives the typical latency sensitivity. Deployers MAY run any profile in either mode. Benchmark reports MUST specify the deployment mode. Online mode uses open-loop load generation with a specified arrival rate. Latency metrics are primary. Batch mode uses closed-loop operation with maximum concurrency. Throughput metrics are primary. 7.2. Streaming vs. Non-Streaming Generative profiles in Groups B through F can be served with or without streaming. Streaming delivers tokens as they are generated. Non-streaming returns the complete response at once. With streaming, TTFT and ITL are measurable and meaningful. Without streaming, only total request latency is measurable. The streaming mode MUST be reported. Profiles with Interactive latency sensitivity (Section 3.3) SHOULD be tested with streaming enabled. Where the matrix in Section 5 marks TTFT or ITL as Primary or Applicable, those metrics apply only when streaming is enabled. Mondal & Gaikwad Expires 12 October 2026 [Page 48] Internet-Draft LLM Benchmarking Profiles April 2026 7.3. 
Fill-in-the-Middle Fill-in-the-middle (FIM) is a variant that applies primarily to the COMP profile. In FIM, the model generates tokens to fill a gap within existing text rather than appending to the end. FIM changes the prefill pattern because the model receives both a prefix and a suffix context. FIM is not a separate profile. When a profile is tested in FIM mode, this MUST be reported. FIM- capable serving systems may show different prefill performance because of the bidirectional context. 8. Security Considerations 8.1. Workload-Specific Attack Surfaces Different profiles have different security implications. Decode-heavy profiles in Group E can be used as resource exhaustion vectors by requesting maximum output lengths. Benchmark setups MUST enforce output length limits consistent with the production configuration. Multi-step profiles in Group F may accumulate context across steps. Benchmark configurations MUST specify maximum steps and total context limits. Structured output profiles may be vulnerable to schema injection attacks. These cause excessive guided generation backtracking. Benchmarks SHOULD use representative schemas rather than adversarially crafted ones. 8.2. Profile Gaming Systems may be tuned for specific profiles at the expense of others. Single-profile benchmarks can be misleading. Benchmark reports SHOULD include results across multiple profiles that represent the intended production workload mix. 8.3. Data Privacy in Reference Datasets Reference datasets for benchmarking (Appendix A) SHOULD NOT contain personally identifiable information (PII). When using conversation datasets such as those for the CHAT profile, data MUST be anonymized or synthetic. Mondal & Gaikwad Expires 12 October 2026 [Page 49] Internet-Draft LLM Benchmarking Profiles April 2026 9. IANA Considerations This document has no IANA actions. 10. References 10.1. 
Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, May 2017.

   [LLM-TERMS]
              Gaikwad, M., "Benchmarking Terminology for Large Language
              Model Serving", Work in Progress, Internet-Draft,
              draft-gaikwad-llm-benchmarking-terminology-00,
              January 2026.

   [LLM-METHOD]
              Gaikwad, M., "Benchmarking Methodology for Large Language
              Model Serving", Work in Progress, Internet-Draft,
              draft-gaikwad-llm-benchmarking-methodology-00,
              January 2026.

10.2.  Informative References

   [SHAREGPT] "ShareGPT52K", 2023.

   [LMSYS]    Zheng, L., Chiang, W., Sheng, Y., Li, T., et al.,
              "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation
              Dataset", arXiv 2309.11998, September 2023.

Appendix A.  Reference Datasets for Each Profile

   This appendix lists recommended datasets for building workloads for
   each profile.  Implementers MAY use other datasets.  They MUST
   report the dataset used and its token length statistics.

Mondal & Gaikwad        Expires 12 October 2026                [Page 50]

Internet-Draft         LLM Benchmarking Profiles              April 2026

   Implementers are responsible for checking dataset license terms
   before use.  The datasets listed here are referenced for their
   technical relevance.  No license terms are implied or endorsed.  The
   PII requirements in Section 8.3 apply regardless of which dataset is
   chosen.

   For the PERS profile, the recommended source is synthetic persona
   datasets.  Implementers MUST describe the synthetic dataset used.
   At minimum they MUST report the system prompt token length
   distribution (mean, P50, P90, P99), the vocabulary and domain
   coverage, and the generation method or tool used.
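The required token length distribution statistics (mean, P50, P90, P99) can be computed with a non-normative nearest-rank sketch such as the following; the percentile convention is an assumption, and any documented convention may be substituted.

```python
import statistics

def length_stats(token_lengths):
    """Report the token length distribution statistics (mean, P50,
    P90, P99) that this appendix requires for dataset
    characterization, using nearest-rank percentiles."""
    xs = sorted(token_lengths)

    def pct(p):
        # nearest-rank percentile on the sorted sample
        k = min(len(xs) - 1, max(0, int(round(p / 100 * len(xs))) - 1))
        return xs[k]

    return {"mean": statistics.mean(xs),
            "p50": pct(50), "p90": pct(90), "p99": pct(99)}
```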
+=========+========================+==========================+ | Profile | Recommended Dataset(s) | Notes | +=========+========================+==========================+ | EMBED | MTEB benchmark | Query and passage length | | | passages; MS MARCO | variants | | | passages | | +---------+------------------------+--------------------------+ | XRANK | MS MARCO query-passage | Candidate sets of 10 to | | | pairs | 100 | +---------+------------------------+--------------------------+ | LOGPR | OpenWebText; The Pile | Representative web text | | | subsets | | +---------+------------------------+--------------------------+ | CLAS | SST-2; AG News; | Binary, multi-class, | | | GoEmotions | multi-label variants | +---------+------------------------+--------------------------+ | SFQA | Natural Questions; | Short-context factoid | | | TriviaQA | questions | +---------+------------------------+--------------------------+ | RANK | MS MARCO; TREC-DL | Relevance judgment tasks | +---------+------------------------+--------------------------+ | NER | CoNLL-2003; OntoNotes | Standard NER benchmarks | +---------+------------------------+--------------------------+ | FUNC | Berkeley Function | Varied tool schema | | | Calling Leaderboard | complexity | +---------+------------------------+--------------------------+ | SQLN | Spider; WikiSQL | Simple and complex | | | | schema variants | +---------+------------------------+--------------------------+ | CHAT | ShareGPT; LMSYS-Chat- | Real multi-turn | | | 1M | conversations; MUST be | | | | anonymized | +---------+------------------------+--------------------------+ | COMP | The Stack; HumanEval | Real code files with | | | | cursor positions | +---------+------------------------+--------------------------+ Mondal & Gaikwad Expires 12 October 2026 [Page 51] Internet-Draft LLM Benchmarking Profiles April 2026 | PERS | Synthetic persona | Large system prompts (2K | | | datasets | to 10K tokens); see | | | | characterization | | | | 
requirements above | +---------+------------------------+--------------------------+ | SUMM | CNN/DailyMail; arXiv; | Short and long document | | | GovReport | variants | +---------+------------------------+--------------------------+ | LDQA | NarrativeQA; QuALITY; | Long-document QA | | | Scrolls | benchmarks | +---------+------------------------+--------------------------+ | MDOC | Multi-XScience; | Multi-source synthesis | | | WikiSum | | +---------+------------------------+--------------------------+ | EXTR | SQuAD extractive; | Schema-constrained | | | CORD; SROIE | extraction | +---------+------------------------+--------------------------+ | JUDG | MT-Bench judgments; | LLM evaluation datasets | | | Arena-Hard | | +---------+------------------------+--------------------------+ | RAGN | Natural Questions plus | Realistic retrieved | | | retriever output | chunk assembly | +---------+------------------------+--------------------------+ | CGEN | WritingPrompts; Alpaca | Creative and | | | instructions | instructional generation | +---------+------------------------+--------------------------+ | REAS | GSM8K; MATH; ARC- | Problems requiring | | | Challenge | extended reasoning | +---------+------------------------+--------------------------+ | DGEN | Spider schema | Schema-directed | | | definitions; | generation; implementers | | | synthetically | MUST document schema | | | generated JSON schemas | generation method | +---------+------------------------+--------------------------+ | TRNS | WMT translation | Multiple language pairs | | | benchmarks | | +---------+------------------------+--------------------------+ | EDIT | JFLEG; IteraTeR | Grammar correction and | | | | style transfer | +---------+------------------------+--------------------------+ | AGNT | SWE-bench; WebArena | Multi-step tool use | | | trajectories | traces | +---------+------------------------+--------------------------+ | PLAN | ALFWorld; HotpotQA | Planning and | | | decompositions | 
   |         |                        | decomposition tasks      |
   +---------+------------------------+--------------------------+

                                 Table 55

Appendix B.  Profile Parameter Quick Reference Table

   Sub-table 1 lists the profile identity, group, I/O ratio, and output
   constraint.

   +================+=======+=======+===========+===================+
   | Profile        | Code  | Group | I/O Ratio | Output Constraint |
   +================+=======+=======+===========+===================+
   | Embedding      | EMBED | A     | PO        | NT                |
   | Generation     |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Cross-Encoder  | XRANK | A     | PO        | NT                |
   | Reranking      |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Logprob        | LOGPR | A     | PO        | NT                |
   | Scoring        |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Classification | CLAS  | B     | PD        | MI                |
   +----------------+-------+-------+-----------+-------------------+
   | Short-Form QA  | SFQA  | B     | BA-PD     | FF                |
   +----------------+-------+-------+-----------+-------------------+
   | Generative     | RANK  | B     | PD        | ST-MI             |
   | Ranking        |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Entity         | NER   | B     | PD        | ST                |
   | Recognition    |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Function       | FUNC  | B     | PD        | ST                |
   | Calling        |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | SQL Generation | SQLN  | B     | PD        | ST                |
   +----------------+-------+-------+-----------+-------------------+
   | Conversational | CHAT  | C     | BA        | FF                |
   | Chat           |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Code           | COMP  | C     | PD-BA     | SS                |
   | Completion     |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Persona        | PERS  | C     | BA        | FF                |
   | Dialogue       |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Summarization  | SUMM  | D     | PD        | FF-SS             |
   +----------------+-------+-------+-----------+-------------------+
   | Long Document  | LDQA  | D     | PD        | FF                |
   | QA             |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Multi-Doc      | MDOC  | D     | PD        | SS                |
   | Synthesis      |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Structured     | EXTR  | D     | PD        | ST                |
   | Extraction     |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | LLM-as-Judge   | JUDG  | D     | PD        | ST-SS             |
   +----------------+-------+-------+-----------+-------------------+
   | RAG            | RAGN  | D     | PD-BA     | FF                |
   +----------------+-------+-------+-----------+-------------------+
   | Content        | CGEN  | E     | DD        | FF                |
   | Generation     |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Reasoning /    | REAS  | E     | DD        | FF-SS             |
   | CoT            |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Data           | DGEN  | E     | DD        | ST                |
   | Generation     |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Translation    | TRNS  | E     | BA        | FF                |
   +----------------+-------+-------+-----------+-------------------+
   | Rewriting /    | EDIT  | E     | BA        | FF                |
   | Editing        |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Agentic / Tool | AGNT  | F     | BA/step   | ST+FF             |
   | Use            |       |       |           |                   |
   +----------------+-------+-------+-----------+-------------------+
   | Planning       | PLAN  | F     | BA-DD     | SS                |
   +----------------+-------+-------+-----------+-------------------+

                                 Table 56

   Sub-table 2 repeats the profile key and lists latency class,
   concurrency, prefix sharing, and token ranges.
   +=========+=========+=============+=========+=========+==========+
   | Profile | Latency | Concurrency | Prefix  | Input   | Output   |
   |         | Class   |             | Sharing | Tokens  | Tokens   |
   +=========+=========+=============+=========+=========+==========+
   | EMBED   | NR-TO   | SH          | N       | 16-8K   | N/A      |
   |         |         |             |         |         | (vector) |
   +---------+---------+-------------+---------+---------+----------+
   | XRANK   | NR      | BB          | D       | 128-1K  | N/A      |
   |         |         |             |         |         | (score)  |
   +---------+---------+-------------+---------+---------+----------+
   | LOGPR   | TO      | SH          | N       | 64-4K   | N/A      |
   |         |         |             |         |         | (probs)  |
   +---------+---------+-------------+---------+---------+----------+
   | CLAS    | NR-TO   | SH          | S       | 32-2K   | 1-10     |
   +---------+---------+-------------+---------+---------+----------+
   | SFQA    | IN-NR   | SI          | S       | 16-512  | 5-100    |
   +---------+---------+-------------+---------+---------+----------+
   | RANK    | NR      | BB          | D       | 256-4K  | 5-50     |
   +---------+---------+-------------+---------+---------+----------+
   | NER     | TO      | SH          | S       | 64-2K   | 20-200   |
   +---------+---------+-------------+---------+---------+----------+
   | FUNC    | NR      | SI          | H       | 512-4K  | 20-200   |
   +---------+---------+-------------+---------+---------+----------+
   | SQLN    | NR      | SI-BB       | H       | 256-8K  | 20-300   |
   +---------+---------+-------------+---------+---------+----------+
   | CHAT    | IN      | SI          | T       | 64-16K  | 50-1K    |
   +---------+---------+-------------+---------+---------+----------+
   | COMP    | IN      | SI          | D       | 256-8K  | 5-200    |
   +---------+---------+-------------+---------+---------+----------+
   | PERS    | IN      | SI          | S+T     | 2K-10K  | 100-2K   |
   |         |         |             |         | sys     |          |
   +---------+---------+-------------+---------+---------+----------+
   | SUMM    | NR-TO   | SI-SH       | N       | 1K-32K  | 64-1K    |
   +---------+---------+-------------+---------+---------+----------+
   | LDQA    | NR      | SI-BB       | D       | 4K-128K | 16-512   |
   +---------+---------+-------------+---------+---------+----------+
   | MDOC    | TO      | SI          | N       | 8K-128K | 256-4K   |
   +---------+---------+-------------+---------+---------+----------+
   | EXTR    | NR-TO   | SH          | S       | 256-16K | 50-500   |
   +---------+---------+-------------+---------+---------+----------+
   | JUDG    | TO      | SH          | S       | 512-8K  | 10-300   |
   +---------+---------+-------------+---------+---------+----------+
   | RAGN    | IN-NR   | SI          | S       | 528-8K  | 64-1K    |
   +---------+---------+-------------+---------+---------+----------+
   | CGEN    | IN-NR   | SI          | S       | 32-1K   | 256-8K   |
   +---------+---------+-------------+---------+---------+----------+
   | REAS    | NR      | SI          | S       | 64-2K   | 544-33K  |
   +---------+---------+-------------+---------+---------+----------+
   | DGEN    | TO      | SH          | S       | 128-2K  | 512-16K  |
   +---------+---------+-------------+---------+---------+----------+
   | TRNS    | NR-TO   | SI-SH       | S       | 32-8K   | 32-10K   |
   +---------+---------+-------------+---------+---------+----------+
   | EDIT    | NR      | SI          | S       | 64-4K   | 64-4K    |
   +---------+---------+-------------+---------+---------+----------+
   | AGNT    | NR-IN   | CS          | T       | 512-16K | 20-500/  |
   |         |         |             |         |         | step     |
   +---------+---------+-------------+---------+---------+----------+
   | PLAN    | NR      | CS          | T       | 128-4K  | 256-2K   |
   +---------+---------+-------------+---------+---------+----------+

                                 Table 57

Acknowledgements

   The authors thank the IETF BMWG community for its foundational work
   in benchmarking methodology.

Authors' Addresses

   Mohadeb Mondal
   Independent
   Email: mohadeb.mondal@gmail.com

   Madhava Gaikwad
   Independent
   Email: gaikwad.madhav@gmail.com

Mondal & Gaikwad          Expires 12 October 2026              [Page 56]