Benchmarking Workload Profiles for Large Language Model Serving

Internet-Draft	LLM Benchmarking Profiles	April 2026
Mondal & Gaikwad	Expires 12 October 2026	[Page]

Abstract

This document defines standard workload profiles for benchmarking Large Language Model (LLM) inference serving systems. Each profile represents one class of production workload. A profile is defined by its input and output token distribution, output structure, latency sensitivity, concurrency pattern, and caching behavior. Profiles are organized into six groups: Non-Generative, Minimal Output, Interactive Streaming, Prefill-Heavy Generative, Decode-Heavy Generative, and Multi-Step Chained. This document works with "Benchmarking Methodology for Large Language Model Serving" [LLM-METHOD]. It specifies the workloads that the methodology's tests SHOULD be run against.¶

4. Profile Groups and Definitions

This section defines 25 workload profiles in six groups. Section 4.1 specifies common benchmark run parameters that apply to all profiles. Sections 4.2 through 4.7 define the profiles.¶

Each profile includes the following.¶

Description:: What the task is and why it is a distinct benchmarking profile.¶
Classification Vector:: The five-dimensional classification from Section 3.¶
Typical Token Distribution:: Representative input and output token length ranges.¶
Key Performance Indicators:: The most important metrics from [LLM-TERMS] for this profile.¶
Benchmarking Considerations:: Profile-specific setup, measurement, or reporting guidance.¶

4.1. Common Benchmark Run Parameters

Benchmark results are only comparable when the configuration is fully specified. The following parameters MUST be reported for every benchmark run. They apply in addition to any profile- specific requirements in Sections 4.2 through 4.7.¶

Table 1
Parameter	Requirement
Model identifier and version	MUST report exact model name and version string
Tokenizer name and version	MUST report; tokenizer affects all token count measurements
Decoding strategy	MUST report: greedy, sampling, or beam search
Temperature	MUST report if sampling; N/A for greedy
top_p	MUST report if sampling
top_k	MUST report if sampling
max_output_tokens	MUST report; specify whether limit applies to visible tokens, total tokens, or both
Stop sequences	MUST report all stop conditions used
Streaming enabled	MUST report: enabled or disabled
Server token flush policy	MUST report if streaming enabled (e.g., per-token, batched-N, time-based)
Guided decoding enabled	MUST report: enabled or disabled
Guided decoding method	MUST report if enabled (e.g., grammar-based, JSON schema, logit masking)
Guided decoding schema size	MUST report if enabled: schema token count and nesting depth
Prefix cache enabled	MUST report: enabled or disabled
Prefix cache type	MUST report if enabled (e.g., block-level KV cache, full-prefix cache)
Prefix cache eviction policy	MUST report if enabled (e.g., LRU, TTL-based, capacity limit)
Hardware configuration	MUST report: accelerator type, count, and memory capacity
Serving framework and version	MUST report
Quantization	MUST report: precision (e.g., fp16, int8, int4) and method if applicable

Implementers SHOULD also report the version of any load generation tool used. They SHOULD report the network topology between the load generator and the SUT.¶

4.2. Group A: Non-Generative Profiles

Non-generative profiles produce no output token sequence. The inference pipeline runs only the prefill pass. It returns a non-token result. This result is a vector embedding, a scalar score, or a per-token probability distribution.¶

There is no decode phase. The following metrics from [LLM-TERMS] do not apply to Group A profiles.¶

Time to First Token (TTFT)¶
Inter-Token Latency (ITL)¶
Output Token Throughput¶
Time Per Output Token (TPOT)¶

Group A profiles MUST instead be measured using the following.¶

Request Latency:: Total time from request submission to result receipt.¶
Sequences Per Second:: Number of complete inference requests processed per second under the specified concurrency. For Group A profiles, one sequence equals one request. Seq/s and Req/s are the same here. When a single API call submits multiple sequences in a batch, throughput MUST be reported as sequences per second, not API calls per second.¶
Prefill Throughput:: Input tokens processed per second.¶

4.2.1. EMBED: Embedding Generation

The model encodes input text into a fixed-size dense vector. This vector is used for semantic search, retrieval, clustering, and similarity computation. This is one of the highest-volume LLM inference workloads in production. Every RAG pipeline, recommendation system, and semantic search index depends on embedding generation.¶

Classification Vector:¶

Table 2
Dimension	Value
I/O Ratio	Prefill-Only (PO)
Output Constraint	Non-Token (NT)
Latency Sensitivity	Near-Real-Time (NR) to Throughput-Oriented (TO)
Concurrency Pattern	Sustained-High (SH)
Prefix Sharing	None (N)

Typical Token Distribution:¶

Table 3
Parameter	Range
Input Length	16 to 8192 tokens
Output	Fixed-size vector (e.g., 768, 1024, 1536, or 3072 dimensions)

Key Performance Indicators:¶

Sequences per second at target batch size¶
P50 and P99 request latency under load¶
Prefill throughput in input tokens per second¶
Throughput scaling with batch size¶

Embedding workloads work well with large batches. Benchmarks SHOULD measure throughput across batch sizes of 1, 8, 32, 64, 128, and 256. This shows batching efficiency. Input length SHOULD come from a representative corpus. Real embedding workloads have high length variance because short queries and long passages are both common. Benchmarks SHOULD report latency and throughput separately for query-length inputs (fewer than 64 tokens) and passage-length inputs (256 to 8192 tokens). These two cases often serve different stages of a retrieval pipeline.¶

4.2.2. XRANK: Cross-Encoder Reranking

The model receives a query and a document. It produces a relevance score without generating tokens. Cross- encoder reranking is used in search pipelines to refine initial retrieval results. The model processes the concatenated query and document pair. It outputs a scalar relevance score.¶

Classification Vector:¶

Table 4
Dimension	Value
I/O Ratio	Prefill-Only (PO)
Output Constraint	Non-Token (NT)
Latency Sensitivity	Near-Real-Time (NR)
Concurrency Pattern	Burst-Batch (BB)
Prefix Sharing	Document-Shared (D)

Typical Token Distribution:¶

Table 5
Parameter	Range
Input Length	128 to 1024 tokens (query plus document)
Output	Single scalar score

Key Performance Indicators:¶

Pairs scored per second at target batch size¶
P50 and P99 request latency per scoring batch¶
Throughput scaling as candidate set size grows¶

Cross-encoder reranking typically processes a batch of candidates. This is commonly 10 to 100 query-document pairs. The query is shared across all pairs. Benchmarks SHOULD measure performance with the shared query prefix to capture KV cache benefits. Candidate set sizes of 10, 25, 50, and 100 SHOULD be tested. Latency SHOULD be reported for the full candidate set. This is the time to score all candidates, not just one pair. This matches real search pipeline requirements.¶

4.2.3. LOGPR: Logprob / Perplexity Scoring

The model processes input text. It returns per- token log probabilities or a sequence-level perplexity score. It does not generate new tokens. This is used for AI-generated content detection, model evaluation, data quality filtering, watermark detection, and calibration. The model runs a forward pass over the input. It returns the probability distribution at each token position.¶

Classification Vector:¶

Table 6
Dimension	Value
I/O Ratio	Prefill-Only (PO)
Output Constraint	Non-Token (NT)
Latency Sensitivity	Throughput-Oriented (TO)
Concurrency Pattern	Sustained-High (SH)
Prefix Sharing	None (N)

Typical Token Distribution:¶

Table 7
Parameter	Range
Input Length	64 to 4096 tokens
Output	Per-token logprobs or aggregate perplexity score

Key Performance Indicators:¶

Sequences scored per second¶
Input tokens processed per second¶
P50 and P99 request latency by input length bucket¶

Logprob scoring is computationally the same as the prefill phase of generative inference. The difference is that it returns probability data instead of starting a decode phase. It differs from embedding in one key way. The full vocabulary probability distribution may be returned at each position. This creates large output data volume even though no tokens are generated. Benchmarks SHOULD specify whether full vocabulary logprobs or only top-k logprobs are returned. This choice significantly affects data transfer overhead. Input length SHOULD span the full range to show prefill scaling.¶

4.3. Group B: Minimal Output Profiles

Minimal output profiles generate a small number of tokens. The output is a label, a score, a short answer, or a small structured object. The decode phase is short relative to prefill. These workloads are typically prefill-dominant and work well with batching.¶

For Group B profiles, the metric priorities differ from streaming workloads.¶

Total Request Latency is the primary latency metric. TTFT and ITL are not primary here.¶

Request Throughput in requests per second is more useful than output token throughput.¶

TTFT is still measurable. It is approximately equal to total latency because the decode phase is very short.¶

4.3.1. CLAS: Classification and Labeling

The model reads input text. It produces a categorical label or set of labels from a predefined taxonomy. Examples are sentiment analysis, content moderation, topic categorization, intent detection, and spam filtering.¶

Classification Vector:¶

Table 8
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Minimal (MI)
Latency Sensitivity	Near-Real-Time (NR) to Throughput-Oriented (TO)
Concurrency Pattern	Sustained-High (SH)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 9
Parameter	Range
Input Length	32 to 2048 tokens
Output Length	1 to 10 tokens

Key Performance Indicators:¶

Requests classified per second¶
P50 and P99 total request latency¶
Throughput scaling with batch size¶

Classification workloads often use constrained decoding or guided generation to restrict output to valid labels. Benchmarks SHOULD report whether guided generation is enabled. They SHOULD report which method is used, such as grammar-based or logit masking. System prompts typically include the label taxonomy. They SHOULD be included in the benchmark setup. The label set size SHOULD be specified. Binary, multi-class, and multi-label variants have different guided generation overhead.¶

4.3.2. SFQA: Short-Form QA / Factoid

The model receives a short question. It may also receive brief context. It produces a concise factual answer. Examples are customer support FAQ, knowledge base queries, and simple information retrieval. It differs from long-document QA (LDQA) in that the input context is short.¶

Classification Vector:¶

Table 10
Dimension	Value
I/O Ratio	Balanced (BA) to Prefill-Dominant (PD)
Output Constraint	Free-Form (FF)
Latency Sensitivity	Interactive (IN) to Near-Real-Time (NR)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 11
Parameter	Range
Input Length	16 to 512 tokens
Output Length	5 to 100 tokens

Key Performance Indicators:¶

TTFT if streamed¶
Total request latency¶
Requests per second under concurrency¶

This profile represents the simple API call pattern. It is the baseline for many LLM applications. It SHOULD be used as a reference workload to establish baseline system performance before testing more complex profiles. Input length SHOULD include very short queries (fewer than 32 tokens) and context- augmented queries (256 to 512 tokens).¶

4.3.3. RANK: Generative Ranking / Scoring

The model evaluates one or more candidates. It produces a relevance score, a preference ordering, or a comparative judgment as generated tokens. It differs from cross- encoder reranking (XRANK) in that it generates token output such as scores, explanations, or ordinal labels. XRANK produces a raw scalar from the model's hidden state.¶

Classification Vector:¶

Table 12
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Structured (ST) to Minimal (MI)
Latency Sensitivity	Near-Real-Time (NR)
Concurrency Pattern	Burst-Batch (BB)
Prefix Sharing	Document-Shared (D)

Typical Token Distribution:¶

Table 13
Parameter	Range
Input Length	256 to 4096 tokens (prompt plus candidates)
Output Length	5 to 50 tokens (scores or ranking)

Key Performance Indicators:¶

Time to score the full candidate set¶
Requests per second¶
Prefix cache hit rate across candidates sharing context¶

Ranking workloads have strong prefix sharing. When ranking N candidates against a shared query or document, the shared prefix may be 80 to 90 percent of total input tokens. Benchmarks MUST report prefix cache hit rates. They SHOULD compare performance with and without prefix caching. Candidate set sizes of 5, 10, 20, and 50 SHOULD be tested.¶

4.3.4. NER: Entity Resolution and Recognition

The model reads input text. It identifies named entities and produces a structured list of entity mentions with their types and positions. This is used for information extraction from documents, log parsing, and data enrichment pipelines.¶

Classification Vector:¶

Table 14
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Structured (ST)
Latency Sensitivity	Throughput-Oriented (TO)
Concurrency Pattern	Sustained-High (SH)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 15
Parameter	Range
Input Length	64 to 2048 tokens
Output Length	20 to 200 tokens (structured entity list)

Key Performance Indicators:¶

Documents processed per second¶
Total request latency by input length¶
Throughput under batch processing¶

NER output length varies with input content density. Benchmarks SHOULD use representative documents with known entity density. This ensures output length distributions are realistic. Guided generation to a JSON schema is common. Whether it is enabled SHOULD be reported.¶

4.3.5. FUNC: Function Calling / Tool Selection

The model receives a user request and a set of tool or function definitions. It selects the right function and produces well-formed arguments. The tool schema definitions form a large and highly cacheable prefix.¶

Classification Vector:¶

Table 16
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Structured (ST)
Latency Sensitivity	Near-Real-Time (NR)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	Schema-Shared (H)

Typical Token Distribution:¶

Table 17
Parameter	Range
Input Length	512 to 4096 tokens (schema plus user query)
Output Length	20 to 200 tokens (function name plus JSON arguments)

Key Performance Indicators:¶

Total request latency¶
TTFT as an indicator of schema processing time¶
Prefix cache hit rate for shared tool schemas¶

Tool schemas vary widely in size. A small schema with 5 tools may be a few hundred tokens. A large schema with 50 tools and complex parameters may be several thousand tokens. Benchmarks SHOULD test with schema sizes of approximately 500, 2000, and 4000 tokens. Tool schemas are highly shareable as prefixes. Prefix caching effectiveness is a critical measurement. Constrained decoding to valid JSON is standard and SHOULD be enabled.¶

4.3.6. SQLN: SQL / Query Generation

The model receives a natural language question and a database schema definition. It produces a syntactically valid SQL query. The schema prefix is large and shared across queries against the same database.¶

Classification Vector:¶

Table 18
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Structured (ST)
Latency Sensitivity	Near-Real-Time (NR)
Concurrency Pattern	Single-Stream (SI) to Burst-Batch (BB)
Prefix Sharing	Schema-Shared (H)

Typical Token Distribution:¶

Table 19
Parameter	Range
Input Length	256 to 8192 tokens (schema plus question)
Output Length	20 to 300 tokens (SQL query)

Key Performance Indicators:¶

Total request latency¶
Prefix cache effectiveness for shared schema¶
Requests per second under concurrent users on the same schema¶

Database schemas range from simple to complex. A simple schema may have 5 tables and around 500 tokens. A complex schema may have 100 or more tables and 8000 or more tokens. Benchmarks SHOULD test with at least two schema sizes. Multiple queries against the same schema are the dominant production pattern. Prefix caching measurements are critical for this profile.¶

4.4. Group C: Interactive Streaming Profiles

Interactive streaming profiles serve users who watch tokens arrive in real time. These profiles share four key properties.¶

TTFT is critical for perceived responsiveness. ITL consistency determines perceived fluency. Multi-turn context accumulation is common. Prefix caching across conversation turns adds value.¶

For Group C profiles, all streaming metrics from [LLM-TERMS] apply. These are TTFT, ITL, TPOT, and output token throughput. The Throughput-Latency Tradeoff test from [LLM-METHOD] is especially important for this group.¶

4.4.1. CHAT: Conversational Chat

A user and the model have a multi-turn conversation. Each turn adds to the conversation history. This creates a growing input context. This is the most common interactive LLM workload. It is the default profile for general- purpose chat applications.¶

Classification Vector:¶

Table 20
Dimension	Value
I/O Ratio	Balanced (BA)
Output Constraint	Free-Form (FF)
Latency Sensitivity	Interactive (IN)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	Turn-Accumulated (T)

Typical Token Distribution:¶

Table 21
Parameter	Range
Input Length	64 to 16384 tokens (grows with turns)
Output Length	50 to 1000 tokens
Turns per conversation	3 to 20

Key Performance Indicators:¶

TTFT at various context lengths¶
P50 and P99 ITL¶
TTFT degradation as conversation grows¶
Prefix cache hit rate across turns¶

Chat benchmarks MUST test across multiple conversation depths. This shows how TTFT scales as context grows. Tests at turns 1, 5, 10, and 20 SHOULD be included at minimum. Input data SHOULD come from real multi-turn conversation datasets such as ShareGPT [SHAREGPT] or LMSYS-Chat [LMSYS]. This captures realistic length distributions. The time between turns, while the user reads and types, affects cache eviction pressure. This time SHOULD be set as a parameter in the benchmark.¶

4.4.2. COMP: Code Completion / Autocomplete

The model receives a code context. This includes the file content, cursor position, and surrounding code. It generates a short completion. This profile has very tight latency requirements, very short output, and high request frequency. Prefix sharing is strong because the same file context appears in many consecutive requests.¶

Classification Vector:¶

Table 22
Dimension	Value
I/O Ratio	Prefill-Dominant (PD) to Balanced (BA)
Output Constraint	Semi-Structured (SS)
Latency Sensitivity	Interactive (IN)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	Document-Shared (D)

Typical Token Distribution:¶

Table 23
Parameter	Range
Input Length	256 to 8192 tokens (file context)
Output Length	5 to 200 tokens (completion)

Key Performance Indicators:¶

TTFT (primary). Interactive deployments often target sub-200ms TTFT. Benchmarks SHOULD report TTFT at P50, P95, and P99.¶
Total completion latency¶
Prefix cache hit rate (consecutive edits share most context)¶

Code completion has the tightest TTFT requirements of any profile. Benchmarks SHOULD simulate realistic editing patterns. In these patterns, consecutive requests share 95 percent or more of input tokens with small modifications. Fill-in-the-middle (FIM) variants SHOULD be tested if the serving system supports it. In FIM, the model generates tokens to fill a gap within existing code rather than appending to the end. Both single-line and multi-line completions SHOULD be measured.¶

4.4.3. PERS: Dialogue / Roleplay / Persona

The model maintains a persistent character or persona defined by a large system prompt. This differs from generic chat. System prompts may be 2000 to 10000 tokens or more. They include character descriptions, world-building context, and behavioral guidelines. System prompt prefix caching is important when many users share the same persona.¶

Classification Vector:¶

Table 24
Dimension	Value
I/O Ratio	Balanced (BA)
Output Constraint	Free-Form (FF)
Latency Sensitivity	Interactive (IN)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	System-Prompt (S) plus Turn-Accumulated (T)

Typical Token Distribution:¶

Table 25
Parameter	Range
System Prompt	2000 to 10000 tokens
User Turn Input	16 to 256 tokens
Output Length	100 to 2000 tokens

Key Performance Indicators:¶

TTFT with large system prompts¶
System prompt prefix cache hit rate across concurrent users¶
ITL consistency during long responses¶

The key benchmark for this profile is performance under large shared system prompts with many concurrent users. When 1000 users share the same persona, the system prompt KV cache should be computed once. Benchmarks SHOULD measure TTFT and throughput as a function of the number of concurrent users sharing the same system prompt. Results with and without prefix caching SHOULD both be reported.¶

4.5. Group D: Prefill-Heavy Generative Profiles

Prefill-heavy profiles process long input contexts. They generate moderate-length output. The prefill phase dominates compute and memory requirements. These workloads stress KV cache capacity, long-context attention efficiency, and memory management.¶

The most important metrics for Group D profiles are TTFT (heavily influenced by prefill duration), prefill throughput in input tokens per second, memory utilization at peak context length, and total request latency.¶

4.5.1. SUMM: Summarization

The model reads a document or set of passages. It produces a condensed summary. This is one of the most common production LLM tasks. The input is often 10 to 100 times longer than the output. The prefill phase dominates total latency.¶

Classification Vector:¶

Table 26
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Free-Form (FF) to Semi-Structured (SS)
Latency Sensitivity	Near-Real-Time (NR) to Throughput-Oriented (TO)
Concurrency Pattern	Single-Stream (SI) to Sustained-High (SH)
Prefix Sharing	None (N)

Typical Token Distribution:¶

Table 27
Parameter	Range
Input Length	1024 to 32768 tokens
Output Length	64 to 1024 tokens

Key Performance Indicators:¶

TTFT by input length bucket¶
Total request latency¶
Prefill throughput in input tokens per second¶
Documents summarized per second¶

Summarization performance depends strongly on input length. Benchmarks MUST report results in input length buckets. Example buckets are 1K to 4K, 4K to 8K, 8K to 16K, and 16K to 32K tokens. Input data SHOULD come from document corpora with realistic length distributions rather than synthetic text. Both extractive summarization (shorter output) and abstractive summarization (longer output) SHOULD be tested.¶

4.5.2. LDQA: Long Document Question Answering

The model receives a long document and a question. It produces a concise answer from the document content. It differs from summarization in that the output is shorter and more targeted. It differs from short-form QA in that it has a large context document.¶

Classification Vector:¶

Table 28
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Free-Form (FF)
Latency Sensitivity	Near-Real-Time (NR)
Concurrency Pattern	Single-Stream (SI) to Burst-Batch (BB)
Prefix Sharing	Document-Shared (D)

Typical Token Distribution:¶

Table 29
Parameter	Range
Input Length	4096 to 131072 tokens
Output Length	16 to 512 tokens

Key Performance Indicators:¶

TTFT at various context lengths¶
TTFT scaling as context length doubles¶
Total request latency¶
Prefix cache effectiveness across multiple questions¶

Long document QA tests the interaction between context length and latency. Benchmarks SHOULD test with multiple questions over the same document. This measures prefix caching effectiveness. Context lengths SHOULD span at least three orders of magnitude. Example values are 4K, 16K, 64K, and 128K tokens. The position of the answer within the document can affect performance on some systems. This MAY be reported.¶

4.5.3. MDOC: Multi-Document Synthesis

The model receives multiple documents. It produces one output that synthesizes information across all of them. Examples are literature review, competitive analysis, and multi-source report generation. It differs from single-document summarization in that it requires reasoning across documents. It differs from RAG in that there is no retrieval step. All documents are provided in full.¶

Classification Vector:¶

Table 30
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Semi-Structured (SS)
Latency Sensitivity	Throughput-Oriented (TO)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	None (N)

Typical Token Distribution:¶

Table 31
Parameter	Range
Input Length	8192 to 131072 tokens (multiple documents)
Output Length	256 to 4096 tokens

Key Performance Indicators:¶

TTFT at extreme context lengths¶
Total request latency¶
Memory utilization at peak context¶
Output throughput during the decode phase¶

Multi-document synthesis pushes context length to its maximum. Benchmarks SHOULD report the number of documents and the per- document length separately. This shows whether performance depends on total token count alone or also on document structure. This profile is a stress test for memory management and long-context attention implementations.¶

4.5.4. EXTR: Structured Extraction

The model reads an input document. It extracts specific information into a defined output schema. Examples are invoice parsing, resume extraction, form processing, and log analysis. The output MUST conform to a specified JSON schema, XML structure, or tabular format.¶

Classification Vector:¶

Table 32
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Structured (ST)
Latency Sensitivity	Near-Real-Time (NR) to Throughput-Oriented (TO)
Concurrency Pattern	Sustained-High (SH)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 33
Parameter	Range
Input Length	256 to 16384 tokens
Output Length	50 to 500 tokens (structured JSON or XML)

Key Performance Indicators:¶

Documents processed per second¶
Total request latency¶
Guided generation overhead vs. an unconstrained baseline¶

Extraction workloads typically use constrained or guided decoding to ensure the output conforms to the schema. Benchmarks MUST report whether guided generation is enabled. They MUST report the schema complexity including the number of fields and nesting depth. Benchmarks SHOULD compare throughput with and without guided generation. This shows the overhead. Output validation rates, meaning the percentage of outputs that conform to the schema, SHOULD be reported alongside performance metrics.¶

4.5.5. JUDG: LLM-as-Judge / Evaluation

The model evaluates another model's output against specified criteria. It produces a score and optionally a justification. This is used in production for quality assurance, A/B testing, content moderation review, and RLHF reward modeling. The input includes the original prompt, the response to evaluate, and the scoring criteria.¶

Classification Vector:¶

Table 34
Dimension	Value
I/O Ratio	Prefill-Dominant (PD)
Output Constraint	Structured (ST) to Semi-Structured (SS)
Latency Sensitivity	Throughput-Oriented (TO)
Concurrency Pattern	Sustained-High (SH)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 35
Parameter	Range
Input Length	512 to 8192 tokens (criteria plus prompt plus response)
Output Length	10 to 300 tokens (score plus reasoning)

Key Performance Indicators:¶

Evaluations per second¶
Total request latency¶
Throughput under sustained batch load¶

LLM-as-Judge is a high-volume pipeline workload. Benchmarks SHOULD simulate evaluation of a corpus of responses. They SHOULD measure sustained throughput over thousands of evaluations. The scoring criteria form a cacheable system prompt prefix. Both score-only mode (minimal output) and score-with-reasoning mode (longer output) SHOULD be tested.¶

4.5.6. RAGN: Retrieval-Augmented Generation

The model receives retrieved context passages and a user query. It generates a grounded response. It differs from other prefill-heavy profiles in that the context is made of multiple retrieved chunks assembled together. It differs from LDQA in that the input context is shorter and is made of separately retrieved passages rather than a single contiguous document.¶

Classification Vector:¶

Table 36
Dimension	Value
I/O Ratio	Prefill-Dominant (PD) to Balanced (BA)
Output Constraint	Free-Form (FF)
Latency Sensitivity	Interactive (IN) to Near-Real-Time (NR)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 37
Parameter	Range
Retrieved Context	512 to 8192 tokens (3 to 20 chunks)
User Query	16 to 256 tokens
Total Input	528 to 8448 tokens
Output Length	64 to 1024 tokens

Key Performance Indicators:¶

TTFT (the user is typically waiting interactively)¶
Total request latency including time to first useful content¶
End-to-end latency including retrieval if measured at the compound system boundary¶

RAG benchmarks SHOULD vary the number and size of retrieved chunks. This shows how performance scales as context grows. Configurations of 3, 5, 10, and 20 retrieved chunks SHOULD be tested. The SUT boundary from [LLM-METHOD] is especially important for RAG. Measuring at the model engine boundary excludes retrieval latency. Measuring at the compound system boundary includes it. Both SHOULD be reported when the retrieval system is part of the SUT.¶

4.6. Group E: Decode-Heavy Generative Profiles

Decode-heavy profiles generate much more output than input. The decode phase dominates total latency and compute. These workloads stress memory bandwidth, ITL consistency, and sustained output token throughput.¶

For Group E profiles, output token throughput and ITL distribution are the primary metrics. TTFT is less important relative to total generation time.¶

4.6.1. CGEN: Content Generation

The model receives a short prompt or specification. It generates long-form content. Examples are blog posts, articles, marketing copy, emails, reports, and creative writing. Output is much longer than input and is free-form.¶

Classification Vector:¶

Table 38
Dimension	Value
I/O Ratio	Decode-Dominant (DD)
Output Constraint	Free-Form (FF)
Latency Sensitivity	Interactive (IN) to Near-Real-Time (NR)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 39
Parameter	Range
Input Length	32 to 1024 tokens
Output Length	256 to 8192 tokens

Key Performance Indicators:¶

Output token throughput in tokens per second¶
P50 and P99 ITL¶
ITL consistency over long generation (stutter detection)¶
Total generation time¶

Long-form generation is the main test of sustained decode performance. Benchmarks SHOULD measure ITL distribution across the full generation. Measuring only the initial tokens is not sufficient. Decode performance can degrade as the KV cache grows. Plotting ITL against output position is useful for detecting decode slowdown. Output lengths of 256, 1024, 2048, and 4096 or more tokens SHOULD be tested.¶

4.6.2. REAS: Reasoning / Chain-of-Thought

The model performs extended step-by-step reasoning before producing a final answer. This profile has grown in importance with reasoning-optimized models that produce explicit or hidden thinking tokens. The input is short. The output is very long, often 10 to 100 times the length of the final answer. Most of the output is intermediate reasoning steps.¶

Classification Vector:¶

Table 40
Dimension	Value
I/O Ratio	Decode-Dominant (DD)
Output Constraint	Free-Form (FF) to Semi-Structured (SS)
Latency Sensitivity	Near-Real-Time (NR)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 41
Parameter	Range
Input Length	64 to 2048 tokens
Reasoning Output	512 to 32768 tokens (thinking or CoT)
Final Answer	32 to 512 tokens
Total Output	544 to 33280 tokens

Key Performance Indicators:¶

Total generation time (may be 30 to 120 seconds or more)¶
Output token throughput sustained over long generation¶
ITL consistency over extended decode¶
Time to final answer if thinking tokens are hidden¶

Reasoning workloads produce the longest output sequences of any profile. This creates unique stress on KV cache management during decode. Hidden thinking tokens are a serving-system- specific feature. Benchmarks MUST distinguish between visible tokens and hidden thinking tokens if the serving system supports this. Serving systems that hide thinking tokens MUST document whether max_output_tokens applies to visible tokens only, to total tokens, or to both. This MUST be reported in the run parameters defined in Section 4.1. Time to first visible token, which comes after reasoning completes, is a meaningful metric. It differs from standard TTFT and SHOULD be reported when thinking tokens are hidden. Benchmarks SHOULD test with reasoning budgets of 1K, 4K, 16K, and 32K tokens. Memory pressure during extended reasoning is a critical measurement.¶

4.6.3. DGEN: Data Generation / Synthetic Data

The model receives a schema or specification. It generates multiple structured records. This is used for test data generation, training data augmentation, simulation, and content pipelines. The output is repetitive, structured, and produced at high volume.¶

Classification Vector:¶

Table 42
Dimension	Value
I/O Ratio	Decode-Dominant (DD)
Output Constraint	Structured (ST)
Latency Sensitivity	Throughput-Oriented (TO)
Concurrency Pattern	Sustained-High (SH)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 43
Parameter	Range
Input Length	128 to 2048 tokens (schema plus instructions)
Output Length	512 to 16384 tokens (multiple records)

Key Performance Indicators:¶

Output tokens per second (sustained)¶
Records generated per second¶
Throughput under high concurrency¶

Data generation is a throughput-oriented profile. TTFT and ITL are not primary metrics here. Benchmarks SHOULD measure sustained output throughput over large generation batches of 10K or more records. Request latency is still measurable and MAY be reported as a secondary metric. Guided generation to a JSON or CSV schema is standard and SHOULD be enabled. Schema validation rate SHOULD be reported.¶

4.6.4. TRNS: Translation

The model translates text from one natural language to another. The input-to-output ratio is approximately 1:1, varying by language pair. The full input content is consumed in producing the output.¶

Classification Vector:¶

Table 44
Dimension	Value
I/O Ratio	Balanced (BA)
Output Constraint	Free-Form (FF)
Latency Sensitivity	Near-Real-Time (NR) to Throughput-Oriented (TO)
Concurrency Pattern	Single-Stream (SI) to Sustained-High (SH)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 45
Parameter	Range
Input Length	32 to 8192 tokens
Output Length	32 to 10000 tokens
I/O Ratio	0.5:1 to 2:1 depending on language pair

Key Performance Indicators:¶

Total request latency by segment length¶
Output tokens per second¶
Throughput in segments translated per second¶

Token-to-token ratio varies by language pair because tokenizer coverage differs across languages. Benchmarks MUST specify the language pair and tokenizer used. CJK languages typically produce more tokens per semantic unit than European languages. Benchmarks SHOULD test with at least two language pairs. This shows tokenizer effects. Segment lengths SHOULD span sentence- level (32 to 64 tokens), paragraph-level (128 to 512 tokens), and document-level (2000 or more tokens).¶

4.6.5. EDIT: Rewriting / Editing

The model receives input text. It produces a transformed version of the same text. Examples are grammar correction, tone adjustment, style transfer, simplification, and paraphrasing. It differs from translation in that it works within the same language. It differs from summarization in that the output length is similar to the input length.¶

Classification Vector:¶

Table 46
Dimension	Value
I/O Ratio	Balanced (BA)
Output Constraint	Free-Form (FF)
Latency Sensitivity	Near-Real-Time (NR)
Concurrency Pattern	Single-Stream (SI)
Prefix Sharing	System-Prompt (S)

Typical Token Distribution:¶

Table 47
Parameter	Range
Input Length	64 to 4096 tokens
Output Length	64 to 4096 tokens (similar to input)

Key Performance Indicators:¶

Total request latency¶
Output token throughput¶
TTFT if streamed to the user for review¶

Rewriting produces output structurally similar to input. Some serving systems may use this similarity to enable speculative decoding or draft-based approaches. Benchmarks SHOULD report whether such optimizations are enabled.¶

4.7. Group F: Multi-Step Chained Profiles

Multi-step profiles involve sequences of dependent inference calls. Each call depends on the output of the previous call. Total user-perceived latency is the sum of all step latencies. These workloads test end-to-end pipeline performance rather than single-request optimization.¶

For Group F profiles, the primary metric is end-to-end pipeline latency. Individual step TTFT and total latency contribute to this but are not sufficient on their own.¶

4.7.1. AGNT: Agentic / Tool Use

The model operates in a loop. It receives an observation, decides on an action, receives the tool result, and continues until the task is complete. Each iteration is a separate inference call. The context grows across iterations as prior observations and actions accumulate.¶

Classification Vector:¶

Table 48
Dimension	Value
I/O Ratio	Balanced (BA) per step
Output Constraint	Structured (ST) per tool call; Free-Form (FF) for final answer
Latency Sensitivity	Near-Real-Time (NR) to Interactive (IN)
Concurrency Pattern	Chained-Sequential (CS)
Prefix Sharing	Turn-Accumulated (T)

Typical Token Distribution (per step):¶

Table 49
Parameter	Range
Input Length	512 to 16384 tokens (grows with steps)
Output Length	20 to 500 tokens per step
Steps per task	3 to 20

Key Performance Indicators:¶

End-to-end task completion time¶
Per-step TTFT (compounds across steps)¶
Per-step total latency¶
TTFT scaling as context accumulates across steps¶
Prefix cache hit rate across steps¶

Agentic workloads are especially sensitive to per-step TTFT. Latencies compound across steps. A 500ms TTFT per step over 10 steps adds 5 seconds of TTFT alone to total task completion time. Benchmarks MUST simulate realistic multi-step trajectories. They MUST include tool call outputs injected between steps. Context grows across steps, so prefix caching across turns is critical. Benchmarks SHOULD test with task lengths of 3, 5, 10, and 20 steps. Both successful paths and long paths SHOULD be tested.¶

4.7.2. PLAN: Planning / Task Decomposition

The model receives a high-level goal. It produces a structured plan with ordered steps, dependencies, and resource requirements. This is often the first step of an agentic pipeline. It may involve iterative refinement where the model reviews and revises its own plan.¶

Classification Vector:¶

Table 50
Dimension	Value
I/O Ratio	Balanced (BA) to Decode-Dominant (DD)
Output Constraint	Semi-Structured (SS)
Latency Sensitivity	Near-Real-Time (NR)
Concurrency Pattern	Chained-Sequential (CS)
Prefix Sharing	Turn-Accumulated (T)

Typical Token Distribution:¶

Table 51
Parameter	Range
Input Length	128 to 4096 tokens (goal plus context)
Output Length	256 to 2048 tokens (structured plan)
Refinement Rounds	1 to 3

Key Performance Indicators:¶

Total planning time including refinement rounds¶
Per-round latency¶
Output token throughput¶

Planning with iterative refinement requires the model to evaluate and modify its own prior output. Benchmarks SHOULD test both single-pass planning and multi-round refinement. Plan quality is outside the scope of performance benchmarking. The time to produce a plan of a given length and structure is measurable and SHOULD be reported.¶

Profile	TTFT	ITL	TPOT	Output Tok/s
Group A: Non-Generative
EMBED	-	-	-	-
XRANK	-	-	-	-
LOGPR	-	-	-	-
Group B: Minimal Output
CLAS	Applicable	-	-	-
SFQA	Primary	Applicable	-	-
RANK	Applicable	-	-	-
NER	Applicable	-	Applicable	-
FUNC	Primary	Applicable	-	-
SQLN	Applicable	-	-	-
Group C: Interactive
CHAT	Primary	Primary	Applicable	Applicable
COMP	Primary	Applicable	-	-
PERS	Primary	Primary	Applicable	Applicable
Group D: Prefill-Heavy
SUMM	Primary	Applicable	Applicable	Applicable
LDQA	Primary	Applicable	-	-
MDOC	Primary	Applicable	Applicable	Applicable
EXTR	Applicable	-	Applicable	Applicable
JUDG	Applicable	-	Applicable	-
RAGN	Primary	Primary	Applicable	Applicable
Group E: Decode-Heavy
CGEN	Applicable	Primary	Primary	Primary
REAS	Applicable	Primary	Primary	Primary
DGEN	-	-	Applicable	Primary
TRNS	Applicable	Applicable	Primary	Primary
EDIT	Applicable	Applicable	Primary	Applicable
Group F: Multi-Step
AGNT	Primary	Applicable	-	-
PLAN	Applicable	Applicable	Applicable	Applicable

Profile	Recommended Dataset(s)	Notes
EMBED	MTEB benchmark passages; MS MARCO passages	Query and passage length variants
XRANK	MS MARCO query-passage pairs	Candidate sets of 10 to 100
LOGPR	OpenWebText; The Pile subsets	Representative web text
CLAS	SST-2; AG News; GoEmotions	Binary, multi-class, multi-label variants
SFQA	Natural Questions; TriviaQA	Short-context factoid questions
RANK	MS MARCO; TREC-DL	Relevance judgment tasks
NER	CoNLL-2003; OntoNotes	Standard NER benchmarks
FUNC	Berkeley Function Calling Leaderboard	Varied tool schema complexity
SQLN	Spider; WikiSQL	Simple and complex schema variants
CHAT	ShareGPT; LMSYS-Chat-1M	Real multi-turn conversations; MUST be anonymized
COMP	The Stack; HumanEval	Real code files with cursor positions
PERS	Synthetic persona datasets	Large system prompts (2K to 10K tokens); see characterization requirements above
SUMM	CNN/DailyMail; arXiv; GovReport	Short and long document variants
LDQA	NarrativeQA; QuALITY; Scrolls	Long-document QA benchmarks
MDOC	Multi-XScience; WikiSum	Multi-source synthesis
EXTR	SQuAD extractive; CORD; SROIE	Schema-constrained extraction
JUDG	MT-Bench judgments; Arena-Hard	LLM evaluation datasets
RAGN	Natural Questions plus retriever output	Realistic retrieved chunk assembly
CGEN	WritingPrompts; Alpaca instructions	Creative and instructional generation
REAS	GSM8K; MATH; ARC-Challenge	Problems requiring extended reasoning
DGEN	Spider schema definitions; synthetically generated JSON schemas	Schema-directed generation; implementers MUST document schema generation method
TRNS	WMT translation benchmarks	Multiple language pairs
EDIT	JFLEG; IteraTeR	Grammar correction and style transfer
AGNT	SWE-bench; WebArena trajectories	Multi-step tool use traces
PLAN	ALFWorld; HotpotQA decompositions	Planning and decomposition tasks

Profile	Code	Group	I/O Ratio	Output Constraint
Embedding Generation	EMBED	A	PO	NT
Cross-Encoder Reranking	XRANK	A	PO	NT
Logprob Scoring	LOGPR	A	PO	NT
Classification	CLAS	B	PD	MI
Short-Form QA	SFQA	B	BA-PD	FF
Generative Ranking	RANK	B	PD	ST-MI
Entity Recognition	NER	B	PD	ST
Function Calling	FUNC	B	PD	ST
SQL Generation	SQLN	B	PD	ST
Conversational Chat	CHAT	C	BA	FF
Code Completion	COMP	C	PD-BA	SS
Persona Dialogue	PERS	C	BA	FF
Summarization	SUMM	D	PD	FF-SS
Long Document QA	LDQA	D	PD	FF
Multi-Doc Synthesis	MDOC	D	PD	SS
Structured Extraction	EXTR	D	PD	ST
LLM-as-Judge	JUDG	D	PD	ST-SS
RAG	RAGN	D	PD-BA	FF
Content Generation	CGEN	E	DD	FF
Reasoning / CoT	REAS	E	DD	FF-SS
Data Generation	DGEN	E	DD	ST
Translation	TRNS	E	BA	FF
Rewriting / Editing	EDIT	E	BA	FF
Agentic / Tool Use	AGNT	F	BA/step	ST+FF
Planning	PLAN	F	BA-DD	SS

Profile	Latency Class	Concurrency	Prefix Sharing	Input Tokens	Output Tokens
EMBED	NR-TO	SH	N	16-8K	N/A (vector)
XRANK	NR	BB	D	128-1K	N/A (score)
LOGPR	TO	SH	N	64-4K	N/A (probs)
CLAS	NR-TO	SH	S	32-2K	1-10
SFQA	IN-NR	SI	S	16-512	5-100
RANK	NR	BB	D	256-4K	5-50
NER	TO	SH	S	64-2K	20-200
FUNC	NR	SI	H	512-4K	20-200
SQLN	NR	SI-BB	H	256-8K	20-300
CHAT	IN	SI	T	64-16K	50-1K
COMP	IN	SI	D	256-8K	5-200
PERS	IN	SI	S+T	2K-10K sys	100-2K
SUMM	NR-TO	SI-SH	N	1K-32K	64-1K
LDQA	NR	SI-BB	D	4K-128K	16-512
MDOC	TO	SI	N	8K-128K	256-4K
EXTR	NR-TO	SH	S	256-16K	50-500
JUDG	TO	SH	S	512-8K	10-300
RAGN	IN-NR	SI	S	528-8K	64-1K
CGEN	IN-NR	SI	S	32-1K	256-8K
REAS	NR	SI	S	64-2K	544-33K
DGEN	TO	SH	S	128-2K	512-16K
TRNS	NR-TO	SI-SH	S	32-8K	32-10K
EDIT	NR	SI	S	64-4K	64-4K
AGNT	NR-IN	CS	T	512-16K	20-500/step
PLAN	NR	CS	T	128-4K	256-2K