<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="rfc7991bis.rnc"?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     ipr="trust200902"
     docName="draft-mondal-llm-serving-workload-profiles-00"
     category="info"
     submissionType="independent"
     version="3">

  <front>
    <title abbrev="LLM Benchmarking Profiles">Benchmarking Workload Profiles for Large Language Model Serving</title>

    <seriesInfo name="Internet-Draft" value="draft-mondal-llm-serving-workload-profiles-00"/>

    <author fullname="Mohadeb Mondal" initials="M." surname="Mondal">
      <organization>Independent</organization>
      <address>
        <email>mohadeb.mondal@gmail.com</email>
      </address>
    </author>

    <author fullname="Madhava Gaikwad" initials="M." surname="Gaikwad">
      <organization>Independent</organization>
      <address>
        <email>gaikwad.madhav@gmail.com</email>
      </address>
    </author>

    <date year="2026" month="April" day="10"/>

    <keyword>LLM</keyword>
    <keyword>benchmarking</keyword>
    <keyword>workload profiles</keyword>
    <keyword>inference</keyword>
    <keyword>large language model</keyword>

    <abstract>
      <t>
        This document defines standard workload profiles for benchmarking
        Large Language Model (LLM) inference serving systems. Each profile
        represents one class of production workload. A profile is defined
        by its input and output token distribution, output structure,
        latency sensitivity, concurrency pattern, and caching behavior.
        Profiles are organized into six groups: Non-Generative, Minimal
        Output, Interactive Streaming, Prefill-Heavy Generative, Decode-Heavy
        Generative, and Multi-Step Chained. This document complements
        "Benchmarking Methodology for Large Language Model Serving"
        <xref target="LLM-METHOD"/> by specifying the workloads that the
        methodology's tests should be run against.
      </t>
    </abstract>
  </front>

  <middle>

    <section anchor="introduction" numbered="true" toc="default">
      <name>Introduction</name>
      <t>
        The companion document <xref target="LLM-METHOD"/> defines test procedures,
        measurement specifications, and reporting formats for LLM inference
        systems. A second companion <xref target="LLM-TERMS"/> defines the metrics
        vocabulary. Together these two documents specify how to measure.
        This document specifies what to measure against.
      </t>
      <t>
        Production LLM deployments serve many different task types. These
        tasks differ in fundamental inference properties.
      </t>
      <ul>
        <li>Some workloads have no decode phase at all; embedding
          workloads are one example.</li>
        <li>Some workloads produce free-form text, while others produce
          output constrained to a schema or a label set.</li>
        <li>Some users wait interactively for streaming tokens, while
          others consume results in batch.</li>
        <li>Some requests share a large input prefix with other
          requests; others arrive independently.</li>
        <li>Some workloads are a single inference call; others are a
          chain of dependent calls.</li>
      </ul>
      <t>
        These properties determine which metrics from <xref target="LLM-TERMS"/> are
        meaningful. They determine which tests from <xref target="LLM-METHOD"/> apply.
        They determine what performance characteristics matter in
        production.
      </t>
      <t>
        Without standard profiles, benchmark results are hard to compare.
        A system tuned for chat may perform poorly on summarization. A
        system benchmarked only on synthetic uniform workloads may not
        match any real production scenario. This document addresses that gap.
      </t>

      <section anchor="relationship" numbered="true" toc="default">
        <name>Relationship to Companion Documents</name>
        <t>
          This document is the third in a series.
        </t>
        <t>
          <xref target="LLM-TERMS"/> defines the metrics vocabulary. Examples are Time to
          First Token, Inter-Token Latency, and Output Token Throughput.
        </t>
        <t>
          <xref target="LLM-METHOD"/> defines test procedures. Examples are how to measure
          TTFT, how to set up load generators, and how to format reports.
        </t>
        <t>
          This document defines workload profiles. It specifies what
          representative tasks to benchmark against. It maps each profile
          to the applicable metrics and tests.
        </t>
        <t>
          Implementers SHOULD select profiles that match their production
          workload mix. They SHOULD apply the tests from <xref target="LLM-METHOD"/> to
          those profiles. They SHOULD report results per profile.
        </t>
      </section>

      <section anchor="structure" numbered="true" toc="default">
        <name>Document Structure</name>
        <t>
          <xref target="framework"/> defines the classification framework. It uses five
          dimensions to describe each profile. <xref target="profiles"/> defines the
          common benchmark run parameters that MUST be reported for all
          profiles (<xref target="common-params"/>). Sections <xref target="group-a" format="counter"/>
          through <xref target="group-f" format="counter"/> define the
          profiles. <xref target="applicability"/> maps profiles to applicable metrics.
          <xref target="mixed"/> covers mixed-workload composition. <xref target="cross-cutting"/>
          covers dimensions that apply to any profile. <xref target="iana"/> covers IANA
          considerations.
        </t>
      </section>

      <section anchor="out-of-scope" numbered="true" toc="default">
        <name>Out of Scope</name>
        <t>
          The following topics are outside the scope of this document.
        </t>
        <t>
          Model training is out of scope. This document covers inference
          serving only.
        </t>
        <t>
          Model quality is out of scope. Output correctness and accuracy
          are not performance metrics here. The one exception is output
          validity rates for structured output profiles, which MAY be
          reported as a secondary metric.
        </t>
        <t>
          Network and client-side latency are out of scope. The exception
          is when the system under test (SUT) boundary is defined to
          include network transport per <xref target="LLM-METHOD"/>.
        </t>
        <t>
          Multimodal workloads are out of scope. This document covers
          text-only LLM serving. Multimodal input and output such as
          images, audio, and video are left for future work.
        </t>
      </section>
    </section>

    <section anchor="requirements" numbered="true" toc="default">
      <name>Requirements Language</name>
      <t>
        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
        NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
        "MAY", and "OPTIONAL" in this document are to be interpreted as
        described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they
        appear in all capitals, as shown here.
      </t>
    </section>

    <section anchor="framework" numbered="true" toc="default">
      <name>Profile Classification Framework</name>
      <t>
        Each profile is described along five dimensions. These dimensions
        allow systematic comparison across profiles. They help
        implementers understand the inference properties of each workload.
      </t>

      <section anchor="io-ratio" numbered="true" toc="default">
        <name>Input/Output Ratio Class</name>
        <t>
          This ratio is computed as input tokens divided by output tokens.
          It determines the balance between the prefill phase and the decode
          phase. It is often the main factor in whether a workload is
          compute-bound or memory-bandwidth-bound. The exact boundary
          depends on model architecture, batch size, quantization, and
          attention implementation.
        </t>
        <t>
          For profiles that produce hidden thinking tokens (see <xref target="reas"/>),
          those tokens count as output tokens for the ratio. The
          exception is when the serving system hides them entirely from
          the caller. In that case, the serving system MUST document
          whether max_output_tokens applies to visible tokens only, to
          total tokens, or to both.
        </t>
        <dl newline="false" spacing="normal">
          <dt>Prefill-Only (PO):</dt>
          <dd>No token generation. The output is a
            vector, a score, or a probability distribution. Examples are
            embedding, cross-encoder reranking, and perplexity scoring.</dd>
          <dt>Prefill-Dominant (PD):</dt>
          <dd>Input is much larger than output.
            The ratio is greater than 10:1. Examples are summarization,
            classification, and extraction.</dd>
          <dt>Balanced (BA):</dt>
          <dd>Input and output are comparable in length.
            The ratio is between 1:3 and 3:1. Examples are translation,
            conversational chat, and rewriting.</dd>
          <dt>Decode-Dominant (DD):</dt>
          <dd>Output is much larger than input.
            The ratio is less than 1:10. Examples are content generation,
            creative writing, and data generation.</dd>
        </dl>
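        <t>
          The class boundaries above can be computed mechanically from
          token counts. The following sketch is illustrative only; the
          function name is hypothetical, and ratios falling in the gaps
          between the defined classes are left unclassified.
        </t>
        <sourcecode type="python"><![CDATA[
def io_ratio_class(input_tokens: int, output_tokens: int) -> str:
    """Classify a request by input/output token ratio.

    Illustrative sketch of the boundaries in this section; hidden
    thinking tokens, where present, count as output tokens.
    """
    if output_tokens == 0:
        return "PO"  # Prefill-Only: no token generation
    ratio = input_tokens / output_tokens
    if ratio > 10:
        return "PD"  # Prefill-Dominant: ratio greater than 10:1
    if 1 / 3 <= ratio <= 3:
        return "BA"  # Balanced: between 1:3 and 3:1
    if ratio < 1 / 10:
        return "DD"  # Decode-Dominant: ratio less than 1:10
    return "Unclassified"  # falls in a gap between defined classes
]]></sourcecode>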
      </section>

      <section anchor="output-constraint" numbered="true" toc="default">
        <name>Output Constraint Level</name>
        <t>
          This dimension describes how much the output format is
          constrained. It affects decode efficiency, guided generation
          overhead, and output validation requirements.
        </t>
        <dl newline="false" spacing="normal">
          <dt>Non-Token (NT):</dt>
          <dd>The output is not a token sequence. Examples
            are vectors, floating-point scores, and probability distributions.</dd>
          <dt>Minimal (MI):</dt>
          <dd>The output is a single token, a short label,
            or a numeric score. Examples are classification and yes/no
            decisions.</dd>
          <dt>Structured (ST):</dt>
          <dd>The output MUST conform to a defined schema.
            Examples are JSON objects, SQL queries, and function call
            arguments.</dd>
          <dt>Semi-Structured (SS):</dt>
          <dd>The output follows a general format
            but has variable content. Examples are Markdown, numbered lists,
            and tables.</dd>
          <dt>Free-Form (FF):</dt>
          <dd>There are no structural constraints. Examples
            are natural language text, creative writing, and conversational
            responses.</dd>
        </dl>
      </section>

      <section anchor="latency-sensitivity" numbered="true" toc="default">
        <name>Latency Sensitivity Class</name>
        <t>
          This dimension describes how much latency the consumer can
          tolerate. It determines whether TTFT, ITL, or throughput is the
          primary optimization target.
        </t>
        <t>
          The latency targets listed below are informative. They reflect
          targets commonly used in production. They are not normative
          benchmark pass/fail thresholds. Actual targets depend on model
          size, hardware, network configuration, and operator SLOs.
        </t>
        <dl newline="false" spacing="normal">
          <dt>Interactive (IN):</dt>
          <dd>The user watches tokens arrive in real
            time. Interactive deployments often target sub-500ms TTFT as a
            user-experience goal. ITL consistency is critical. Examples are
            chat and code completion.</dd>
          <dt>Near-Real-Time (NR):</dt>
          <dd>The response is needed within seconds.
            It is usually not streamed token by token. Production deployments
            commonly target total latency below 5 seconds. Examples are
            search ranking, extraction, and function calling.</dd>
          <dt>Throughput-Oriented (TO):</dt>
          <dd>Latency is secondary. The goal is
            to maximize aggregate throughput. Examples are batch
            classification, bulk summarization, and data generation.</dd>
        </dl>
      </section>

      <section anchor="concurrency-pattern" numbered="true" toc="default">
        <name>Concurrency Pattern</name>
        <t>
          This dimension describes the typical request arrival and batching
          pattern. It affects scheduler behavior and resource utilization.
        </t>
        <dl newline="false" spacing="normal">
          <dt>Single-Stream (SI):</dt>
          <dd>Requests arrive individually. They
            follow a Poisson or similar arrival process. Examples are
            general API usage and chat.</dd>
          <dt>Burst-Batch (BB):</dt>
          <dd>Groups of related requests arrive
            together. Examples are ranking a set of candidates and
            classifying a document batch.</dd>
          <dt>Sustained-High (SH):</dt>
          <dd>A continuous high-volume stream of
            requests arrives at a stable rate. Examples are production
            pipelines and data processing.</dd>
          <dt>Chained-Sequential (CS):</dt>
          <dd>Each request depends on the output
            of the previous request. Examples are agentic workflows and
            planning pipelines.</dd>
        </dl>
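        <t>
          As a concrete illustration of the Single-Stream pattern, the
          sketch below draws inter-arrival gaps from an exponential
          distribution, which yields a Poisson arrival process at the
          given mean rate. This is an illustrative load-generator
          fragment, not a normative procedure; real load generators add
          warm-up and ramp phases.
        </t>
        <sourcecode type="python"><![CDATA[
import random

def poisson_arrival_times(rate_rps: float, duration_s: float,
                          seed: int = 0) -> list:
    """Generate Single-Stream (SI) arrival timestamps in seconds.

    Exponentially distributed inter-arrival gaps produce a Poisson
    process with a mean of rate_rps requests per second.
    """
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)  # next inter-arrival gap
        if t >= duration_s:
            return arrivals
        arrivals.append(t)
]]></sourcecode>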
      </section>

      <section anchor="prefix-sharing" numbered="true" toc="default">
        <name>Prefix Sharing Pattern</name>
        <t>
          This dimension describes how much requests share common input
          prefixes. It affects KV cache utilization and memory efficiency.
        </t>
        <t>
          In this document, prefix sharing means cross-request shared
          prefill reuse. A serving system retains and reuses the computed
          key-value (KV) state for a shared input prefix across multiple
          distinct requests. This is different from two other things that
          are sometimes called caching.
        </t>
        <t>
          The first is intra-request KV cache. This is the normal
          decode-phase cache retained within a single request. It is present in
          all generative profiles. It is not what this dimension measures.
        </t>
        <t>
          The second is external result caching. This is deduplication of
          identical complete requests at an API gateway or proxy layer.
          This is out of scope for this document.
        </t>
        <t>
          When a profile has a non-None prefix sharing value, benchmarks
          MUST report prefix cache hit rates. A cache hit means the serving
          system determined that the leading N tokens of an incoming
          request match a cached prefix exactly. The match is made
          token-by-token or via an equivalent hash. The cached KV state is then
          reused without recomputation. Serving systems MUST document the
          cache eviction policy and the cache type as part of the run
          parameters defined in <xref target="common-params"/>.
        </t>
        <dl newline="false" spacing="normal">
          <dt>None (N):</dt>
          <dd>Each request has a unique prefix. An example is
            independent short queries.</dd>
          <dt>System-Prompt (S):</dt>
          <dd>A shared system prompt appears across
            requests. It is typically 100 to 2000 tokens. Most API
            deployments follow this pattern.</dd>
          <dt>Document-Shared (D):</dt>
          <dd>Multiple requests share a large document
            context. It is typically thousands to tens of thousands of
            tokens. Examples are ranking candidates against a shared passage
            and multi-question QA over a document.</dd>
          <dt>Turn-Accumulated (T):</dt>
          <dd>The prefix grows across conversation
            turns. Examples are multi-turn chat and agentic tool loops.</dd>
          <dt>Schema-Shared (H):</dt>
          <dd>Requests share a tool or function schema
            as a prefix. Examples are function calling and SQL generation
            with a shared database schema.</dd>
        </dl>
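        <t>
          The hit-rate reporting described in this section can be
          sketched as follows. The cache model here (exact leading-token
          match against a static set of cached prefixes, with no
          eviction and no block granularity) is a deliberate
          simplification for illustration.
        </t>
        <sourcecode type="python"><![CDATA[
def prefix_hit_stats(requests, cached_prefixes):
    """Compute prefix cache hit statistics over token-ID sequences.

    A hit means the leading tokens of a request exactly match one of
    the cached prefixes. Returns (hit_rate, reused_token_fraction).
    Illustrative sketch only.
    """
    hits = reused = total = 0
    for req in requests:
        total += len(req)
        best = 0  # longest matching cached prefix for this request
        for prefix in cached_prefixes:
            n = len(prefix)
            if len(req) >= n and req[:n] == prefix:
                best = max(best, n)
        if best > 0:
            hits += 1
            reused += best
    return hits / len(requests), reused / total
]]></sourcecode>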
      </section>
    </section>

    <section anchor="profiles" numbered="true" toc="default">
      <name>Profile Groups and Definitions</name>
      <t>
        This section defines 25 workload profiles in six groups.
        <xref target="common-params"/> specifies common benchmark run parameters that apply
        to all profiles. Sections <xref target="group-a" format="counter"/> through
        <xref target="group-f" format="counter"/> define the profiles.
      </t>
      <t>
        Each profile includes the following.
      </t>
      <dl newline="false" spacing="normal">
        <dt>Description:</dt>
        <dd>What the task is and why it is a distinct benchmarking profile.</dd>
        <dt>Classification Vector:</dt>
        <dd>The five-dimensional classification from <xref target="framework"/>.</dd>
        <dt>Typical Token Distribution:</dt>
        <dd>Representative input and output token length ranges.</dd>
        <dt>Key Performance Indicators:</dt>
        <dd>The most important metrics from <xref target="LLM-TERMS"/> for this profile.</dd>
        <dt>Benchmarking Considerations:</dt>
        <dd>Profile-specific setup, measurement, or reporting guidance.</dd>
      </dl>

      <section anchor="common-params" numbered="true" toc="default">
        <name>Common Benchmark Run Parameters</name>
        <t>
          Benchmark results are only comparable when the configuration is
          fully specified. The following parameters MUST be reported for
          every benchmark run. They apply in addition to any
          profile-specific requirements in Sections <xref target="group-a" format="counter"/>
          through <xref target="group-f" format="counter"/>.
        </t>
        <table align="left">
          <thead>
            <tr><th>Parameter</th><th>Requirement</th></tr>
          </thead>
          <tbody>
            <tr><td>Model identifier and version</td><td>MUST report exact model name and version string</td></tr>
            <tr><td>Tokenizer name and version</td><td>MUST report; tokenizer affects all token count measurements</td></tr>
            <tr><td>Decoding strategy</td><td>MUST report: greedy, sampling, or beam search</td></tr>
            <tr><td>Temperature</td><td>MUST report if sampling; N/A for greedy</td></tr>
            <tr><td>top_p</td><td>MUST report if sampling</td></tr>
            <tr><td>top_k</td><td>MUST report if sampling</td></tr>
            <tr><td>max_output_tokens</td><td>MUST report; specify whether limit applies to visible tokens, total tokens, or both</td></tr>
            <tr><td>Stop sequences</td><td>MUST report all stop conditions used</td></tr>
            <tr><td>Streaming enabled</td><td>MUST report: enabled or disabled</td></tr>
            <tr><td>Server token flush policy</td><td>MUST report if streaming enabled (e.g., per-token, batched-N, time-based)</td></tr>
            <tr><td>Guided decoding enabled</td><td>MUST report: enabled or disabled</td></tr>
            <tr><td>Guided decoding method</td><td>MUST report if enabled (e.g., grammar-based, JSON schema, logit masking)</td></tr>
            <tr><td>Guided decoding schema size</td><td>MUST report if enabled: schema token count and nesting depth</td></tr>
            <tr><td>Prefix cache enabled</td><td>MUST report: enabled or disabled</td></tr>
            <tr><td>Prefix cache type</td><td>MUST report if enabled (e.g., block-level KV cache, full-prefix cache)</td></tr>
            <tr><td>Prefix cache eviction policy</td><td>MUST report if enabled (e.g., LRU, TTL-based, capacity limit)</td></tr>
            <tr><td>Hardware configuration</td><td>MUST report: accelerator type, count, and memory capacity</td></tr>
            <tr><td>Serving framework and version</td><td>MUST report</td></tr>
            <tr><td>Quantization</td><td>MUST report: precision (e.g., fp16, int8, int4) and method if applicable</td></tr>
          </tbody>
        </table>
        <t>
          Implementers SHOULD also report the version of any load
          generation tool used. They SHOULD report the network topology
          between the load generator and the SUT.
        </t>
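        <t>
          As a non-normative illustration, the required parameters can
          be captured in a machine-readable record such as the one
          sketched below. The field names are assumptions of this
          sketch, not a defined reporting schema.
        </t>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunParameters:
    """Per-run reporting record; field names are illustrative only."""
    model: str
    tokenizer: str
    decoding_strategy: str        # "greedy", "sampling", or "beam"
    max_output_tokens: int
    max_output_tokens_scope: str  # "visible", "total", or "both"
    streaming: bool
    guided_decoding: bool
    prefix_cache: bool
    hardware: str
    serving_framework: str
    quantization: str
    stop_sequences: list = field(default_factory=list)
    temperature: Optional[float] = None  # required only if sampling
    top_p: Optional[float] = None        # required only if sampling
    top_k: Optional[int] = None          # required only if sampling
    flush_policy: Optional[str] = None   # required only if streaming
]]></sourcecode>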
      </section>

      <section anchor="group-a" numbered="true" toc="default">
        <name>Group A: Non-Generative Profiles</name>
        <t>
          Non-generative profiles produce no output token sequence. The
          inference pipeline runs only the prefill pass. It returns a
          non-token result. This result is a vector embedding, a scalar
          score, or a per-token probability distribution.
        </t>
        <t>
          There is no decode phase. The following metrics from <xref target="LLM-TERMS"/>
          do not apply to Group A profiles.
        </t>
        <ul>
          <li>Time to First Token (TTFT)</li>
          <li>Inter-Token Latency (ITL)</li>
          <li>Output Token Throughput</li>
          <li>Time Per Output Token (TPOT)</li>
        </ul>
        <t>
          Group A profiles MUST instead be measured using the following.
        </t>
        <dl newline="false" spacing="normal">
          <dt>Request Latency:</dt>
          <dd>Total time from request submission to result receipt.</dd>
          <dt>Sequences Per Second:</dt>
          <dd>Number of complete inference requests
            processed per second under the specified concurrency. For Group A
            profiles, one sequence equals one request. Seq/s and Req/s are
            the same here. When a single API call submits multiple sequences
            in a batch, throughput MUST be reported as sequences per second,
            not API calls per second.</dd>
          <dt>Prefill Throughput:</dt>
          <dd>Input tokens processed per second.</dd>
        </dl>
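        <t>
          The metrics above can be derived from a per-request log of
          submission time, completion time, and input token count, as
          in the illustrative sketch below. The record layout and
          nearest-rank percentile method are assumptions of the sketch,
          not requirements.
        </t>
        <sourcecode type="python"><![CDATA[
def group_a_metrics(records):
    """Compute Group A metrics from (t_submit, t_done, input_tokens).

    Returns request latency P50/P99 (nearest-rank), sequences per
    second, and prefill throughput in input tokens per second.
    """
    lat = sorted(done - sub for sub, done, _ in records)
    def pct(p):  # nearest-rank percentile over sorted latencies
        return lat[min(len(lat) - 1, int(p / 100 * len(lat)))]
    span = max(d for _, d, _ in records) - min(s for s, _, _ in records)
    tokens = sum(n for _, _, n in records)
    return {
        "p50_latency_s": pct(50),
        "p99_latency_s": pct(99),
        "sequences_per_s": len(records) / span,
        "prefill_tokens_per_s": tokens / span,
    }
]]></sourcecode>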

        <section anchor="embed" numbered="true" toc="default">
          <name>EMBED: Embedding Generation</name>
          <t>
            The model encodes input text into a fixed-size
            dense vector. This vector is used for semantic search, retrieval,
            clustering, and similarity computation. This is one of the
            highest-volume LLM inference workloads in production. Every RAG
            pipeline, recommendation system, and semantic search index
            depends on embedding generation.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Only (PO)</td></tr>
              <tr><td>Output Constraint</td><td>Non-Token (NT)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR) to Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Sustained-High (SH)</td></tr>
              <tr><td>Prefix Sharing</td><td>None (N)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>16 to 8192 tokens</td></tr>
              <tr><td>Output</td><td>Fixed-size vector (e.g., 768, 1024, 1536, or 3072 dimensions)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Sequences per second at target batch size</li>
            <li>P50 and P99 request latency under load</li>
            <li>Prefill throughput in input tokens per second</li>
            <li>Throughput scaling with batch size</li>
          </ul>
          <t>
            Embedding workloads benefit from large batches. Benchmarks
            SHOULD measure throughput across batch sizes of 1, 8, 32, 64,
            128, and 256 to show batching efficiency. Input length
            SHOULD come from a representative corpus. Real embedding
            workloads have high length variance because short queries and
            long passages are both common. Benchmarks SHOULD report latency
            and throughput separately for query-length inputs (fewer than
            64 tokens) and passage-length inputs (256 to 8192 tokens).
            These two cases often serve different stages of a retrieval
            pipeline.
          </t>
        </section>

        <section anchor="xrank" numbered="true" toc="default">
          <name>XRANK: Cross-Encoder Reranking</name>
          <t>
            The model receives a query and a document. It
            produces a relevance score without generating tokens.
            Cross-encoder reranking is used in search pipelines to refine initial
            retrieval results. The model processes the concatenated query and
            document pair. It outputs a scalar relevance score.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Only (PO)</td></tr>
              <tr><td>Output Constraint</td><td>Non-Token (NT)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Burst-Batch (BB)</td></tr>
              <tr><td>Prefix Sharing</td><td>Document-Shared (D)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>128 to 1024 tokens (query plus document)</td></tr>
              <tr><td>Output</td><td>Single scalar score</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Pairs scored per second at target batch size</li>
            <li>P50 and P99 request latency per scoring batch</li>
            <li>Throughput scaling as candidate set size grows</li>
          </ul>
          <t>
            Cross-encoder reranking typically processes a batch of
            candidates. This is commonly 10 to 100 query-document pairs.
            The query is shared across all pairs. Benchmarks SHOULD measure
            performance with the shared query prefix to capture KV cache
            benefits. Candidate set sizes of 10, 25, 50, and 100 SHOULD be
            tested. Latency SHOULD be reported for the full candidate set.
            This is the time to score all candidates, not just one pair.
            This matches real search pipeline requirements.
          </t>
        </section>

        <section anchor="logpr" numbered="true" toc="default">
          <name>LOGPR: Logprob / Perplexity Scoring</name>
          <t>
            The model processes input text. It returns
            per-token log probabilities or a sequence-level perplexity score. It
            does not generate new tokens. This is used for AI-generated
            content detection, model evaluation, data quality filtering,
            watermark detection, and calibration. The model runs a forward
            pass over the input. It returns the probability distribution at
            each token position.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Only (PO)</td></tr>
              <tr><td>Output Constraint</td><td>Non-Token (NT)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Sustained-High (SH)</td></tr>
              <tr><td>Prefix Sharing</td><td>None (N)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>64 to 4096 tokens</td></tr>
              <tr><td>Output</td><td>Per-token logprobs or aggregate perplexity score</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Sequences scored per second</li>
            <li>Input tokens processed per second</li>
            <li>P50 and P99 request latency by input length bucket</li>
          </ul>
          <t>
            Logprob scoring is computationally the same as the prefill phase
            of generative inference. The difference is that it returns
            probability data instead of starting a decode phase. It differs
            from embedding in one key way. The full vocabulary probability
            distribution may be returned at each position. This creates
            large output data volume even though no tokens are generated.
            Benchmarks SHOULD specify whether full vocabulary logprobs or
            only top-k logprobs are returned. This choice significantly
            affects data transfer overhead. Input length SHOULD span the
            full range to show prefill scaling.
          </t>
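          <t>
            The data-volume difference between full-vocabulary and
            top-k logprobs can be shown with simple arithmetic. The
            vocabulary size and float width below are assumptions
            chosen for illustration, not properties of any particular
            model.
          </t>
          <sourcecode type="python"><![CDATA[
def logprob_payload_bytes(positions, vocab_size=128_000,
                          top_k=None, bytes_per_float=4):
    """Estimate raw logprob payload size for one scored sequence.

    Full-vocabulary mode returns vocab_size floats per position;
    top-k mode returns only top_k floats per position.
    """
    per_position = top_k if top_k is not None else vocab_size
    return positions * per_position * bytes_per_float

# For a 4096-token input: full vocabulary is roughly 2 GB of raw
# floats, while top-5 is under 100 KB.
full = logprob_payload_bytes(4096)
top5 = logprob_payload_bytes(4096, top_k=5)
]]></sourcecode>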
        </section>
      </section>

      <section anchor="group-b" numbered="true" toc="default">
        <name>Group B: Minimal Output Profiles</name>
        <t>
          Minimal output profiles generate a small number of tokens. The
          output is a label, a score, a short answer, or a small structured
          object. The decode phase is short relative to prefill. These
          workloads are typically prefill-dominant and work well with
          batching.
        </t>
        <t>
          For Group B profiles, the metric priorities differ from those
          of streaming workloads.
        </t>
        <ul>
          <li>Total Request Latency is the primary latency metric; TTFT
            and ITL are not primary here.</li>
          <li>Request Throughput in requests per second is more useful
            than output token throughput.</li>
          <li>TTFT is still measurable; it is approximately equal to
            total latency because the decode phase is very short.</li>
        </ul>

        <section anchor="clas" numbered="true" toc="default">
          <name>CLAS: Classification and Labeling</name>
          <t>
            The model reads input text. It produces a
            categorical label or set of labels from a predefined taxonomy.
            Examples are sentiment analysis, content moderation, topic
            categorization, intent detection, and spam filtering.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Minimal (MI)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR) to Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Sustained-High (SH)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>32 to 2048 tokens</td></tr>
              <tr><td>Output Length</td><td>1 to 10 tokens</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Requests classified per second</li>
            <li>P50 and P99 total request latency</li>
            <li>Throughput scaling with batch size</li>
          </ul>
          <t>
            Classification workloads often use constrained decoding or guided
            generation to restrict output to valid labels. Benchmarks SHOULD
            report whether guided generation is enabled and which method
            is used, such as grammar-based decoding or logit masking.
            The system prompt, which typically includes the label
            taxonomy, SHOULD be included in the benchmark setup. The
            label set size SHOULD be specified, since binary,
            multi-class, and multi-label variants have different guided
            generation overhead.
          </t>
        </section>

        <section anchor="sfqa" numbered="true" toc="default">
          <name>SFQA: Short-Form QA / Factoid</name>
          <t>
            The model receives a short question. It may
            also receive brief context. It produces a concise factual answer.
            Examples are customer support FAQ, knowledge base queries, and
            simple information retrieval. It differs from long-document QA
            (LDQA) in that the input context is short.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Balanced (BA) to Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Interactive (IN) to Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>16 to 512 tokens</td></tr>
              <tr><td>Output Length</td><td>5 to 100 tokens</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>TTFT if streamed</li>
            <li>Total request latency</li>
            <li>Requests per second under concurrency</li>
          </ul>
          <t>
            This profile represents the simple API call pattern. It is the
            baseline for many LLM applications. It SHOULD be used as a
            reference workload to establish baseline system performance
            before testing more complex profiles. Input length SHOULD
            include very short queries (fewer than 32 tokens) and context-
            augmented queries (256 to 512 tokens).
          </t>
        </section>

        <section anchor="rank" numbered="true" toc="default">
          <name>RANK: Generative Ranking / Scoring</name>
          <t>
            The model evaluates one or more candidates. It
            produces a relevance score, a preference ordering, or a
            comparative judgment as generated tokens. It differs from cross-
            encoder reranking (XRANK) in that it generates token output such
            as scores, explanations, or ordinal labels. XRANK produces a raw
            scalar from the model's hidden state.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Structured (ST) to Minimal (MI)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Burst-Batch (BB)</td></tr>
              <tr><td>Prefix Sharing</td><td>Document-Shared (D)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>256 to 4096 tokens (prompt plus candidates)</td></tr>
              <tr><td>Output Length</td><td>5 to 50 tokens (scores or ranking)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Time to score the full candidate set</li>
            <li>Requests per second</li>
            <li>Prefix cache hit rate across candidates sharing context</li>
          </ul>
          <t>
            Ranking workloads have strong prefix sharing. When ranking N
            candidates against a shared query or document, the shared prefix
            may be 80 to 90 percent of total input tokens. Benchmarks MUST
            report prefix cache hit rates. They SHOULD compare performance
            with and without prefix caching. Candidate set sizes of 5, 10,
            20, and 50 SHOULD be tested.
          </t>
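          <t>
            The prefix sharing arithmetic above can be sketched as
            follows. The function names and the 3500/500 token split are
            illustrative assumptions; with a 3500-token shared context
            and 500-token candidates, the shared prefix is 87.5 percent
            of each request's input.
          </t>

```python
# Non-normative sketch: prefix sharing when ranking n candidates
# against one shared query or document. All names are assumptions.
def shared_fraction(shared_tokens, per_candidate_tokens):
    """Fraction of each request's input covered by the shared prefix."""
    return shared_tokens / (shared_tokens + per_candidate_tokens)

def prefill_tokens(shared_tokens, per_candidate_tokens, n_candidates,
                   prefix_cache):
    """Input tokens that must actually be prefilled to score the set."""
    if prefix_cache:
        # The shared prefix is computed once and reused per candidate.
        return shared_tokens + n_candidates * per_candidate_tokens
    return n_candidates * (shared_tokens + per_candidate_tokens)
```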
        </section>

        <section anchor="ner" numbered="true" toc="default">
          <name>NER: Entity Resolution and Recognition</name>
          <t>
            The model reads input text. It identifies named
            entities and produces a structured list of entity mentions with
            their types and positions. This is used for information
            extraction from documents, log parsing, and data enrichment
            pipelines.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Structured (ST)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Sustained-High (SH)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>64 to 2048 tokens</td></tr>
              <tr><td>Output Length</td><td>20 to 200 tokens (structured entity list)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Documents processed per second</li>
            <li>Total request latency by input length</li>
            <li>Throughput under batch processing</li>
          </ul>
          <t>
            NER output length varies with input content density. Benchmarks
            SHOULD use representative documents with known entity density.
            This ensures output length distributions are realistic. Guided
            generation to a JSON schema is common. Whether it is enabled
            SHOULD be reported.
          </t>
        </section>

        <section anchor="func" numbered="true" toc="default">
          <name>FUNC: Function Calling / Tool Selection</name>
          <t>
            The model receives a user request and a set of
            tool or function definitions. It selects the right function and
            produces well-formed arguments. The tool schema definitions form
            a large and highly cacheable prefix.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Structured (ST)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>Schema-Shared (H)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>512 to 4096 tokens (schema plus user query)</td></tr>
              <tr><td>Output Length</td><td>20 to 200 tokens (function name plus JSON arguments)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Total request latency</li>
            <li>TTFT as an indicator of schema processing time</li>
            <li>Prefix cache hit rate for shared tool schemas</li>
          </ul>
          <t>
            Tool schemas vary widely in size. A small schema with 5 tools
            may be a few hundred tokens. A large schema with 50 tools and
            complex parameters may be several thousand tokens. Benchmarks
            SHOULD test with schema sizes of approximately 500, 2000, and
            4000 tokens. Tool schemas are highly shareable as prefixes.
            Prefix caching effectiveness is a critical measurement.
            Constrained decoding to valid JSON is standard and SHOULD be
            enabled.
          </t>
        </section>

        <section anchor="sqln" numbered="true" toc="default">
          <name>SQLN: SQL / Query Generation</name>
          <t>
            The model receives a natural language question
            and a database schema definition. It produces a syntactically
            valid SQL query. The schema prefix is large and shared across
            queries against the same database.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Structured (ST)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI) to Burst-Batch (BB)</td></tr>
              <tr><td>Prefix Sharing</td><td>Schema-Shared (H)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>256 to 8192 tokens (schema plus question)</td></tr>
              <tr><td>Output Length</td><td>20 to 300 tokens (SQL query)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Total request latency</li>
            <li>Prefix cache effectiveness for shared schema</li>
            <li>Requests per second under concurrent users on the same schema</li>
          </ul>
          <t>
            Database schemas range from simple to complex. A simple schema
            may have 5 tables and around 500 tokens. A complex schema may
            have 100 or more tables and 8000 or more tokens. Benchmarks
            SHOULD test with at least two schema sizes. Multiple queries
            against the same schema are the dominant production pattern.
            Prefix caching measurements are critical for this profile.
          </t>
        </section>
      </section>

      <section anchor="group-c" numbered="true" toc="default">
        <name>Group C: Interactive Streaming Profiles</name>
        <t>
          Interactive streaming profiles serve users who watch tokens
          arrive in real time. These profiles share four key properties:
        </t>
        <ul>
          <li>TTFT is critical for perceived responsiveness.</li>
          <li>ITL consistency determines perceived fluency.</li>
          <li>Multi-turn context accumulation is common.</li>
          <li>Prefix caching across conversation turns adds value.</li>
        </ul>
        <t>
          For Group C profiles, all streaming metrics from <xref target="LLM-TERMS"/>
          apply. These are TTFT, ITL, TPOT, and output token throughput.
          The Throughput-Latency Tradeoff test from <xref target="LLM-METHOD"/> is
          especially important for this group.
        </t>
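        <t>
          As a non-normative sketch, the per-request streaming metrics
          can be computed from output token arrival timestamps as
          follows. The function name is an assumption, and TPOT is
          computed here under one common definition: mean inter-token
          time after the first token.
        </t>

```python
# Non-normative sketch: TTFT, ITL, and TPOT for one streamed
# response, from its output token arrival timestamps in seconds.
def streaming_metrics(request_start_s, token_times_s):
    """token_times_s holds the arrival time of each output token."""
    ttft_s = token_times_s[0] - request_start_s
    itl_s = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    # TPOT here: mean inter-token time over the decode phase.
    decode_span_s = token_times_s[-1] - token_times_s[0]
    tpot_s = decode_span_s / max(1, len(token_times_s) - 1)
    return {"ttft_s": ttft_s, "itl_s": itl_s, "tpot_s": tpot_s}
```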

        <section anchor="chat" numbered="true" toc="default">
          <name>CHAT: Conversational Chat</name>
          <t>
            A user and the model have a multi-turn
            conversation. Each turn adds to the conversation history. This
            creates a growing input context. This is the most common
            interactive LLM workload. It is the default profile for general-
            purpose chat applications.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Balanced (BA)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Interactive (IN)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>Turn-Accumulated (T)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>64 to 16384 tokens (grows with turns)</td></tr>
              <tr><td>Output Length</td><td>50 to 1000 tokens</td></tr>
              <tr><td>Turns per conversation</td><td>3 to 20</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>TTFT at various context lengths</li>
            <li>P50 and P99 ITL</li>
            <li>TTFT degradation as conversation grows</li>
            <li>Prefix cache hit rate across turns</li>
          </ul>
          <t>
            Chat benchmarks MUST test across multiple conversation depths.
            This shows how TTFT scales as context grows. Tests at turns 1,
            5, 10, and 20 SHOULD be included at minimum. Input data SHOULD
            come from real multi-turn conversation datasets such as ShareGPT
            <xref target="SHAREGPT"/> or LMSYS-Chat <xref target="LMSYS"/>. This captures realistic length
            distributions. The time between turns, while the user reads and
            types, affects cache eviction pressure. This time SHOULD be set
            as a parameter in the benchmark.
          </t>
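          <t>
            The turn-by-turn context growth described above can be
            sketched as follows; the per-turn token counts in the usage
            note are illustrative assumptions.
          </t>

```python
# Non-normative sketch: prompt length submitted at each CHAT turn,
# where each prompt is the accumulated history plus the new input.
def context_growth(turn_input_tokens, turn_output_tokens):
    lengths, history = [], 0
    for user_toks, model_toks in zip(turn_input_tokens,
                                     turn_output_tokens):
        lengths.append(history + user_toks)
        history += user_toks + model_toks
    return lengths
```

          <t>
            For example, per-turn inputs of 50, 40, and 60 tokens with
            responses of 200, 300, and 100 tokens yield prompts of 50,
            290, and 650 tokens. At each turn, the previously submitted
            history is the cacheable prefix for the next request.
          </t>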
        </section>

        <section anchor="comp" numbered="true" toc="default">
          <name>COMP: Code Completion / Autocomplete</name>
          <t>
            The model receives a code context. This
            includes the file content, cursor position, and surrounding code.
            It generates a short completion. This profile has very tight
            latency requirements, very short output, and high request
            frequency. Prefix sharing is strong because the same file
            context appears in many consecutive requests.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD) to Balanced (BA)</td></tr>
              <tr><td>Output Constraint</td><td>Semi-Structured (SS)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Interactive (IN)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>Document-Shared (D)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>256 to 8192 tokens (file context)</td></tr>
              <tr><td>Output Length</td><td>5 to 200 tokens (completion)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>TTFT (primary). Interactive deployments often target sub-200ms TTFT. Benchmarks SHOULD report TTFT at P50, P95, and P99.</li>
            <li>Total completion latency</li>
            <li>Prefix cache hit rate (consecutive edits share most context)</li>
          </ul>
          <t>
            Code completion has the tightest TTFT requirements of any
            profile. Benchmarks SHOULD simulate realistic editing patterns.
            In these patterns, consecutive requests share 95 percent or more
            of input tokens with small modifications. Fill-in-the-middle
            (FIM) variants SHOULD be tested if the serving system supports
            them. In FIM, the model generates tokens to fill a gap within
            existing code rather than appending to the end. Both single-line
            and multi-line completions SHOULD be measured.
          </t>
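          <t>
            As a non-normative sketch, the shared-prefix fraction
            between two consecutive completion requests, which
            upper-bounds the achievable prefix cache hit rate, can be
            computed as follows.
          </t>

```python
# Non-normative sketch: longest-common-prefix fraction between two
# consecutive completion requests, an upper bound on cache hits.
def common_prefix_fraction(prev_tokens, curr_tokens):
    n = 0
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        n += 1
    return n / len(curr_tokens)
```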
        </section>

        <section anchor="pers" numbered="true" toc="default">
          <name>PERS: Dialogue / Roleplay / Persona</name>
          <t>
            The model maintains a persistent character or
            persona defined by a large system prompt. This differs from
            generic chat. System prompts may be 2000 to 10000 tokens or
            more. They include character descriptions, world-building
            context, and behavioral guidelines. System prompt prefix caching
            is important when many users share the same persona.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Balanced (BA)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Interactive (IN)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S) plus Turn-Accumulated (T)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>System Prompt</td><td>2000 to 10000 tokens</td></tr>
              <tr><td>User Turn Input</td><td>16 to 256 tokens</td></tr>
              <tr><td>Output Length</td><td>100 to 2000 tokens</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>TTFT with large system prompts</li>
            <li>System prompt prefix cache hit rate across concurrent users</li>
            <li>ITL consistency during long responses</li>
          </ul>
          <t>
            The key benchmark for this profile is performance under large
            shared system prompts with many concurrent users. When 1000
            users share the same persona, the system prompt KV cache should
            be computed once. Benchmarks SHOULD measure TTFT and throughput
            as a function of the number of concurrent users sharing the same
            system prompt. Results with and without prefix caching SHOULD
            both be reported.
          </t>
        </section>
      </section>

      <section anchor="group-d" numbered="true" toc="default">
        <name>Group D: Prefill-Heavy Generative Profiles</name>
        <t>
          Prefill-heavy profiles process long input contexts. They generate
          moderate-length output. The prefill phase dominates compute and
          memory requirements. These workloads stress KV cache capacity,
          long-context attention efficiency, and memory management.
        </t>
        <t>
          The most important metrics for Group D profiles are TTFT (heavily
          influenced by prefill duration), prefill throughput in input tokens
          per second, memory utilization at peak context length, and total
          request latency.
        </t>

        <section anchor="summ" numbered="true" toc="default">
          <name>SUMM: Summarization</name>
          <t>
            The model reads a document or set of passages.
            It produces a condensed summary. This is one of the most common
            production LLM tasks. The input is often 10 to 100 times longer
            than the output. The prefill phase dominates total latency.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF) to Semi-Structured (SS)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR) to Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI) to Sustained-High (SH)</td></tr>
              <tr><td>Prefix Sharing</td><td>None (N)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>1024 to 32768 tokens</td></tr>
              <tr><td>Output Length</td><td>64 to 1024 tokens</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>TTFT by input length bucket</li>
            <li>Total request latency</li>
            <li>Prefill throughput in input tokens per second</li>
            <li>Documents summarized per second</li>
          </ul>
          <t>
            Summarization performance depends strongly on input length.
            Benchmarks MUST report results in input length buckets. Example
            buckets are 1K to 4K, 4K to 8K, 8K to 16K, and 16K to 32K
            tokens. Input data SHOULD come from document corpora with
            realistic length distributions rather than synthetic text. Both
            extractive summarization (shorter output) and abstractive
            summarization (longer output) SHOULD be tested.
          </t>
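          <t>
            Grouping results into the input length buckets above can be
            sketched as follows; the helper names are assumptions of
            this sketch.
          </t>

```python
# Non-normative sketch: grouping (input_tokens, latency_s) results
# into the input length buckets named above. Names are assumptions.
BUCKETS = [(1024, 4096), (4096, 8192), (8192, 16384), (16384, 32768)]

def bucket_results(records):
    """Map each record into its bucket; out-of-range records drop."""
    out = {f"{lo}-{hi}": [] for lo, hi in BUCKETS}
    for toks, latency_s in records:
        for lo, hi in BUCKETS:
            if toks in range(lo, hi):  # half-open integer range
                out[f"{lo}-{hi}"].append(latency_s)
                break
    return out
```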
        </section>

        <section anchor="ldqa" numbered="true" toc="default">
          <name>LDQA: Long Document Question Answering</name>
          <t>
            The model receives a long document and a
            question. It produces a concise answer from the document content.
            It differs from summarization in that the output is shorter and
            more targeted. It differs from short-form QA in that it has a
            large context document.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI) to Burst-Batch (BB)</td></tr>
              <tr><td>Prefix Sharing</td><td>Document-Shared (D)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>4096 to 131072 tokens</td></tr>
              <tr><td>Output Length</td><td>16 to 512 tokens</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>TTFT at various context lengths</li>
            <li>TTFT scaling as context length doubles</li>
            <li>Total request latency</li>
            <li>Prefix cache effectiveness across multiple questions</li>
          </ul>
          <t>
            Long document QA tests the interaction between context length
            and latency. Benchmarks SHOULD test with multiple questions over
            the same document. This measures prefix caching effectiveness.
            Context lengths SHOULD span at least a 32x range, for
            example 4K, 16K, 64K, and 128K tokens. The position
            of the answer within the document can affect performance on some
            systems. This MAY be reported.
          </t>
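          <t>
            When runs are made at doubled context lengths, the TTFT
            scaling per doubling can be summarized as sketched below.
            The function name and sample values are assumptions; as a
            rough guide, a ratio near 2 per doubling suggests linear
            prefill cost, while a ratio near 4 suggests quadratic
            attention cost.
          </t>

```python
# Non-normative sketch: TTFT growth factor per context doubling.
# ttft_by_ctx maps context length in tokens to measured TTFT in
# seconds; the sample values used below are assumptions.
def doubling_factors(ttft_by_ctx):
    ctxs = sorted(ttft_by_ctx)
    factors = {}
    for a, b in zip(ctxs, ctxs[1:]):
        if b == 2 * a:  # only compare runs exactly one doubling apart
            factors[b] = ttft_by_ctx[b] / ttft_by_ctx[a]
    return factors
```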
        </section>

        <section anchor="mdoc" numbered="true" toc="default">
          <name>MDOC: Multi-Document Synthesis</name>
          <t>
            The model receives multiple documents. It
            produces one output that synthesizes information across all of
            them. Examples are literature review, competitive analysis, and
            multi-source report generation. It differs from single-document
            summarization in that it requires reasoning across documents. It
            differs from retrieval-augmented generation (RAGN) in that
            there is no retrieval step. All
            documents are provided in full.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Semi-Structured (SS)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>None (N)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>8192 to 131072 tokens (multiple documents)</td></tr>
              <tr><td>Output Length</td><td>256 to 4096 tokens</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>TTFT at extreme context lengths</li>
            <li>Total request latency</li>
            <li>Memory utilization at peak context</li>
            <li>Output throughput during the decode phase</li>
          </ul>
          <t>
            Multi-document synthesis pushes context length to its maximum.
            Benchmarks SHOULD report the number of documents and the per-
            document length separately. This shows whether performance
            depends on total token count alone or also on document
            structure. This profile is a stress test for memory management
            and long-context attention implementations.
          </t>
        </section>

        <section anchor="extr" numbered="true" toc="default">
          <name>EXTR: Structured Extraction</name>
          <t>
            The model reads an input document. It extracts
            specific information into a defined output schema. Examples are
            invoice parsing, resume extraction, form processing, and log
            analysis. The output MUST conform to a specified JSON schema,
            XML structure, or tabular format.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Structured (ST)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR) to Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Sustained-High (SH)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>256 to 16384 tokens</td></tr>
              <tr><td>Output Length</td><td>50 to 500 tokens (structured JSON or XML)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Documents processed per second</li>
            <li>Total request latency</li>
            <li>Guided generation overhead vs. an unconstrained baseline</li>
          </ul>
          <t>
            Extraction workloads typically use constrained or guided decoding
            to ensure the output conforms to the schema. Benchmarks MUST
            report whether guided generation is enabled. They MUST report
            the schema complexity including the number of fields and nesting
            depth. Benchmarks SHOULD compare throughput with and without
            guided generation. This shows the overhead. Output validation
            rates, meaning the percentage of outputs that conform to the
            schema, SHOULD be reported alongside performance metrics.
          </t>
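          <t>
            The guided generation overhead and the output validation
            rate can be computed as sketched below; the function names
            and the validator predicate are assumptions of this sketch.
          </t>

```python
# Non-normative sketch: guided-generation overhead and schema
# validation rate. Function names and fields are assumptions.
def guided_overhead(tps_unconstrained, tps_guided):
    """Relative throughput lost by enabling guided generation."""
    return 1.0 - tps_guided / tps_unconstrained

def validation_rate(outputs, is_valid):
    """Fraction of outputs that conform, per a caller-supplied
    validator predicate."""
    conforming = sum(1 for o in outputs if is_valid(o))
    return conforming / len(outputs)
```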
        </section>

        <section anchor="judg" numbered="true" toc="default">
          <name>JUDG: LLM-as-Judge / Evaluation</name>
          <t>
            The model evaluates another model's output
            against specified criteria. It produces a score and optionally
            a justification. This is used in production for quality
            assurance, A/B testing, content moderation review, and RLHF
            reward modeling. The input includes the original prompt, the
            response to evaluate, and the scoring criteria.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD)</td></tr>
              <tr><td>Output Constraint</td><td>Structured (ST) to Semi-Structured (SS)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Sustained-High (SH)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>512 to 8192 tokens (criteria plus prompt plus response)</td></tr>
              <tr><td>Output Length</td><td>10 to 300 tokens (score plus reasoning)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Evaluations per second</li>
            <li>Total request latency</li>
            <li>Throughput under sustained batch load</li>
          </ul>
          <t>
            LLM-as-Judge is a high-volume pipeline workload. Benchmarks
            SHOULD simulate evaluation of a corpus of responses. They SHOULD
            measure sustained throughput over thousands of evaluations. The
            scoring criteria form a cacheable system prompt prefix. Both
            score-only mode (minimal output) and score-with-reasoning mode
            (longer output) SHOULD be tested.
          </t>
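As a non-normative illustration, the pattern above can be sketched in Python. The field names and the 1-to-10 scale below are hypothetical; only the shared criteria prefix and the two output modes come from this profile.

```python
# Non-normative sketch of a JUDG request builder.  The shared criteria
# string is the cacheable system-prompt prefix described above; field
# names and the scoring scale are illustrative assumptions.
def build_judge_request(criteria: str, prompt: str, response: str,
                        with_reasoning: bool) -> dict:
    instruction = ("Return a score from 1 to 10 followed by a short "
                   "justification." if with_reasoning
                   else "Return only a score from 1 to 10.")
    return {
        # Identical across the corpus, so a prefix cache can reuse it.
        "system": criteria + "\n" + instruction,
        "user": ("Original prompt:\n" + prompt +
                 "\n\nResponse to evaluate:\n" + response),
        # Score-only mode needs few tokens; reasoning mode needs more.
        "max_output_tokens": 300 if with_reasoning else 10,
    }
```

A benchmark run would iterate this builder over the response corpus in both modes, measuring evaluations per second.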
        </section>

        <section anchor="ragn" numbered="true" toc="default">
          <name>RAGN: Retrieval-Augmented Generation</name>
          <t>
            The model receives retrieved context passages
            and a user query. It generates a grounded response. It differs
            from other prefill-heavy profiles in that the context consists
            of multiple retrieved chunks concatenated into a single prompt.
            It differs from LDQA in that the input context is shorter and
            consists of
            separately retrieved passages rather than a single contiguous
            document.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Prefill-Dominant (PD) to Balanced (BA)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Interactive (IN) to Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Retrieved Context</td><td>512 to 8192 tokens (3 to 20 chunks)</td></tr>
              <tr><td>User Query</td><td>16 to 256 tokens</td></tr>
              <tr><td>Total Input</td><td>528 to 8448 tokens</td></tr>
              <tr><td>Output Length</td><td>64 to 1024 tokens</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>TTFT (the user is typically waiting interactively)</li>
            <li>Total request latency including time to first useful content</li>
            <li>End-to-end latency including retrieval if measured at the compound system boundary</li>
          </ul>
          <t>
            RAG benchmarks SHOULD vary the number and size of retrieved
            chunks. This shows how performance scales as context grows.
            Configurations of 3, 5, 10, and 20 retrieved chunks SHOULD be
            tested. The SUT boundary from <xref target="LLM-METHOD"/> is
            especially important for RAG. Measuring at the model engine
            boundary excludes retrieval latency. Measuring at the compound
            system boundary includes it. Both SHOULD be reported when the
            retrieval system is part of the SUT.
          </t>
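A non-normative sketch of how a RAG load generator might assemble prompts for the chunk-count sweep above (the prompt template is an illustrative assumption, not a specified format):

```python
def assemble_rag_prompt(system_prompt, chunks, query):
    """Concatenate retrieved chunks and the user query into one prompt.

    Only system_prompt repeats across requests, so only that leading
    portion is a candidate for prefix caching; the chunk block varies
    per query.
    """
    context = "\n\n".join(
        "[%d] %s" % (i + 1, c) for i, c in enumerate(chunks))
    return (system_prompt + "\n\nContext:\n" + context +
            "\n\nQuestion: " + query)

# Sweep the chunk-count configurations recommended above.
prompts = {n: assemble_rag_prompt("Answer from the context only.",
                                  ["chunk %d" % i for i in range(n)],
                                  "What is X?")
           for n in (3, 5, 10, 20)}
```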
        </section>
      </section>

      <section anchor="group-e" numbered="true" toc="default">
        <name>Group E: Decode-Heavy Generative Profiles</name>
        <t>
          Decode-heavy profiles generate much more output than input. The
          decode phase dominates total latency and compute. These workloads
          stress memory bandwidth, ITL consistency, and sustained output
          token throughput.
        </t>
        <t>
          For Group E profiles, output token throughput and ITL
          distribution are the primary metrics. TTFT is less important
          relative to total generation time.
        </t>

        <section anchor="cgen" numbered="true" toc="default">
          <name>CGEN: Content Generation</name>
          <t>
            The model receives a short prompt or
            specification. It generates long-form content. Examples are blog
            posts, articles, marketing copy, emails, reports, and creative
            writing. Output is much longer than input and is free-form.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Decode-Dominant (DD)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Interactive (IN) to Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>32 to 1024 tokens</td></tr>
              <tr><td>Output Length</td><td>256 to 8192 tokens</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Output token throughput in tokens per second</li>
            <li>P50 and P99 ITL</li>
            <li>ITL consistency over long generation (stutter detection)</li>
            <li>Total generation time</li>
          </ul>
          <t>
            Long-form generation is the main test of sustained decode
            performance. Benchmarks SHOULD measure ITL distribution across
            the full generation. Measuring only the initial tokens is not
            sufficient. Decode performance can degrade as the KV cache
            grows. Plotting ITL against output position is useful for
            detecting decode slowdown. Output lengths of 256, 1024, 2048,
            and 4096 or more tokens SHOULD be tested.
          </t>
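The ITL-versus-position analysis can be sketched as follows. This is a non-normative example; a least-squares slope over the ITL series is one simple way (among others) to quantify decode slowdown:

```python
def itl_series(token_timestamps):
    """Inter-token latency at each output position (length n-1)."""
    return [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]

def decode_slowdown_slope(itls):
    """Least-squares slope of ITL versus output position.

    A positive slope indicates decode slowing down as the KV cache
    grows; near zero indicates stable per-token latency.
    """
    n = len(itls)
    x_mean = (n - 1) / 2
    y_mean = sum(itls) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(itls))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den
```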
        </section>

        <section anchor="reas" numbered="true" toc="default">
          <name>REAS: Reasoning / Chain-of-Thought</name>
          <t>
            The model performs extended step-by-step
            reasoning before producing a final answer. This profile has
            grown in importance with reasoning-optimized models that produce
            explicit or hidden thinking tokens. The input is short. The
            output is very long, often 10 to 100 times the length of the
            final answer. Most of the output is intermediate reasoning steps.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Decode-Dominant (DD)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF) to Semi-Structured (SS)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>64 to 2048 tokens</td></tr>
              <tr><td>Reasoning Output</td><td>512 to 32768 tokens (thinking or CoT)</td></tr>
              <tr><td>Final Answer</td><td>32 to 512 tokens</td></tr>
              <tr><td>Total Output</td><td>544 to 33280 tokens</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Total generation time (may be 30 to 120 seconds or more)</li>
            <li>Output token throughput sustained over long generation</li>
            <li>ITL consistency over extended decode</li>
            <li>Time to final answer if thinking tokens are hidden</li>
          </ul>
          <t>
            Reasoning workloads produce the longest output sequences of any
            profile. This creates unique stress on KV cache management
            during decode. Hidden thinking tokens are a
            serving-system-specific feature. Benchmarks MUST distinguish between visible
            tokens and hidden thinking tokens if the serving system supports
            this. Serving systems that hide thinking tokens MUST document
            whether max_output_tokens applies to visible tokens only, to
            total tokens, or to both. This MUST be reported in the run
            parameters defined in <xref target="common-params"/>. Time to first visible token,
            which comes after reasoning completes, is a meaningful metric.
            It differs from standard TTFT and SHOULD be reported when
            thinking tokens are hidden. Benchmarks SHOULD test with
            reasoning budgets of 1K, 4K, 16K, and 32K tokens. Memory
            pressure during extended reasoning is a critical measurement.
          </t>
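A non-normative sketch of computing both metrics from a token event stream. The two-kind event representation is an illustrative assumption about what a benchmark client can observe:

```python
def first_token_latencies(events, request_start):
    """Standard TTFT and time to first visible token (TTFVT).

    events is an ordered list of (timestamp, kind) pairs with kind
    "thinking" or "visible".  When thinking tokens are hidden, TTFVT,
    not TTFT, reflects user-perceived responsiveness, because the
    first visible token arrives only after reasoning completes.
    """
    ttft = events[0][0] - request_start if events else None
    ttfvt = None
    for ts, kind in events:
        if kind == "visible":
            ttfvt = ts - request_start
            break
    return ttft, ttfvt
```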
        </section>

        <section anchor="dgen" numbered="true" toc="default">
          <name>DGEN: Data Generation / Synthetic Data</name>
          <t>
            The model receives a schema or specification.
            It generates multiple structured records. This is used for test
            data generation, training data augmentation, simulation, and
            content pipelines. The output is repetitive, structured, and
            produced at high volume.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Decode-Dominant (DD)</td></tr>
              <tr><td>Output Constraint</td><td>Structured (ST)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Sustained-High (SH)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>128 to 2048 tokens (schema plus instructions)</td></tr>
              <tr><td>Output Length</td><td>512 to 16384 tokens (multiple records)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Output tokens per second (sustained)</li>
            <li>Records generated per second</li>
            <li>Throughput under high concurrency</li>
          </ul>
          <t>
            Data generation is a throughput-oriented profile. TTFT and ITL
            are not primary metrics here. Benchmarks SHOULD measure
            sustained output throughput over large generation batches of
            10K or more records. Request latency is still measurable and
            MAY be reported as a secondary metric. Guided generation
            constrained to a JSON or CSV schema is standard practice for
            this workload and SHOULD be enabled. Schema
            validation rate SHOULD be reported.
          </t>
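A non-normative sketch of computing the records-per-second and validation-rate metrics for JSONL output. Key-presence checking stands in here for full JSON Schema validation:

```python
import json

def dgen_metrics(output_lines, required_keys, elapsed_s):
    """Records per second and schema validation rate for JSONL output.

    A record counts as valid if it parses as a JSON object and
    contains every required key (a simplification of full JSON Schema
    validation).
    """
    valid = 0
    for line in output_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and all(k in record
                                            for k in required_keys):
            valid += 1
    total = len(output_lines)
    return {
        "records_per_s": total / elapsed_s,
        "validation_rate": valid / total if total else 0.0,
    }
```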
        </section>

        <section anchor="trns" numbered="true" toc="default">
          <name>TRNS: Translation</name>
          <t>
            The model translates text from one natural
            language to another. The input-to-output ratio is approximately
            1:1, varying by language pair. The full input content is
            consumed in producing the output.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Balanced (BA)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR) to Throughput-Oriented (TO)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI) to Sustained-High (SH)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>32 to 8192 tokens</td></tr>
              <tr><td>Output Length</td><td>32 to 10000 tokens</td></tr>
              <tr><td>I/O Ratio</td><td>0.5:1 to 2:1 depending on language pair</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Total request latency by segment length</li>
            <li>Output tokens per second</li>
            <li>Throughput in segments translated per second</li>
          </ul>
          <t>
            Token-to-token ratio varies by language pair because tokenizer
            coverage differs across languages. Benchmarks MUST specify the
            language pair and tokenizer used. CJK languages typically
            produce more tokens per semantic unit than European languages.
            Benchmarks SHOULD test with at least two language pairs to
            expose tokenizer effects. Segment lengths SHOULD span
            sentence-level (32 to 64 tokens), paragraph-level (128 to 512
            tokens), and document-level (2000 or more tokens).
          </t>
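As a non-normative sketch, the per-language-pair token ratio can be aggregated as follows, given token counts already produced by the specified tokenizer:

```python
def ratio_by_language_pair(samples):
    """Aggregate output/input token ratio per language pair.

    samples holds (language_pair, input_tokens, output_tokens) triples.
    The ratio is computed over total tokens rather than averaged per
    segment, so short segments do not dominate the estimate.
    """
    totals = {}
    for pair, n_in, n_out in samples:
        acc = totals.setdefault(pair, [0, 0])
        acc[0] += n_in
        acc[1] += n_out
    return {pair: acc[1] / acc[0] for pair, acc in totals.items()}
```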
        </section>

        <section anchor="edit" numbered="true" toc="default">
          <name>EDIT: Rewriting / Editing</name>
          <t>
            The model receives input text. It produces a
            transformed version of the same text. Examples are grammar
            correction, tone adjustment, style transfer, simplification,
            and paraphrasing. It differs from translation in that it works
            within the same language. It differs from summarization in that
            the output length is similar to the input length.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Balanced (BA)</td></tr>
              <tr><td>Output Constraint</td><td>Free-Form (FF)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Single-Stream (SI)</td></tr>
              <tr><td>Prefix Sharing</td><td>System-Prompt (S)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>64 to 4096 tokens</td></tr>
              <tr><td>Output Length</td><td>64 to 4096 tokens (similar to input)</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Total request latency</li>
            <li>Output token throughput</li>
            <li>TTFT if streamed to the user for review</li>
          </ul>
          <t>
            Rewriting produces output structurally similar to input. Some
            serving systems may use this similarity to enable speculative
            decoding or draft-based approaches. Benchmarks SHOULD report
            whether such optimizations are enabled.
          </t>
        </section>
      </section>

      <section anchor="group-f" numbered="true" toc="default">
        <name>Group F: Multi-Step Chained Profiles</name>
        <t>
          Multi-step profiles involve sequences of dependent inference
          calls. Each call depends on the output of the previous call.
          Total user-perceived latency is the sum of all step latencies.
          These workloads test end-to-end pipeline performance rather than
          single-request optimization.
        </t>
        <t>
          For Group F profiles, the primary metric is end-to-end pipeline
          latency. Individual step TTFT and total latency contribute to
          this but are not sufficient on their own.
        </t>

        <section anchor="agnt" numbered="true" toc="default">
          <name>AGNT: Agentic / Tool Use</name>
          <t>
            The model operates in a loop. It receives an
            observation, decides on an action, receives the tool result, and
            continues until the task is complete. Each iteration is a
            separate inference call. The context grows across iterations as
            prior observations and actions accumulate.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Balanced (BA) per step</td></tr>
              <tr><td>Output Constraint</td><td>Structured (ST) per tool call; Free-Form (FF) for final answer</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR) to Interactive (IN)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Chained-Sequential (CS)</td></tr>
              <tr><td>Prefix Sharing</td><td>Turn-Accumulated (T)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution (per step):</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>512 to 16384 tokens (grows with steps)</td></tr>
              <tr><td>Output Length</td><td>20 to 500 tokens per step</td></tr>
              <tr><td>Steps per task</td><td>3 to 20</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>End-to-end task completion time</li>
            <li>Per-step TTFT (compounds across steps)</li>
            <li>Per-step total latency</li>
            <li>TTFT scaling as context accumulates across steps</li>
            <li>Prefix cache hit rate across steps</li>
          </ul>
          <t>
            Agentic workloads are especially sensitive to per-step TTFT.
            Latencies compound across steps. A 500 ms TTFT per step over 10
            steps adds 5 seconds of TTFT alone to total task completion
            time. Benchmarks MUST simulate realistic multi-step
            trajectories. They MUST include tool call outputs injected
            between steps. Context grows across steps, so prefix caching
            across turns is critical. Benchmarks SHOULD test with task
            lengths of 3, 5, 10, and 20 steps. Both short successful
            trajectories and long worst-case trajectories SHOULD be tested.
          </t>
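The compounding effect can be sketched with a simple non-normative lower-bound model; the parameter values in the test are illustrative, with only the 500 ms-over-10-steps arithmetic taken from the text above:

```python
def trajectory_time(steps, ttft_s, out_tokens_per_step, itl_s,
                    tool_latency_s):
    """Lower-bound end-to-end completion time for a chained task.

    Each step costs TTFT plus decode time (output tokens times ITL);
    tool execution latency is added between consecutive steps.  TTFT
    alone contributes steps * ttft_s, which is why per-step TTFT
    compounds across the trajectory.
    """
    per_step = ttft_s + out_tokens_per_step * itl_s
    return steps * per_step + (steps - 1) * tool_latency_s
```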
        </section>

        <section anchor="plan" numbered="true" toc="default">
          <name>PLAN: Planning / Task Decomposition</name>
          <t>
            The model receives a high-level goal. It
            produces a structured plan with ordered steps, dependencies,
            and resource requirements. This is often the first step of an
            agentic pipeline. It may involve iterative refinement where the
            model reviews and revises its own plan.
          </t>
          <t>Classification Vector:</t>
          <table align="left">
            <thead><tr><th>Dimension</th><th>Value</th></tr></thead>
            <tbody>
              <tr><td>I/O Ratio</td><td>Balanced (BA) to Decode-Dominant (DD)</td></tr>
              <tr><td>Output Constraint</td><td>Semi-Structured (SS)</td></tr>
              <tr><td>Latency Sensitivity</td><td>Near-Real-Time (NR)</td></tr>
              <tr><td>Concurrency Pattern</td><td>Chained-Sequential (CS)</td></tr>
              <tr><td>Prefix Sharing</td><td>Turn-Accumulated (T)</td></tr>
            </tbody>
          </table>
          <t>Typical Token Distribution:</t>
          <table align="left">
            <thead><tr><th>Parameter</th><th>Range</th></tr></thead>
            <tbody>
              <tr><td>Input Length</td><td>128 to 4096 tokens (goal plus context)</td></tr>
              <tr><td>Output Length</td><td>256 to 2048 tokens (structured plan)</td></tr>
              <tr><td>Refinement Rounds</td><td>1 to 3</td></tr>
            </tbody>
          </table>
          <t>Key Performance Indicators:</t>
          <ul>
            <li>Total planning time including refinement rounds</li>
            <li>Per-round latency</li>
            <li>Output token throughput</li>
          </ul>
          <t>
            Planning with iterative refinement requires the model to evaluate
            and modify its own prior output. Benchmarks SHOULD test both
            single-pass planning and multi-round refinement. Plan quality is
            outside the scope of performance benchmarking. The time to
            produce a plan of a given length and structure is measurable
            and SHOULD be reported.
          </t>
        </section>
      </section>
    </section>

    <section anchor="applicability" numbered="true" toc="default">
      <name>Profile-to-Metric Applicability Matrix</name>
      <t>
        The matrix below maps each profile to the applicable metrics
        from <xref target="LLM-TERMS"/>. "Primary" means the most important metric
        for that profile. "Applicable" means it is meaningful but
        secondary. "-" means it does not apply.
      </t>
      <t>
        TTFT and ITL are streaming-only metrics. They apply only when
        streaming is enabled per <xref target="streaming"/>. When streaming is
        disabled, only total request latency is measurable.
      </t>
      <t>
        For Group F profiles, "Request Latency" means end-to-end
        pipeline latency across all steps. "TTFT" means per-step TTFT,
        which compounds across the pipeline.
      </t>
      <t>
        For profiles run in batch mode (<xref target="batch-online"/>),
        request latency is still measurable but is secondary to
        throughput metrics.
      </t>
      <t>
        Sub-table 1 lists applicability for TTFT, ITL, TPOT, and output
        token throughput.
      </t>
      <table align="left">
        <thead>
          <tr>
            <th>Profile</th><th>TTFT</th><th>ITL</th><th>TPOT</th>
            <th>Output Tok/s</th>
          </tr>
        </thead>
        <tbody>
          <tr><td colspan="5"><strong>Group A: Non-Generative</strong></td></tr>
          <tr><td>EMBED</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td>XRANK</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td>LOGPR</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td colspan="5"><strong>Group B: Minimal Output</strong></td></tr>
          <tr><td>CLAS</td><td>Applicable</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td>SFQA</td><td>Primary</td><td>Applicable</td><td>-</td><td>-</td></tr>
          <tr><td>RANK</td><td>Applicable</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td>NER</td><td>Applicable</td><td>-</td><td>Applicable</td><td>-</td></tr>
          <tr><td>FUNC</td><td>Primary</td><td>Applicable</td><td>-</td><td>-</td></tr>
          <tr><td>SQLN</td><td>Applicable</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td colspan="5"><strong>Group C: Interactive</strong></td></tr>
          <tr><td>CHAT</td><td>Primary</td><td>Primary</td><td>Applicable</td><td>Applicable</td></tr>
          <tr><td>COMP</td><td>Primary</td><td>Applicable</td><td>-</td><td>-</td></tr>
          <tr><td>PERS</td><td>Primary</td><td>Primary</td><td>Applicable</td><td>Applicable</td></tr>
          <tr><td colspan="5"><strong>Group D: Prefill-Heavy</strong></td></tr>
          <tr><td>SUMM</td><td>Primary</td><td>Applicable</td><td>Applicable</td><td>Applicable</td></tr>
          <tr><td>LDQA</td><td>Primary</td><td>Applicable</td><td>-</td><td>-</td></tr>
          <tr><td>MDOC</td><td>Primary</td><td>Applicable</td><td>Applicable</td><td>Applicable</td></tr>
          <tr><td>EXTR</td><td>Applicable</td><td>-</td><td>Applicable</td><td>Applicable</td></tr>
          <tr><td>JUDG</td><td>Applicable</td><td>-</td><td>Applicable</td><td>-</td></tr>
          <tr><td>RAGN</td><td>Primary</td><td>Primary</td><td>Applicable</td><td>Applicable</td></tr>
          <tr><td colspan="5"><strong>Group E: Decode-Heavy</strong></td></tr>
          <tr><td>CGEN</td><td>Applicable</td><td>Primary</td><td>Primary</td><td>Primary</td></tr>
          <tr><td>REAS</td><td>Applicable</td><td>Primary</td><td>Primary</td><td>Primary</td></tr>
          <tr><td>DGEN</td><td>-</td><td>-</td><td>Applicable</td><td>Primary</td></tr>
          <tr><td>TRNS</td><td>Applicable</td><td>Applicable</td><td>Primary</td><td>Primary</td></tr>
          <tr><td>EDIT</td><td>Applicable</td><td>Applicable</td><td>Primary</td><td>Applicable</td></tr>
          <tr><td colspan="5"><strong>Group F: Multi-Step</strong></td></tr>
          <tr><td>AGNT</td><td>Primary</td><td>Applicable</td><td>-</td><td>-</td></tr>
          <tr><td>PLAN</td><td>Applicable</td><td>Applicable</td><td>Applicable</td><td>Applicable</td></tr>
        </tbody>
      </table>
      <t>
        Sub-table 2 lists applicability for request latency, request
        throughput, sequence throughput, and prefill throughput.
      </t>
      <table align="left">
        <thead>
          <tr>
            <th>Profile</th><th>Req Latency</th><th>Req/s</th>
            <th>Seq/s</th><th>Prefill Tok/s</th>
          </tr>
        </thead>
        <tbody>
          <tr><td colspan="5"><strong>Group A: Non-Generative</strong></td></tr>
          <tr><td>EMBED</td><td>Primary</td><td>Primary</td><td>Primary</td><td>Applicable</td></tr>
          <tr><td>XRANK</td><td>Primary</td><td>Primary</td><td>Primary</td><td>Applicable</td></tr>
          <tr><td>LOGPR</td><td>Primary</td><td>Applicable</td><td>Primary</td><td>Primary</td></tr>
          <tr><td colspan="5"><strong>Group B: Minimal Output</strong></td></tr>
          <tr><td>CLAS</td><td>Primary</td><td>Primary</td><td>-</td><td>Applicable</td></tr>
          <tr><td>SFQA</td><td>Primary</td><td>Applicable</td><td>-</td><td>-</td></tr>
          <tr><td>RANK</td><td>Primary</td><td>Primary</td><td>-</td><td>Applicable</td></tr>
          <tr><td>NER</td><td>Primary</td><td>Primary</td><td>-</td><td>Applicable</td></tr>
          <tr><td>FUNC</td><td>Primary</td><td>Applicable</td><td>-</td><td>Applicable</td></tr>
          <tr><td>SQLN</td><td>Primary</td><td>Applicable</td><td>-</td><td>Applicable</td></tr>
          <tr><td colspan="5"><strong>Group C: Interactive</strong></td></tr>
          <tr><td>CHAT</td><td>Applicable</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td>COMP</td><td>Primary</td><td>-</td><td>-</td><td>Applicable</td></tr>
          <tr><td>PERS</td><td>Applicable</td><td>-</td><td>-</td><td>Applicable</td></tr>
          <tr><td colspan="5"><strong>Group D: Prefill-Heavy</strong></td></tr>
          <tr><td>SUMM</td><td>Primary</td><td>Applicable</td><td>-</td><td>Primary</td></tr>
          <tr><td>LDQA</td><td>Primary</td><td>Applicable</td><td>-</td><td>Primary</td></tr>
          <tr><td>MDOC</td><td>Primary</td><td>-</td><td>-</td><td>Primary</td></tr>
          <tr><td>EXTR</td><td>Primary</td><td>Primary</td><td>-</td><td>Primary</td></tr>
          <tr><td>JUDG</td><td>Primary</td><td>Primary</td><td>-</td><td>Applicable</td></tr>
          <tr><td>RAGN</td><td>Primary</td><td>-</td><td>-</td><td>Applicable</td></tr>
          <tr><td colspan="5"><strong>Group E: Decode-Heavy</strong></td></tr>
          <tr><td>CGEN</td><td>Applicable</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td>REAS</td><td>Primary</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td>DGEN</td><td>Applicable</td><td>Primary</td><td>-</td><td>-</td></tr>
          <tr><td>TRNS</td><td>Primary</td><td>Applicable</td><td>-</td><td>-</td></tr>
          <tr><td>EDIT</td><td>Primary</td><td>-</td><td>-</td><td>-</td></tr>
          <tr><td colspan="5"><strong>Group F: Multi-Step</strong></td></tr>
          <tr><td>AGNT</td><td>Primary</td><td>-</td><td>-</td><td>Applicable</td></tr>
          <tr><td>PLAN</td><td>Primary</td><td>-</td><td>-</td><td>-</td></tr>
        </tbody>
      </table>
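For tooling, the matrix lends itself to a machine-readable encoding. The following non-normative sketch transcribes four rows of Sub-table 1 as an illustration:

```python
# Partial machine-readable encoding of Sub-table 1 (four rows shown;
# values transcribed from the matrix above, None meaning "-").
PRIMARY, APPLICABLE = "Primary", "Applicable"
METRIC_NAMES = ("TTFT", "ITL", "TPOT", "Output Tok/s")
SUBTABLE_1 = {
    # profile: (TTFT, ITL, TPOT, Output Tok/s)
    "CHAT": (PRIMARY, PRIMARY, APPLICABLE, APPLICABLE),
    "CGEN": (APPLICABLE, PRIMARY, PRIMARY, PRIMARY),
    "DGEN": (None, None, APPLICABLE, PRIMARY),
    "AGNT": (PRIMARY, APPLICABLE, None, None),
}

def primary_metrics(profile):
    """Metrics marked Primary for the given profile."""
    return [name for name, level in
            zip(METRIC_NAMES, SUBTABLE_1[profile]) if level == PRIMARY]
```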
    </section>

    <section anchor="mixed" numbered="true" toc="default">
      <name>Profile Composition for Mixed Workloads</name>
      <t>
        Production LLM deployments rarely serve a single profile. A
        typical deployment might serve 60 percent CHAT, 20 percent RAGN,
        10 percent SUMM, and 10 percent FUNC traffic at the same time.
        Benchmarking with mixed profiles is needed to understand
        real-world performance.
      </t>

      <section anchor="weighted-combinations" numbered="true" toc="default">
        <name>Weighted Profile Combinations</name>
        <t>
          A mixed workload specification MUST include the following: the
          set of profiles included; the traffic weight for each profile as
          a percentage of total requests or total tokens; and whether
          weights are in request count or token volume.
        </t>
        <t>
          Example mixed workload specification:
        </t>
        <table align="left">
          <thead>
            <tr><th>Profile</th><th>Request Weight</th><th>Token Weight (approx.)</th></tr>
          </thead>
          <tbody>
            <tr><td>CHAT</td><td>60%</td><td>45%</td></tr>
            <tr><td>RAGN</td><td>20%</td><td>30%</td></tr>
            <tr><td>SUMM</td><td>10%</td><td>20%</td></tr>
            <tr><td>FUNC</td><td>10%</td><td>5%</td></tr>
          </tbody>
        </table>
        <t>
          Request weight and token weight differ because profiles have
          different average token counts per request. Both SHOULD be
          reported.
        </t>
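The conversion between the two weightings can be sketched as follows. The per-profile token averages used in the test (1500, 3000, 4000, and 1000 tokens per request) are illustrative assumptions chosen to reproduce the example table:

```python
def token_weights(request_weights, avg_tokens_per_request):
    """Convert request-count weights to token-volume weights.

    Both arguments map profile name to a number: the fraction of
    requests, and the mean total tokens per request.  Token weight is
    each profile's share of total token volume.
    """
    volume = {p: w * avg_tokens_per_request[p]
              for p, w in request_weights.items()}
    total = sum(volume.values())
    return {p: v / total for p, v in volume.items()}
```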
      </section>

      <section anchor="mixed-reporting" numbered="true" toc="default">
        <name>Reporting Requirements for Mixed Workloads</name>
        <t>
          When benchmarking mixed workloads, the following requirements apply.
        </t>
        <t>
          Overall aggregate metrics including total throughput and
          aggregate latency distribution MUST be reported.
        </t>
        <t>
          Per-profile metrics MUST be reported separately. A mixed
          workload benchmark that reports only aggregate metrics hides
          profile-specific performance characteristics.
        </t>
        <t>
          The interaction between profiles MUST be described. For example,
          long-running REAS requests may increase queuing delay for
          latency-sensitive CHAT requests under continuous batching.
        </t>
        <t>
          Scheduling fairness metrics from <xref target="LLM-METHOD"/> SHOULD be applied
          to mixed workloads. This shows whether any profile has worse
          latency than expected.
        </t>
        <t>
          The benchmark run parameters from <xref target="common-params"/> MUST be reported
          for the full mixed workload run. When parameters differ across
          profiles within the same run, the per-profile overrides MUST
          be reported.
        </t>
      </section>
    </section>

    <section anchor="cross-cutting" numbered="true" toc="default">
      <name>Cross-Cutting Dimensions</name>
      <t>
        The following dimensions apply to any profile in <xref target="profiles"/>.
        They describe deployment or configuration choices. They are not
        task characteristics.
      </t>

      <section anchor="batch-online" numbered="true" toc="default">
        <name>Batch vs. Online</name>
        <t>
          Any profile can be run in batch mode or online mode. Batch mode
          processes a queue of requests with throughput as the priority.
          Online mode serves interactive requests with latency as the
          priority. The profile definition gives the typical latency
          sensitivity. Deployers MAY run any profile in either mode.
        </t>
        <t>
          Benchmark reports MUST specify the deployment mode.
        </t>
        <t>
          Online mode uses open-loop load generation with a specified
          arrival rate. Latency metrics are primary.
        </t>
        <t>
          Batch mode uses closed-loop operation with maximum concurrency.
          Throughput metrics are primary.
        </t>
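        <t>
          The two driver disciplines can be sketched as follows. This is
          an illustrative sketch only; the request-sending callback and
          parameter names are assumptions of the example, not part of
          this specification:
        </t>

```python
import random

def open_loop_schedule(rate_rps, duration_s, seed=0):
    """Online mode: open-loop arrival times with exponential
    inter-arrival gaps at a fixed mean rate, generated independently
    of how quickly the server completes requests."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

def closed_loop_worker(send_request, n_requests):
    """Batch mode: closed-loop driver. Each worker issues the next
    request only after the previous response completes; run one
    worker per concurrency slot to keep the server saturated."""
    completed = 0
    for _ in range(n_requests):
        send_request()  # blocks until the full response is returned
        completed += 1
    return completed
```

        <t>
          The distinction matters because open-loop load exposes queuing
          delay under overload, while closed-loop load implicitly
          throttles itself to the server's completion rate.
        </t>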
      </section>

      <section anchor="streaming" numbered="true" toc="default">
        <name>Streaming vs. Non-Streaming</name>
        <t>
          Generative profiles in Groups B through F can be served with or
          without streaming. Streaming delivers tokens as they are
          generated. Non-streaming returns the complete response at once.
        </t>
        <t>
          With streaming, TTFT and ITL are measurable and meaningful.
          Without streaming, only total request latency is measurable.
        </t>
        <t>
          The streaming mode MUST be reported. Profiles with Interactive
          latency sensitivity (<xref target="latency-sensitivity"/>) SHOULD be tested with
          streaming enabled. Where the matrix in <xref target="applicability"/> marks TTFT or
          ITL as Primary or Applicable, those metrics apply only when
          streaming is enabled.
        </t>
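        <t>
          When streaming is enabled, TTFT and ITL follow directly from
          per-token arrival timestamps. A minimal non-normative sketch,
          assuming the client records the request send time and each
          token's client-side arrival time:
        </t>

```python
def streaming_latencies(t_send, token_arrival_times):
    """Derive TTFT, per-token ITL, and total request latency from a
    streamed response. t_send is the request send time; the arrival
    times are per-token client-side timestamps, in order. Without
    streaming, only the total latency is observable."""
    if not token_arrival_times:
        raise ValueError("no tokens received")
    ttft = token_arrival_times[0] - t_send
    itl = [b - a for a, b in
           zip(token_arrival_times, token_arrival_times[1:])]
    return {"ttft": ttft,
            "itl": itl,
            "total_latency": token_arrival_times[-1] - t_send}
```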
      </section>

      <section anchor="fim" numbered="true" toc="default">
        <name>Fill-in-the-Middle</name>
        <t>
          Fill-in-the-middle (FIM) is a variant that applies primarily to
          the COMP profile. In FIM, the model generates tokens to fill a
          gap within existing text rather than appending to the end. FIM
          changes the prefill pattern because the model receives both a
          prefix and a suffix context. FIM is not a separate profile.
        </t>
        <t>
          When a profile is tested in FIM mode, this MUST be reported.
          FIM-capable serving systems may show different prefill
          performance because the model conditions on context on both
          sides of the insertion point.
        </t>
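        <t>
          A non-normative sketch of FIM prompt assembly follows. The
          sentinel strings are illustrative placeholders only: real
          models define their own FIM control tokens, and some expect
          suffix-prefix-middle ordering instead.
        </t>

```python
def build_fim_prompt(prefix, suffix,
                     pre_tok="<FIM_PREFIX>",
                     suf_tok="<FIM_SUFFIX>",
                     mid_tok="<FIM_MIDDLE>"):
    """Assemble a prefix-suffix-middle FIM prompt: the model sees the
    text before and after the gap, then generates the middle. The
    sentinel tokens here are hypothetical placeholders."""
    return f"{pre_tok}{prefix}{suf_tok}{suffix}{mid_tok}"
```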
      </section>
    </section>

    <section anchor="security" numbered="true" toc="default">
      <name>Security Considerations</name>

      <section anchor="attack-surfaces" numbered="true" toc="default">
        <name>Workload-Specific Attack Surfaces</name>
        <t>
          Different profiles have different security implications.
        </t>
        <t>
          Decode-heavy profiles in Group E can be used as resource
          exhaustion vectors by requesting maximum output lengths.
          Benchmark setups MUST enforce output length limits consistent
          with the production configuration.
        </t>
        <t>
          Multi-step profiles in Group F may accumulate context across
          steps. Benchmark configurations MUST specify maximum steps and
          total context limits.
        </t>
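        <t>
          A harness can enforce these limits with a simple driver loop.
          This is an illustrative sketch; the step-executor callback is
          a stand-in for the harness's real per-step logic:
        </t>

```python
def run_multistep(run_step, max_steps, max_context_tokens):
    """Drive one Group F request while enforcing the configured step
    and total-context limits. run_step executes one step and returns
    the number of context tokens it added, or None when the task is
    complete."""
    context_tokens = 0
    for step in range(max_steps):
        added = run_step(step)
        if added is None:
            return {"steps": step, "context_tokens": context_tokens,
                    "stopped": "done"}
        context_tokens += added
        if context_tokens > max_context_tokens:
            return {"steps": step + 1, "context_tokens": context_tokens,
                    "stopped": "context_limit"}
    return {"steps": max_steps, "context_tokens": context_tokens,
            "stopped": "step_limit"}
```

        <t>
          Reporting which limit terminated each request makes unbounded
          context growth visible in the benchmark results.
        </t>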
        <t>
          Structured output profiles may be vulnerable to schema injection
          attacks that cause excessive backtracking during guided
          generation. Benchmarks SHOULD use representative schemas rather
          than adversarially crafted ones.
        </t>
      </section>

      <section anchor="profile-gaming" numbered="true" toc="default">
        <name>Profile Gaming</name>
        <t>
          Systems may be tuned for specific profiles at the expense of
          others. Single-profile benchmarks can be misleading. Benchmark
          reports SHOULD include results across multiple profiles that
          represent the intended production workload mix.
        </t>
      </section>

      <section anchor="data-privacy" numbered="true" toc="default">
        <name>Data Privacy in Reference Datasets</name>
        <t>
          Reference datasets for benchmarking (<xref target="appendix-a"/>) SHOULD NOT
          contain personally identifiable information (PII). When using
          conversation datasets such as those for the CHAT profile, data
          MUST be anonymized or synthetic.
        </t>
      </section>
    </section>

    <section anchor="iana" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>This document has no IANA actions.</t>
    </section>

  </middle>

  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author initials="S." surname="Bradner" fullname="S. Bradner"/>
            <date year="1997" month="March"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
        </reference>
        <reference anchor="RFC8174" target="https://www.rfc-editor.org/info/rfc8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author initials="B." surname="Leiba" fullname="B. Leiba"/>
            <date year="2017" month="May"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
        </reference>
        <reference anchor="LLM-TERMS">
          <front>
            <title>Benchmarking Terminology for Large Language Model Serving</title>
            <author initials="M." surname="Gaikwad" fullname="Madhava Gaikwad"/>
            <date year="2026" month="January"/>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-gaikwad-llm-benchmarking-terminology-00"/>
        </reference>
        <reference anchor="LLM-METHOD">
          <front>
            <title>Benchmarking Methodology for Large Language Model Serving</title>
            <author initials="M." surname="Gaikwad" fullname="Madhava Gaikwad"/>
            <date year="2026" month="January"/>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-gaikwad-llm-benchmarking-methodology-00"/>
        </reference>
      </references>
      <references>
        <name>Informative References</name>
        <reference anchor="SHAREGPT" target="https://huggingface.co/datasets/RyokoAI/ShareGPT52K">
          <front>
            <title>ShareGPT52K</title>
            <author><organization>RyokoAI</organization></author>
            <date year="2023"/>
          </front>
        </reference>
        <reference anchor="LMSYS" target="https://arxiv.org/abs/2309.11998">
          <front>
            <title>LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset</title>
            <author initials="L." surname="Zheng" fullname="Lianmin Zheng"/>
            <author initials="W." surname="Chiang" fullname="Wei-Lin Chiang"/>
            <author initials="Y." surname="Sheng" fullname="Ying Sheng"/>
            <author initials="T." surname="Li" fullname="Tianle Li"/>
            <author><organization>et al.</organization></author>
            <date year="2023" month="September"/>
          </front>
          <seriesInfo name="arXiv" value="2309.11998"/>
        </reference>
      </references>
    </references>

    <section anchor="appendix-a" numbered="true" toc="default">
      <name>Reference Datasets for Each Profile</name>
      <t>
        This appendix lists recommended datasets for building workloads
        for each profile. Implementers MAY use other datasets but MUST
        report the dataset used and its token length statistics.
      </t>
      <t>
        Implementers are responsible for checking dataset license terms
        before use. The datasets listed here are referenced for their
        technical relevance. No license terms are implied or endorsed.
        The PII requirements in <xref target="data-privacy"/> apply regardless of which
        dataset is chosen.
      </t>
      <t>
        For the PERS profile, the recommended source is synthetic persona
        datasets. Implementers MUST describe the synthetic dataset used.
        At minimum they MUST report the system prompt token length
        distribution (mean, P50, P90, P99), the vocabulary and domain
        coverage, and the generation method or tool used.
      </t>
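      <t>
        A non-normative sketch of the required length summary, using
        nearest-rank percentiles (the function name is illustrative):
      </t>

```python
def length_report(token_counts):
    """Produce the required system prompt length summary for a
    synthetic PERS dataset: mean, P50, P90, and P99 token counts
    (nearest-rank percentiles)."""
    s = sorted(token_counts)
    n = len(s)
    def pct(p):
        return s[min(n - 1, max(0, int(round(p / 100 * n)) - 1))]
    return {"mean": sum(s) / n,
            "p50": pct(50), "p90": pct(90), "p99": pct(99)}
```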
      <table align="left">
        <thead>
          <tr><th>Profile</th><th>Recommended Dataset(s)</th><th>Notes</th></tr>
        </thead>
        <tbody>
          <tr><td>EMBED</td><td>MTEB benchmark passages; MS MARCO passages</td><td>Query and passage length variants</td></tr>
          <tr><td>XRANK</td><td>MS MARCO query-passage pairs</td><td>Candidate sets of 10 to 100</td></tr>
          <tr><td>LOGPR</td><td>OpenWebText; The Pile subsets</td><td>Representative web text</td></tr>
          <tr><td>CLAS</td><td>SST-2; AG News; GoEmotions</td><td>Binary, multi-class, multi-label variants</td></tr>
          <tr><td>SFQA</td><td>Natural Questions; TriviaQA</td><td>Short-context factoid questions</td></tr>
          <tr><td>RANK</td><td>MS MARCO; TREC-DL</td><td>Relevance judgment tasks</td></tr>
          <tr><td>NER</td><td>CoNLL-2003; OntoNotes</td><td>Standard NER benchmarks</td></tr>
          <tr><td>FUNC</td><td>Berkeley Function Calling Leaderboard</td><td>Varied tool schema complexity</td></tr>
          <tr><td>SQLN</td><td>Spider; WikiSQL</td><td>Simple and complex schema variants</td></tr>
          <tr><td>CHAT</td><td>ShareGPT; LMSYS-Chat-1M</td><td>Real multi-turn conversations; MUST be anonymized</td></tr>
          <tr><td>COMP</td><td>The Stack; HumanEval</td><td>Real code files with cursor positions</td></tr>
          <tr><td>PERS</td><td>Synthetic persona datasets</td><td>Large system prompts (2K to 10K tokens); see characterization requirements above</td></tr>
          <tr><td>SUMM</td><td>CNN/DailyMail; arXiv; GovReport</td><td>Short and long document variants</td></tr>
          <tr><td>LDQA</td><td>NarrativeQA; QuALITY; Scrolls</td><td>Long-document QA benchmarks</td></tr>
          <tr><td>MDOC</td><td>Multi-XScience; WikiSum</td><td>Multi-source synthesis</td></tr>
          <tr><td>EXTR</td><td>SQuAD extractive; CORD; SROIE</td><td>Schema-constrained extraction</td></tr>
          <tr><td>JUDG</td><td>MT-Bench judgments; Arena-Hard</td><td>LLM evaluation datasets</td></tr>
          <tr><td>RAGN</td><td>Natural Questions plus retriever output</td><td>Realistic retrieved chunk assembly</td></tr>
          <tr><td>CGEN</td><td>WritingPrompts; Alpaca instructions</td><td>Creative and instructional generation</td></tr>
          <tr><td>REAS</td><td>GSM8K; MATH; ARC-Challenge</td><td>Problems requiring extended reasoning</td></tr>
          <tr><td>DGEN</td><td>Spider schema definitions; synthetically generated JSON schemas</td><td>Schema-directed generation; implementers MUST document schema generation method</td></tr>
          <tr><td>TRNS</td><td>WMT translation benchmarks</td><td>Multiple language pairs</td></tr>
          <tr><td>EDIT</td><td>JFLEG; IteraTeR</td><td>Grammar correction and style transfer</td></tr>
          <tr><td>AGNT</td><td>SWE-bench; WebArena trajectories</td><td>Multi-step tool use traces</td></tr>
          <tr><td>PLAN</td><td>ALFWorld; HotpotQA decompositions</td><td>Planning and decomposition tasks</td></tr>
        </tbody>
      </table>
    </section>

    <section anchor="appendix-b" numbered="true" toc="default">
      <name>Profile Parameter Quick Reference Table</name>
      <t>
        Sub-table 1 lists the profile identity, group, I/O ratio, and
        output constraint.
      </t>
      <table align="left">
        <thead>
          <tr>
            <th>Profile</th><th>Code</th><th>Group</th><th>I/O Ratio</th>
            <th>Output Constraint</th>
          </tr>
        </thead>
        <tbody>
          <tr><td>Embedding Generation</td><td>EMBED</td><td>A</td><td>PO</td><td>NT</td></tr>
          <tr><td>Cross-Encoder Reranking</td><td>XRANK</td><td>A</td><td>PO</td><td>NT</td></tr>
          <tr><td>Logprob Scoring</td><td>LOGPR</td><td>A</td><td>PO</td><td>NT</td></tr>
          <tr><td>Classification</td><td>CLAS</td><td>B</td><td>PD</td><td>MI</td></tr>
          <tr><td>Short-Form QA</td><td>SFQA</td><td>B</td><td>BA-PD</td><td>FF</td></tr>
          <tr><td>Generative Ranking</td><td>RANK</td><td>B</td><td>PD</td><td>ST-MI</td></tr>
          <tr><td>Entity Recognition</td><td>NER</td><td>B</td><td>PD</td><td>ST</td></tr>
          <tr><td>Function Calling</td><td>FUNC</td><td>B</td><td>PD</td><td>ST</td></tr>
          <tr><td>SQL Generation</td><td>SQLN</td><td>B</td><td>PD</td><td>ST</td></tr>
          <tr><td>Conversational Chat</td><td>CHAT</td><td>C</td><td>BA</td><td>FF</td></tr>
          <tr><td>Code Completion</td><td>COMP</td><td>C</td><td>PD-BA</td><td>SS</td></tr>
          <tr><td>Persona Dialogue</td><td>PERS</td><td>C</td><td>BA</td><td>FF</td></tr>
          <tr><td>Summarization</td><td>SUMM</td><td>D</td><td>PD</td><td>FF-SS</td></tr>
          <tr><td>Long Document QA</td><td>LDQA</td><td>D</td><td>PD</td><td>FF</td></tr>
          <tr><td>Multi-Doc Synthesis</td><td>MDOC</td><td>D</td><td>PD</td><td>SS</td></tr>
          <tr><td>Structured Extraction</td><td>EXTR</td><td>D</td><td>PD</td><td>ST</td></tr>
          <tr><td>LLM-as-Judge</td><td>JUDG</td><td>D</td><td>PD</td><td>ST-SS</td></tr>
          <tr><td>RAG</td><td>RAGN</td><td>D</td><td>PD-BA</td><td>FF</td></tr>
          <tr><td>Content Generation</td><td>CGEN</td><td>E</td><td>DD</td><td>FF</td></tr>
          <tr><td>Reasoning / CoT</td><td>REAS</td><td>E</td><td>DD</td><td>FF-SS</td></tr>
          <tr><td>Data Generation</td><td>DGEN</td><td>E</td><td>DD</td><td>ST</td></tr>
          <tr><td>Translation</td><td>TRNS</td><td>E</td><td>BA</td><td>FF</td></tr>
          <tr><td>Rewriting / Editing</td><td>EDIT</td><td>E</td><td>BA</td><td>FF</td></tr>
          <tr><td>Agentic / Tool Use</td><td>AGNT</td><td>F</td><td>BA/step</td><td>ST+FF</td></tr>
          <tr><td>Planning</td><td>PLAN</td><td>F</td><td>BA-DD</td><td>SS</td></tr>
        </tbody>
      </table>
      <t>
        Sub-table 2 repeats the profile key and lists latency class,
        concurrency, prefix sharing, and token ranges.
      </t>
      <table align="left">
        <thead>
          <tr>
            <th>Profile</th><th>Latency Class</th><th>Concurrency</th>
            <th>Prefix Sharing</th><th>Input Tokens</th><th>Output Tokens</th>
          </tr>
        </thead>
        <tbody>
          <tr><td>EMBED</td><td>NR-TO</td><td>SH</td><td>N</td><td>16-8K</td><td>N/A (vector)</td></tr>
          <tr><td>XRANK</td><td>NR</td><td>BB</td><td>D</td><td>128-1K</td><td>N/A (score)</td></tr>
          <tr><td>LOGPR</td><td>TO</td><td>SH</td><td>N</td><td>64-4K</td><td>N/A (probs)</td></tr>
          <tr><td>CLAS</td><td>NR-TO</td><td>SH</td><td>S</td><td>32-2K</td><td>1-10</td></tr>
          <tr><td>SFQA</td><td>IN-NR</td><td>SI</td><td>S</td><td>16-512</td><td>5-100</td></tr>
          <tr><td>RANK</td><td>NR</td><td>BB</td><td>D</td><td>256-4K</td><td>5-50</td></tr>
          <tr><td>NER</td><td>TO</td><td>SH</td><td>S</td><td>64-2K</td><td>20-200</td></tr>
          <tr><td>FUNC</td><td>NR</td><td>SI</td><td>H</td><td>512-4K</td><td>20-200</td></tr>
          <tr><td>SQLN</td><td>NR</td><td>SI-BB</td><td>H</td><td>256-8K</td><td>20-300</td></tr>
          <tr><td>CHAT</td><td>IN</td><td>SI</td><td>T</td><td>64-16K</td><td>50-1K</td></tr>
          <tr><td>COMP</td><td>IN</td><td>SI</td><td>D</td><td>256-8K</td><td>5-200</td></tr>
          <tr><td>PERS</td><td>IN</td><td>SI</td><td>S+T</td><td>2K-10K sys</td><td>100-2K</td></tr>
          <tr><td>SUMM</td><td>NR-TO</td><td>SI-SH</td><td>N</td><td>1K-32K</td><td>64-1K</td></tr>
          <tr><td>LDQA</td><td>NR</td><td>SI-BB</td><td>D</td><td>4K-128K</td><td>16-512</td></tr>
          <tr><td>MDOC</td><td>TO</td><td>SI</td><td>N</td><td>8K-128K</td><td>256-4K</td></tr>
          <tr><td>EXTR</td><td>NR-TO</td><td>SH</td><td>S</td><td>256-16K</td><td>50-500</td></tr>
          <tr><td>JUDG</td><td>TO</td><td>SH</td><td>S</td><td>512-8K</td><td>10-300</td></tr>
          <tr><td>RAGN</td><td>IN-NR</td><td>SI</td><td>S</td><td>528-8K</td><td>64-1K</td></tr>
          <tr><td>CGEN</td><td>IN-NR</td><td>SI</td><td>S</td><td>32-1K</td><td>256-8K</td></tr>
          <tr><td>REAS</td><td>NR</td><td>SI</td><td>S</td><td>64-2K</td><td>544-33K</td></tr>
          <tr><td>DGEN</td><td>TO</td><td>SH</td><td>S</td><td>128-2K</td><td>512-16K</td></tr>
          <tr><td>TRNS</td><td>NR-TO</td><td>SI-SH</td><td>S</td><td>32-8K</td><td>32-10K</td></tr>
          <tr><td>EDIT</td><td>NR</td><td>SI</td><td>S</td><td>64-4K</td><td>64-4K</td></tr>
          <tr><td>AGNT</td><td>NR-IN</td><td>CS</td><td>T</td><td>512-16K</td><td>20-500/step</td></tr>
          <tr><td>PLAN</td><td>NR</td><td>CS</td><td>T</td><td>128-4K</td><td>256-2K</td></tr>
        </tbody>
      </table>
    </section>

    <section anchor="acknowledgements" numbered="false" toc="default">
      <name>Acknowledgements</name>
      <t>
        The authors thank the IETF BMWG community for their foundational
        work in benchmarking methodology.
      </t>
    </section>

  </back>
</rfc>
