Media Over QUIC                                                   Y. Liu
Internet-Draft                                              Alibaba Inc.
Intended status: Standards Track                                  D. Liu
Expires: 31 December 2026                                  Alibaba Cloud
                                                            29 June 2026


                    Live Agent Interaction over MoQ
                draft-liu-moq-live-agent-interaction-00

Abstract

   This document defines a protocol for real-time interactive
   communication between users and AI agents over Media over QUIC
   Transport (MOQT).  It specifies how streaming inference outputs (ASR
   transcripts, LLM tokens, TTS audio) map to the MOQT object model,
   defines a turn-taking control protocol with barge-in support for
   voice interactions, and establishes track structure conventions for
   live agent sessions.  The protocol operates as an application-layer
   profile on top of MOQT without modifying transport semantics.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 31 December 2026.

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components


Liu & Liu               Expires 31 December 2026                [Page 1]

Internet-Draft             Live Agent over MoQ                 June 2026


   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Motivation and Use Cases  . . . . . . . . . . . . . . . .   3
       1.1.1.  Why Existing Approaches Are Insufficient  . . . . . .   4
     1.2.  Distinguishing Properties . . . . . . . . . . . . . . . .   5
     1.3.  Architecture Overview . . . . . . . . . . . . . . . . . .   6
       1.3.1.  Protocol Scope and Layering . . . . . . . . . . . . .   6
       1.3.2.  Design Principles . . . . . . . . . . . . . . . . . .   7
       1.3.3.  Protocol Components . . . . . . . . . . . . . . . . .   8
     1.4.  Conventions and Definitions . . . . . . . . . . . . . . .   9
     1.5.  Deployment Examples . . . . . . . . . . . . . . . . . . .   9
       1.5.1.  Example 1: With MoQ Relay . . . . . . . . . . . . . .   9
       1.5.2.  Example 2: Without Relay  . . . . . . . . . . . . . .  10
       1.5.3.  Protocol Compatibility  . . . . . . . . . . . . . . .  11
   2.  Object Model Mapping for Inference Streams  . . . . . . . . .  11
     2.1.  Mapping Principles  . . . . . . . . . . . . . . . . . . .  11
     2.2.  Agent Text Output Track . . . . . . . . . . . . . . . . .  12
       2.2.1.  Object Structure  . . . . . . . . . . . . . . . . . .  12
       2.2.2.  Batching Strategy . . . . . . . . . . . . . . . . . .  13
       2.2.3.  Partial and Final Semantics . . . . . . . . . . . . .  13
       2.2.4.  Group Lifecycle . . . . . . . . . . . . . . . . . . .  14
     2.3.  Agent Audio Output Track  . . . . . . . . . . . . . . . .  14
       2.3.1.  Object Structure  . . . . . . . . . . . . . . . . . .  14
       2.3.2.  Subgroup Semantics for Audio  . . . . . . . . . . . .  15
       2.3.3.  Cross-Track Synchronization . . . . . . . . . . . . .  15
     2.4.  User Audio Input Track  . . . . . . . . . . . . . . . . .  15
       2.4.1.  Object Structure  . . . . . . . . . . . . . . . . . .  15
       2.4.2.  Group Semantics . . . . . . . . . . . . . . . . . . .  15
     2.5.  Tool Output Track . . . . . . . . . . . . . . . . . . . .  16
       2.5.1.  Object Structure  . . . . . . . . . . . . . . . . . .  16
       2.5.2.  Delivery Requirements . . . . . . . . . . . . . . . .  16
   3.  Turn Control Protocol . . . . . . . . . . . . . . . . . . . .  16
     3.1.  Turn State Machine  . . . . . . . . . . . . . . . . . . .  16
     3.2.  Control Track . . . . . . . . . . . . . . . . . . . . . .  18
     3.3.  Barge-in Handling . . . . . . . . . . . . . . . . . . . .  18
       3.3.1.  Barge-in Signal Delivery  . . . . . . . . . . . . . .  19
       3.3.2.  Agent Interrupt Behavior  . . . . . . . . . . . . . .  19
       3.3.3.  Client Interrupt Behavior . . . . . . . . . . . . . .  20
     3.4.  VAD Integration . . . . . . . . . . . . . . . . . . . . .  20
     3.5.  Priority Assignment . . . . . . . . . . . . . . . . . . .  20
   4.  Track Structure and Naming  . . . . . . . . . . . . . . . . .  21
     4.1.  Namespace Convention  . . . . . . . . . . . . . . . . . .  21
     4.2.  Track Names . . . . . . . . . . . . . . . . . . . . . . .  21
     4.3.  Catalog Integration . . . . . . . . . . . . . . . . . . .  22
   5.  Delivery Policies . . . . . . . . . . . . . . . . . . . . . .  22
     5.1.  Datagram vs Stream Selection  . . . . . . . . . . . . . .  23
   6.  Relay Considerations  . . . . . . . . . . . . . . . . . . . .  23
     6.1.  Relay Transparency  . . . . . . . . . . . . . . . . . . .  23
     6.2.  Caching Behavior  . . . . . . . . . . . . . . . . . . . .  23
     6.3.  Multi-Subscriber Scenarios  . . . . . . . . . . . . . . .  24
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  24
     7.1.  Authentication and Authorization  . . . . . . . . . . . .  24
     7.2.  End-to-End Encryption . . . . . . . . . . . . . . . . . .  24
     7.3.  Privacy Considerations  . . . . . . . . . . . . . . . . .  24
     7.4.  Denial of Service . . . . . . . . . . . . . . . . . . . .  25
   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  25
     8.1.  MOQT Track Property Registrations . . . . . . . . . . . .  25
     8.2.  Control Signal Type Registry  . . . . . . . . . . . . . .  25
     8.3.  Object Payload Flags Registry . . . . . . . . . . . . . .  26
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  26
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  26
     9.2.  Informative References  . . . . . . . . . . . . . . . . .  27
   Appendix A.  Interaction Examples . . . . . . . . . . . . . . . .  27
     A.1.  Basic Voice Conversation Turn . . . . . . . . . . . . . .  28
     A.2.  Barge-in During Agent Response  . . . . . . . . . . . . .  28
     A.3.  Concurrent Text and Audio Delivery  . . . . . . . . . . .  29
   Appendix B.  Design Rationale . . . . . . . . . . . . . . . . . .  30
     B.1.  Why Not a Custom Frame Layer  . . . . . . . . . . . . . .  30
     B.2.  Why Group = Turn  . . . . . . . . . . . . . . . . . . . .  30
     B.3.  Why Separate Control Track  . . . . . . . . . . . . . . .  30
   Appendix C.  Acknowledgements . . . . . . . . . . . . . . . . . .  31
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  31

1.  Introduction

   Large Language Models (LLMs) and multimodal AI systems have enabled a
   new class of interactive applications where users communicate with AI
   agents in real-time through voice and text.  These "live agent"
   interactions share characteristics with both traditional media
   streaming and conversational protocols, but fit neatly into neither
   category.

1.1.  Motivation and Use Cases

   The following application scenarios motivate the design of a
   dedicated protocol profile for live agent interaction over MOQT:

   *  *Voice AI Assistants*: A user speaks naturally to an AI agent and
      receives spoken responses in real-time.  The agent performs
      streaming ASR on user audio, generates a response via LLM, and
      synthesizes speech (TTS) delivered with sub-second latency.  The
      user may interrupt the agent mid-response (barge-in), requiring
      immediate cessation of agent output.  This demands continuous
      bidirectional audio streaming, low time-to-first-audio latency,
      graceful interruption handling, and the ability to deliver partial
      text results ahead of audio for perceived responsiveness.

   *  *Real-Time Customer Service Agents*: In live commerce or customer
      support deployments, an AI agent handles simultaneous voice or
      text interactions with customers, accessing external tools
      (inventory lookup, order status, payment processing) and relaying
      structured results alongside natural language responses.  This
      demands reliable delivery of tool results alongside best-effort
      audio delivery, relay-based fan-out for scaling to thousands of
      concurrent sessions, and per-session isolation with independent
      priority and timeout policies.

   *  *Multimodal Scene-Aware Agents*: A user points their device camera
      at a real-world scene (e.g., a landmark, exhibit, or street sign)
      while speaking to an AI agent that acts as a digital tour guide.
      The agent subscribes to the user's audio and video input tracks,
      performs visual understanding and speech recognition jointly, and
      publishes spoken narration, text annotations, and contextual
      information about the scene.  This demands concurrent processing
      of multiple input modalities (audio + video), low-latency
      multimodal fusion at the agent backend, multiple independent
      output tracks with heterogeneous delivery requirements (reliable
      text vs. best-effort audio), and partial reliability where stale
      video frames or audio segments may be dropped without
      retransmission.

1.1.1.  Why Existing Approaches Are Insufficient

   Live agent interactions impose strict latency budgets: users expect
   sub-second time-to-first-token and time-to-first-audio for the
   interaction to feel comparable to natural conversational turn-taking.
   HTTP-based streaming approaches (SSE, WebSocket) operate over TCP,
   where head-of-line blocking, connection-level flow control, and the
   lack of stream multiplexing make it difficult to meet these latency
   targets, particularly when multiple output modalities (text, audio,
   tool results) must be delivered concurrently with independent
   priority and reliability requirements.  Furthermore, these approaches
   cannot express per-object delivery timeouts or relay-assisted fan-out
   at the transport level, forcing application-layer workarounds that
   add complexity and latency.  Purpose-built AI inference APIs operate
   in request-response or unidirectional streaming modes without support
   for concurrent input processing, turn management, or barge-in.

   Media over QUIC Transport addresses these limitations at the
   transport layer: QUIC's stream multiplexing eliminates head-of-line
   blocking between modalities, MOQT's priority system ensures latency-
   critical signals (barge-in, user audio) are scheduled first, and
   delivery timeouts allow stale data to be discarded without blocking
   fresh output.  The relay infrastructure provides scalability without
   per-connection state at the agent backend.  However, MOQT lacks
   application-layer conventions for mapping AI inference semantics onto
   these primitives.  This document fills that gap.

1.2.  Distinguishing Properties

   A live agent interaction has the following distinguishing properties:

   *  *Asymmetric streaming*: User input is continuous (audio stream),
      while agent output is incremental and multi-modal (text tokens,
      synthesized audio, tool results).

   *  *Turn-based with interruption*: Unlike media broadcast, the
      interaction follows a dialogue structure where either party can
      take or yield the floor.

   *  *Latency-critical incremental delivery*: Users perceive agent
      responsiveness through time-to-first-token and time-to-first-
      audio, requiring sub-second delivery of partial results.

   *  *Heterogeneous reliability requirements*: Within a single turn,
      partial ASR transcripts are ephemeral (can be superseded), final
      transcripts are authoritative, TTS audio is time-bounded, and tool
      results must be delivered reliably.

   Media over QUIC Transport [MOQT] provides a publish/subscribe
   protocol with features well-suited to these requirements: prioritized
   delivery, partial reliability through delivery timeouts, group-based
   object organization, and relay infrastructure for scalability.
   However, MOQT defines no application-layer semantics for mapping
   inference streams to its object model, nor for managing
   conversational turn-taking.

   This document specifies:

   *  A mapping of streaming inference outputs to the MOQT object data
      model (Section 2).

   *  A turn control protocol for managing dialogue state and handling
      barge-in interruptions (Section 3).

   *  Track structure conventions and naming for live agent sessions
      (Section 4).

   *  Delivery policies appropriate for each stream type (Section 5).

1.3.  Architecture Overview

1.3.1.  Protocol Scope and Layering

   This document defines an application-layer profile that operates on
   top of MOQT [MOQT] without modifying its transport semantics.  The
   relationship to the MoQ protocol suite is illustrated below:

  +-------------------------------------------------------------------+
  |             Application Layer (Live Agent Interaction)            |
  |                                                                   |
  |  Maps conversational structure (turns, steps, frames) onto the    |
  |  MOQT object hierarchy; adds turn-taking control via signals.     |
  +-------------------------------------------------------------------+
           |                    |                      |
           v                    v                      v
  +------------------+  +-----------------+  +-------------------+
  | MoQ Transport    |  | LOC Container   |  | MoQ Secure        |
  | (MOQT)           |  | (Audio/Video)   |  | Objects (E2E)     |
  | - Object Model   |  | - Codec framing |  | - Encryption      |
  | - Pub/Sub        |  | - Timing        |  | - Authentication  |
  | - Relay          |  |                 |  |                   |
  | - Priority       |  |                 |  |                   |
  | - Delivery       |  |                 |  |                   |
  +------------------+  +-----------------+  +-------------------+
           |
           v
  +-------------------------------------------------------------------+
  |                QUIC / WebTransport                                |
  |  - Stream multiplexing    - Datagram extension                    |
  |  - TLS 1.3 encryption     - Congestion control                    |
  |  - 0-RTT resumption       - Flow control                          |
  +-------------------------------------------------------------------+

                       Figure 1: Protocol Layering

   The following table summarizes how live agent domain concepts map to
   MOQT primitives:

    +================+===========+====================================+
    | Domain Concept | MOQT      | Semantics                          |
    |                | Primitive |                                    |
    +================+===========+====================================+
    | Conversation   | Group     | Atomic dialogue unit; GROUP_ORDER  |
    | Turn           |           | descending prioritizes latest turn |
    +----------------+-----------+------------------------------------+
    | Inference Step | Subgroup  | A sentence, audio segment, or tool |
    |                |           | call within a turn                 |
    +----------------+-----------+------------------------------------+
    | Token Batch /  | Object    | Minimum delivery unit; subject to  |
    | Audio Frame    |           | OBJECT_DELIVERY_TIMEOUT            |
    +----------------+-----------+------------------------------------+
    | Stream         | Track     | Independent subscribe, priority,   |
    | Modality       |           | and reliability per modality       |
    +----------------+-----------+------------------------------------+
    | Barge-in       | Datagram  | Highest priority (0x00); bypasses  |
    | Signal         |           | head-of-line blocking              |
    +----------------+-----------+------------------------------------+
    | Turn Control   | Control   | Reliable delivery of state-machine |
    |                | Track     | transitions                        |
    +----------------+-----------+------------------------------------+

                  Table 1: Domain Concept to MOQT Mapping

   This document:

   *  USES the MOQT object model (Track, Group, Subgroup, Object) to
      represent conversational structure.

   *  USES LOC [LOC] as the container format for audio payloads.

   *  USES MOQT native mechanisms (SUBSCRIBE, priority, delivery
      timeouts, GROUP_ORDER) for QoS enforcement.

   *  MAY USE Secure Objects [SECURE-OBJECTS] for end-to-end encryption
      of agent output through untrusted relays.

   *  DOES NOT define new transport-layer framing or modify MOQT wire
      format.

1.3.2.  Design Principles

   The protocol is guided by the following architectural principles:

   Native MOQT Integration:  Map application semantics directly to the
      MOQT object hierarchy rather than introducing intermediate framing
      layers.  This ensures MOQT relays can perform correct scheduling,
      timeout-based discard, and caching without understanding
      application-layer payload formats.

   Relay Transparency:  All protocol operations MUST work through
      unmodified MOQT relays.  The relay sees standard Tracks, Groups,
      Subgroups, and Objects with associated priorities and timeouts.
      No relay-side payload inspection is required.

   Asymmetric by Design:  The protocol explicitly models the user-to-
      agent asymmetry: user input is continuous and latency-critical for
      the agent; agent output is incremental, multi-modal, and
      interruptible.  This asymmetry is reflected in priority
      assignment, timeout configuration, and track structure.

   Latency Budget Awareness:  Every protocol mechanism is evaluated
      against its contribution to end-to-end latency.  Session setup
      adds no round trips because the existing MOQT session is reused,
      time-critical signals use datagram delivery, and batching
      strategies bound flush latency.

   Partial Reliability as a Feature:  Not all data within a turn has
      equal value.  The protocol assigns per-track and per-subgroup
      delivery timeouts that allow the transport to discard stale data
      (old audio frames, superseded partial transcripts) while
      guaranteeing delivery of authoritative results (final text, tool
      outputs).

   Modality Agnostic:  The protocol does not mandate specific codecs,
      model architectures, or inference pipelines.  It defines
      structural conventions (Group=Turn, Subgroup=inference step) that
      apply regardless of whether the agent produces text, audio, video,
      or structured data.

1.3.3.  Protocol Components

   This document comprises four logical components:

   1.  *Inference Stream Delivery* (Section 2): Defines how streaming
       outputs from ASR, LLM, and TTS pipelines map to the MOQT object
       data model.  Covers text token batching, audio segmentation, tool
       result framing, and cross-track synchronization.

   2.  *Turn Control Protocol* (Section 3): Defines the conversational
       state machine, control signal format, barge-in handling, VAD
       integration, and priority assignment for managing dialogue flow.

   3.  *Track Structure and Naming* (Section 4): Defines namespace
       conventions, standard track names, and catalog integration for
       live agent sessions.

   4.  *Delivery Policies* (Section 5): Specifies per-track timeout
       configurations and transport selection guidelines (Datagram vs.
       Stream).  Relay caching behavior is covered in Section 6.2.

1.4.  Conventions and Definitions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   The following terms are used in this document:

   Live Agent Session:  A stateful interaction between a user endpoint
      and an AI agent backend, conducted over one or more MOQT sessions.

   Turn:  A contiguous period during which one party (user or agent)
      holds the conversational floor.  Mapped to a MOQT Group.

   Inference Stream:  A sequence of incremental outputs from an AI model
      (e.g., LLM tokens, ASR transcripts, TTS audio chunks).

   Barge-in:  An event where the user begins speaking while the agent is
      still producing output, causing the agent to yield the floor.

   Partial Result:  An intermediate inference output that may be
      superseded by subsequent results (e.g., partial ASR transcript).

   Final Result:  A definitive inference output that will not be further
      modified.

1.5.  Deployment Examples

   This protocol is compatible with multiple deployment topologies.  Two
   examples are illustrated below.

1.5.1.  Example 1: With MoQ Relay

   User Device             MoQ Relay              Agent Backend
   (App/Browser)          (Cache/Fan-out)        (Omni-LLM)
        |                      |                       |
        |===== QUIC/WebTransport session =============>|
        |                      |                       |
        |--- Audio Track ----->|-------- fwd --------->|
        |--- Video Track ----->|-------- fwd --------->|
        |                      |                       |
        |<-- Audio Track ------|<------ publish -------|
        |<-- Text Track -------|<------ publish -------|
        |<-- Tool Results -----|<------ publish -------|
        |                      |                       |
        |--- Control Signals ->|-------- fwd --------->|
        |<-- Control Signals --|<------ publish -------|

                    Figure 2: Deployment with MoQ Relay

   The relay forwards user input to the agent and fans out agent output
   to subscribers.  This topology is suited for scenarios requiring:

   *  Multiple subscribers to a single agent session (monitoring,
      recording, accessibility overlays).

   *  Geographic distribution where relays are placed close to users.

   *  Caching of agent output for late-joining clients.

1.5.2.  Example 2: Without Relay

   User Device                                    Agent Backend
   (App/Browser)                                  (Omni-LLM)
        |                                              |
        |===== QUIC/WebTransport session =============>|
        |                                              |
        |--- Audio Track ----------------------------->|
        |--- Video Track ----------------------------->|
        |                                              |
        |<-- Audio Track ------------------------------|
        |<-- Text Track -------------------------------|
        |<-- Tool Results -----------------------------|
        |                                              |
        |--- Control Signals ------------------------->|
        |<-- Control Signals --------------------------|

                     Figure 3: Deployment without Relay

   The client connects directly to the agent backend.  This topology is
   suited for scenarios requiring:

   *  Minimal latency (no intermediate hop).

   *  Simpler deployment without relay infrastructure.

   *  Single-subscriber sessions (1:1 user-to-agent interactions).

1.5.3.  Protocol Compatibility

   This specification operates correctly under both topologies.  The
   application-layer semantics (track structure, turn control, object
   model mapping) are identical regardless of whether a MoQ relay is
   present:

   *  With relay: the relay handles subscription management, priority-
      based scheduling, and delivery timeout enforcement transparently.

   *  Without relay: the agent backend itself implements MOQT session
      handling.  Priority and timeout semantics still apply to the QUIC
      streams between client and agent.

   Deployments MAY combine both topologies, for example using direct
   connections for latency-sensitive single-user sessions while routing
   multi-subscriber sessions through relays.

2.  Object Model Mapping for Inference Streams

   This section defines how streaming inference outputs map to the MOQT
   object data model defined in Section 2 of [MOQT].

2.1.  Mapping Principles

   The MOQT object hierarchy consists of Track > Group > Subgroup >
   Object.  This document assigns conversational semantics to each
   level:

   +============+=====================+===============================+
   | MOQT Level | Live Agent Semantic | Rationale                     |
   +============+=====================+===============================+
   | Track      | Stream type (audio, | Independent subscription unit |
   |            | text, control)      |                               |
   +------------+---------------------+-------------------------------+
   | Group      | Conversation turn   | Atomic unit of dialogue;      |
   |            |                     | enables turn-level operations |
   +------------+---------------------+-------------------------------+
   | Subgroup   | Inference step      | Logical segment: a sentence,  |
   |            | within a turn       | an audio segment, a tool call |
   +------------+---------------------+-------------------------------+
   | Object     | Atomic delivery     | Smallest independently        |
   |            | unit                | decodable/renderable item     |
   +------------+---------------------+-------------------------------+

                  Table 2: Object Model Semantic Mapping

   This mapping enables:

   *  Subscribing from a specific turn onward (Group-based filtering).

   *  Dropping an entire stale turn when interrupted (Group-level
      discard).

   *  Prioritizing recent turns over old ones (GROUP_ORDER =
      descending).

   *  Independent reliability per inference step (Subgroup-level
      timeouts).
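
   As a non-normative illustration, the first three operations above
   can be sketched over a list of object identifiers.  The ObjectRef
   type and the helper names are hypothetical and are not defined by
   MOQT or by this document; a real endpoint would express the same
   intent through MOQT subscription filters and GROUP_ORDER rather than
   local list operations.

```python
from typing import List, NamedTuple


class ObjectRef(NamedTuple):
    """Hypothetical identifier for one MOQT Object on a track."""
    group: int      # conversation turn
    subgroup: int   # inference step within the turn
    object_id: int  # position within the subgroup


def from_turn(objs: List[ObjectRef], first_group: int) -> List[ObjectRef]:
    """Group-based filtering: keep objects from first_group onward."""
    return [o for o in objs if o.group >= first_group]


def drop_turn(objs: List[ObjectRef], stale_group: int) -> List[ObjectRef]:
    """Group-level discard: remove every object of an interrupted turn."""
    return [o for o in objs if o.group != stale_group]


def newest_turn_first(objs: List[ObjectRef]) -> List[ObjectRef]:
    """Descending group order; sorted() is stable, so the original
    order is preserved within each group."""
    return sorted(objs, key=lambda o: -o.group)


queue = [ObjectRef(1, 0, 0), ObjectRef(1, 0, 1), ObjectRef(2, 0, 0),
         ObjectRef(3, 0, 0), ObjectRef(3, 1, 0)]
```

   Because a turn is a Group, each operation needs only the Group ID;
   no payload inspection is involved.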

2.2.  Agent Text Output Track

   The agent text output track carries streaming LLM token output.

2.2.1.  Object Structure

   Each Object in the text output track carries a *token batch*: one or
   more sequential tokens generated within a single flush interval.

   Text Output Object Payload:
   +--------+--------+-------------------------------------------+
   | Field  | Type   | Description                               |
   +--------+--------+-------------------------------------------+
   | flags  | uint8  | 0x01=partial, 0x02=final, 0x04=cancelled  |
   | seq    | varint | Sequence number within subgroup           |
   | count  | varint | Number of tokens in this batch            |
   | tokens | UTF-8  | Concatenated token text                   |
   +--------+--------+-------------------------------------------+

                    Figure 4: Text Output Object Format
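
   The layout in Figure 4 can be exercised with a short, non-normative
   sketch.  It rests on two assumptions that the figure does not state:
   that "varint" means the QUIC variable-length integer encoding of
   RFC 9000, Section 16, and that the tokens field extends to the end
   of the payload.  The TextObject class and the function names are
   illustrative only.

```python
from dataclasses import dataclass
from typing import Tuple

FLAG_PARTIAL, FLAG_FINAL, FLAG_CANCELLED = 0x01, 0x02, 0x04


def encode_varint(value: int) -> bytes:
    """Encode as a QUIC variable-length integer (assumed encoding)."""
    if value < 0:
        raise ValueError("varint must be non-negative")
    if value < 1 << 6:
        return value.to_bytes(1, "big")
    if value < 1 << 14:
        return (value | 0x4000).to_bytes(2, "big")
    if value < 1 << 30:
        return (value | 0x8000_0000).to_bytes(4, "big")
    if value < 1 << 62:
        return (value | 0xC000_0000_0000_0000).to_bytes(8, "big")
    raise ValueError("varint out of range")


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next position) for the varint starting at pos."""
    first = buf[pos]
    length = 1 << (first >> 6)  # top two bits select 1, 2, 4, or 8 bytes
    if pos + length > len(buf):
        raise ValueError("truncated varint")
    value = first & 0x3F
    for byte in buf[pos + 1:pos + length]:
        value = (value << 8) | byte
    return value, pos + length


@dataclass(frozen=True)
class TextObject:
    flags: int   # FLAG_PARTIAL, FLAG_FINAL, or FLAG_CANCELLED
    seq: int     # sequence number within the subgroup
    count: int   # number of tokens in this batch
    tokens: str  # concatenated token text

    def encode(self) -> bytes:
        return (bytes([self.flags]) + encode_varint(self.seq)
                + encode_varint(self.count) + self.tokens.encode("utf-8"))

    @classmethod
    def decode(cls, payload: bytes) -> "TextObject":
        flags = payload[0]
        seq, pos = decode_varint(payload, 1)
        count, pos = decode_varint(payload, pos)
        # Assumption: the token text runs to the end of the payload.
        return cls(flags, seq, count, payload[pos:].decode("utf-8"))
```

   For example, TextObject(FLAG_PARTIAL, 300, 2, "Hello, wor") encodes
   to one flags byte, a two-byte seq, a one-byte count, and ten bytes
   of UTF-8 text.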

2.2.2.  Batching Strategy

   Implementations SHOULD batch tokens to amortize per-object overhead.
   The following strategies are RECOMMENDED:

   *  *Time-based*: Flush every 50 ms, collecting all tokens generated
      in that interval into a single Object.

   *  *Size-based*: Flush when accumulated token text reaches 128 bytes.

   *  *Semantic-based*: Flush at sentence boundaries or punctuation
      marks.

   An implementation MUST flush immediately when:

   *  The inference step completes (flags = 0x02, final).

   *  A barge-in interrupt is received (flags = 0x04, cancelled).

   *  The subgroup ends (last object in subgroup).
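
   The time-based and size-based strategies, together with the forced
   flush conditions, can be combined as in the following non-normative
   sketch.  The TokenBatcher class is hypothetical.  For brevity it
   checks the 50 ms threshold only when a token arrives; a real
   implementation would also arm a timer so that a lone token is not
   held longer than the flush interval.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FLUSH_INTERVAL_S = 0.050  # time-based threshold (50 ms)
FLUSH_BYTES = 128         # size-based threshold


@dataclass
class TokenBatcher:
    tokens: List[str] = field(default_factory=list)
    started_at: Optional[float] = None  # arrival of first buffered token

    def add(self, token: str, now: float) -> Optional[List[str]]:
        """Buffer a token; return a batch once a threshold is reached."""
        if not self.tokens:
            self.started_at = now
        self.tokens.append(token)
        size = sum(len(t.encode("utf-8")) for t in self.tokens)
        if size >= FLUSH_BYTES or now - self.started_at >= FLUSH_INTERVAL_S:
            return self.flush()
        return None

    def flush(self) -> Optional[List[str]]:
        """Forced flush: step complete, barge-in, or end of subgroup."""
        if not self.tokens:
            return None
        batch, self.tokens, self.started_at = self.tokens, [], None
        return batch
```

   The caller invokes flush() directly for the three MUST conditions
   and sets the final or cancelled flag on the resulting Object.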

2.2.3.  Partial and Final Semantics

   Within a Subgroup (inference step), Objects are delivered
   incrementally:

   *  Objects with flags=0x01 (partial) represent tokens generated so
      far.  A subscriber MAY render them immediately for real-time
      display but MUST be prepared for them to be superseded.

   *  An Object with flags=0x02 (final) indicates the inference step is
      complete.  The subscriber SHOULD treat the concatenation of all
      Objects in the Subgroup as the definitive output.

   *  An Object with flags=0x04 (cancelled) indicates the inference step
      was interrupted (e.g., by barge-in).  The subscriber SHOULD
      discard or visually mark the incomplete output.
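
   A subscriber might track these three states per Subgroup as in the
   following non-normative sketch.  The SubgroupView class is
   hypothetical, and the sketch assumes that flags can be tested as a
   bit mask and that the definitive text is the concatenation of
   Objects in seq order.

```python
from dataclasses import dataclass, field
from typing import Dict

FLAG_PARTIAL, FLAG_FINAL, FLAG_CANCELLED = 0x01, 0x02, 0x04


@dataclass
class SubgroupView:
    parts: Dict[int, str] = field(default_factory=dict)  # seq -> text
    state: str = "partial"  # "partial", "final", or "cancelled"

    def on_object(self, flags: int, seq: int, tokens: str) -> None:
        if self.state != "partial":
            return  # inference step already ended; ignore late objects
        self.parts[seq] = tokens
        if flags & FLAG_CANCELLED:
            self.state = "cancelled"
        elif flags & FLAG_FINAL:
            self.state = "final"

    def text(self) -> str:
        """Concatenate received objects in sequence order."""
        return "".join(self.parts[s] for s in sorted(self.parts))
```

   A renderer can show text() immediately while state is "partial" and
   restyle or remove it if state becomes "cancelled".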

2.2.4.  Group Lifecycle

   A new Group is created when:

   *  The agent begins responding to a new user turn.

   *  The turn counter increments (see Section 3.1).

   The Group is closed (LARGEST_OBJECT property set) when:

   *  The agent completes its full response for this turn.

   *  The agent is interrupted by barge-in (final Object has cancelled
      flag).
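
   The lifecycle above can be sketched on the publisher side as
   follows.  This is a non-normative illustration: the TextTrackGroups
   class is hypothetical, and using the turn counter directly as the
   Group ID is an assumption made for the example.

```python
from typing import Optional

FLAG_FINAL, FLAG_CANCELLED = 0x02, 0x04


class TextTrackGroups:
    """Tracks the open Group on the agent text output track."""

    def __init__(self) -> None:
        self.turn_counter = 0
        self.open_group: Optional[int] = None

    def begin_response(self) -> int:
        """Open a new Group when the agent starts answering a turn."""
        if self.open_group is not None:
            raise RuntimeError("previous Group is still open")
        self.turn_counter += 1
        # Assumption: the Group ID equals the turn counter.
        self.open_group = self.turn_counter
        return self.open_group

    def end_response(self, interrupted: bool = False) -> int:
        """Close the Group and return the flags for its last Object."""
        if self.open_group is None:
            raise RuntimeError("no open Group")
        self.open_group = None
        return FLAG_CANCELLED if interrupted else FLAG_FINAL
```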

2.3.  Agent Audio Output Track

   The agent audio output track carries TTS-synthesized audio.  Audio
   payloads SHOULD use the LOC container format [LOC] for encoding.

2.3.1.  Object Structure

   Each Object carries one audio frame (typically 20-60 ms of audio).
   The LOC header provides codec identification and timing.  This
   document adds an optional extension header for text alignment:

   Audio Output Object Payload:
   +------------------+--------+--------------------------------------+
   | Field            | Type   | Description                          |
   +------------------+--------+--------------------------------------+
   | loc_header       | LOC    | Standard LOC audio header            |
   | audio_data       | bytes  | Encoded audio samples                |
   | align_seq (opt)  | varint | Corresponding text object seq number |
   | align_offset(opt)| varint | Character offset within text object  |
   +------------------+--------+--------------------------------------+

                    Figure 5: Audio Output Object Format

   The optional alignment fields enable the subscriber to synchronize
   text highlighting with audio playback (e.g., karaoke-style display).


2.3.2.  Subgroup Semantics for Audio

   Each Subgroup in the audio track corresponds to one utterance or
   sentence boundary in the agent's response.  This enables:

   *  Dropping a complete sentence if delivery is too late
      (SUBGROUP_DELIVERY_TIMEOUT).

   *  Rendering audio sentence-by-sentence with natural pauses.

   *  Aligning with text Subgroups at sentence granularity.

2.3.3.  Cross-Track Synchronization

   The agent text track and agent audio track use the same Group ID for
   the same conversational turn.  Within a turn:

   *  Text Subgroup N corresponds to Audio Subgroup N (same sentence).

   *  The align_seq field in audio Objects references the text Object
      sequence number being spoken at that audio moment.

   This enables a subscriber receiving both tracks to:

   *  Display text as it arrives (lower latency than audio).

   *  Highlight the currently-spoken text segment during audio playback.

   *  Fall back to text-only if audio delivery times out.
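
   The highlighting step can be illustrated with the following
   non-normative sketch.  It assumes that align_seq identifies a text
   Object by its sequence number within the same turn and that
   align_offset counts characters from the start of that Object; the
   function name and the dictionary of received text Objects are
   hypothetical.

```python
from typing import Optional


def highlight_position(text_objects: dict, align_seq: int,
                       align_offset: int = 0) -> Optional[tuple]:
    """Split the turn's text into (already spoken, not yet spoken)."""
    if align_seq not in text_objects:
        return None  # referenced text Object not received yet
    ordered = sorted(text_objects)
    spoken = "".join(text_objects[s] for s in ordered if s < align_seq)
    current = text_objects[align_seq]
    spoken += current[:align_offset]
    rest = current[align_offset:]
    rest += "".join(text_objects[s] for s in ordered if s > align_seq)
    return spoken, rest
```

   For example, with text Objects {0: "The answer is", 1: " forty-two."}
   an audio Object carrying align_seq=1 and align_offset=0 splits the
   display into "The answer is" (spoken) and " forty-two." (pending).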

2.4.  User Audio Input Track

   The user publishes a continuous audio input track.

2.4.1.  Object Structure

   Each Object carries a fixed-duration audio frame (typically 20ms)
   using the LOC container format.

2.4.2.  Group Semantics

   Groups in the user audio track are segmented by voice activity:

   *  A new Group begins when the user starts speaking (VAD trigger).

   *  The Group ends when the user stops speaking (silence detection).

   This enables the agent backend to:


   *  Subscribe starting from the latest Group (skip silence gaps).

   *  Process each utterance as a unit.

   *  Implement endpoint detection without additional signaling.
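
   The mapping from voice activity to Groups can be sketched as
   follows.  This non-normative example assumes one VAD decision per
   audio frame and assumes that frames classified as silence are not
   published; assign_groups and its arguments are hypothetical.

```python
def assign_groups(vad_flags: list, first_group: int = 0) -> list:
    """Map per-frame VAD decisions to (frame_index, group_id, object_id)."""
    out = []
    group = first_group - 1
    obj = 0
    in_speech = False
    for i, speech in enumerate(vad_flags):
        if speech and not in_speech:
            group += 1   # speech onset opens a new Group
            obj = 0
        if speech:
            out.append((i, group, obj))
            obj += 1
        in_speech = speech  # a silent frame ends the current Group
    return out
```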

2.5.  Tool Output Track

   The agent MAY publish a tool output track for structured results from
   tool/function calls.

2.5.1.  Object Structure

   Tool Output Object Payload:
   +-----------+--------+------------------------------------------+
   | Field     | Type   | Description                              |
   +-----------+--------+------------------------------------------+
   | flags     | uint8  | 0x01=invocation, 0x02=result, 0x04=error |
   | tool_id   | varint | Tool/function identifier                 |
   | call_id   | varint | Unique call instance identifier          |
   | payload   | bytes  | JSON-encoded tool call or result         |
   +-----------+--------+------------------------------------------+

                    Figure 6: Tool Output Object Format

2.5.2.  Delivery Requirements

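   Tool outputs MUST be delivered reliably (no delivery timeout).  Each
   tool invocation, result, or error Object is complete in itself and
   is never partial or superseded; its flags field uses the values in
   Figure 6, not the partial/final/cancelled flags of Section 2.2.3.
   The subscriber MUST NOT discard tool results due to lateness.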
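
   A subscriber can pair results with invocations using call_id, as
   in the non-normative sketch below.  The ToolObject and
   ToolCallTracker names are hypothetical, and the Object is shown
   already parsed into fields rather than in wire format.

```python
import json
from dataclasses import dataclass

INVOCATION, RESULT, ERROR = 0x01, 0x02, 0x04  # flags from Figure 6


@dataclass
class ToolObject:
    flags: int
    tool_id: int
    call_id: int
    payload: bytes  # JSON-encoded tool call or result


class ToolCallTracker:
    """Pairs result/error Objects with their invocation by call_id."""

    def __init__(self) -> None:
        self.pending = {}  # call_id -> decoded invocation payload

    def on_object(self, obj: ToolObject):
        body = json.loads(obj.payload) if obj.payload else None
        if obj.flags == INVOCATION:
            self.pending[obj.call_id] = body
            return None
        # RESULT or ERROR: matched whenever it arrives, never dropped as late.
        request = self.pending.pop(obj.call_id, None)
        return {"tool_id": obj.tool_id, "request": request,
                "ok": obj.flags == RESULT, "response": body}
```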

3.  Turn Control Protocol

   This section defines the control protocol for managing conversational
   turns between the user and agent.

3.1.  Turn State Machine

   A live agent session maintains the following turn states:


                       speech_start
            +--------+----------->+---------+
            |  IDLE  |            |  USER   |
            |        |<-----------+ SPEAKING|
            +---+----+ speech_end +----+----+
                ^                      |
                |                      | (agent begins inference)
                |                      v
                |               +------+------+
                | turn_complete |   AGENT     |
                +--------------+  PROCESSING |
                |               +------+------+
                |                      |
                |                      | (first output produced)
                |                      v
                |               +------+------+
                | turn_complete |   AGENT     |<---+
                +--------------+  SPEAKING   |    | (output continues)
                                +------+------+----+
                                       |
                       barge_in        |
                     +--------+       |
                     |  USER  |<------+
                     |SPEAKING|
                     +--------+

                        Figure 7: Turn State Machine

   State transitions:

   *  *IDLE → USER_SPEAKING*: User audio VAD detects speech onset.

   *  *USER_SPEAKING → AGENT_PROCESSING*: User speech ends (silence
      timeout or explicit end-of-turn signal).

   *  *AGENT_PROCESSING → AGENT_SPEAKING*: Agent produces first output
      Object in any output track.

   *  *AGENT_SPEAKING → IDLE*: Agent completes response (closes Group in
      all output tracks).

   *  *AGENT_SPEAKING → USER_SPEAKING*: Barge-in event (user starts
      speaking while agent is outputting).
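
   The transitions listed above can be expressed as a lookup table.
   The sketch below is non-normative; the event name "first_output" is
   an assumption of this example (Figure 7 labels that edge only with
   a description), and only the transitions in the list are encoded.

```python
from enum import Enum


class TurnState(Enum):
    IDLE = "IDLE"
    USER_SPEAKING = "USER_SPEAKING"
    AGENT_PROCESSING = "AGENT_PROCESSING"
    AGENT_SPEAKING = "AGENT_SPEAKING"


# (current state, event) -> next state, following the list above.
TRANSITIONS = {
    (TurnState.IDLE, "speech_start"): TurnState.USER_SPEAKING,
    (TurnState.USER_SPEAKING, "speech_end"): TurnState.AGENT_PROCESSING,
    (TurnState.AGENT_PROCESSING, "first_output"): TurnState.AGENT_SPEAKING,
    (TurnState.AGENT_SPEAKING, "turn_complete"): TurnState.IDLE,
    (TurnState.AGENT_SPEAKING, "barge_in"): TurnState.USER_SPEAKING,
}


def next_state(state: TurnState, event: str) -> TurnState:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None
```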


3.2.  Control Track

   Turn control signals are exchanged on a dedicated pair of control
   tracks, one per direction (control/user and control/agent; see
   Section 4.2).  Control Objects use the following format:

   Control Object Payload:
   +-----------+--------+------------------------------------------+
   | Field     | Type   | Description                              |
   +-----------+--------+------------------------------------------+
   | signal    | varint | Signal type (see below)                  |
   | turn_id   | varint | Current turn Group ID                    |
   | timestamp | varint | Sender wall-clock time (ms since epoch)  |
   | payload   | bytes  | Signal-specific data (may be empty)      |
   +-----------+--------+------------------------------------------+

   Signal Types:
     0x01 = SPEECH_START     (user → agent)
     0x02 = SPEECH_END       (user → agent)
     0x03 = BARGE_IN         (user → agent)
     0x04 = TURN_STARTED     (agent → user)
     0x05 = TURN_COMPLETE    (agent → user)
     0x06 = INTERRUPT_ACK    (agent → user)
     0x07 = THINKING         (agent → user)

                      Figure 8: Control Signal Format
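
   As a non-normative illustration, the Control Object payload could
   be serialized as shown below.  The sketch assumes that "varint"
   means the QUIC variable-length integer encoding (Section 16 of
   [QUIC]) and that the payload field occupies all bytes after the
   timestamp; the function names are hypothetical.

```python
SPEECH_START, SPEECH_END, BARGE_IN = 0x01, 0x02, 0x03
TURN_STARTED, TURN_COMPLETE, INTERRUPT_ACK, THINKING = 0x04, 0x05, 0x06, 0x07


def encode_varint(value: int) -> bytes:
    """Encode a QUIC variable-length integer (assumed varint format)."""
    if value < 0:
        raise ValueError("varint must be non-negative")
    for prefix, length in ((0, 1), (1, 2), (2, 4), (3, 8)):
        if value < 1 << (8 * length - 2):
            raw = value.to_bytes(length, "big")
            return bytes([raw[0] | (prefix << 6)]) + raw[1:]
    raise ValueError("varint out of range")


def decode_varint(buf: bytes, pos: int = 0) -> tuple:
    """Decode one varint at `pos`; return (value, next position)."""
    if pos >= len(buf):
        raise ValueError("truncated varint")
    length = 1 << (buf[pos] >> 6)
    if pos + length > len(buf):
        raise ValueError("truncated varint")
    raw = int.from_bytes(buf[pos:pos + length], "big")
    return raw & ((1 << (8 * length - 2)) - 1), pos + length


def encode_control(signal: int, turn_id: int, timestamp_ms: int,
                   payload: bytes = b"") -> bytes:
    return (encode_varint(signal) + encode_varint(turn_id)
            + encode_varint(timestamp_ms) + payload)


def decode_control(data: bytes) -> tuple:
    signal, pos = decode_varint(data, 0)
    turn_id, pos = decode_varint(data, pos)
    timestamp_ms, pos = decode_varint(data, pos)
    return signal, turn_id, timestamp_ms, data[pos:]
```

   Under these assumptions, a BARGE_IN signal with an empty payload
   and a millisecond wall-clock timestamp occupies 10 bytes, which
   fits comfortably in a single Datagram.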

3.3.  Barge-in Handling

   Barge-in is the critical interaction where a user interrupts the
   agent's ongoing output.  The protocol defines the following sequence:


   User Device                                  Agent Backend
        |                                            |
        |  [user starts speaking over agent output]  |
        |                                            |
        |-- Control: BARGE_IN (turn_id=N) ---------> |
        |         (via Datagram, highest priority)   |
        |                                            |
        |       [agent stops TTS, notes position]    |
        |                                            |
        |<- Control: INTERRUPT_ACK (turn_id=N) ----- |
        |     payload: {interrupted_group: N,        |
        |              interrupted_subgroup: M,      |
        |              interrupted_object: K}        |
        |                                            |
        |  [agent closes Group N with cancelled flag]|
        |                                            |
        |<- Text Object: flags=cancelled ----------- |
        |<- Audio: subgroup FIN -------------------- |
        |                                            |
        |  [agent begins processing new user input]  |
        |                                            |

                        Figure 9: Barge-in Sequence

3.3.1.  Barge-in Signal Delivery

   The BARGE_IN signal has the following delivery requirements:

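   *  MUST be sent via MOQT Datagram for minimum latency, unless the
      signal does not fit in a single Datagram, in which case the
      Stream fallback of Section 5.1 applies.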

   *  MUST be assigned the highest publisher priority (0x00).

   *  SHOULD be sent immediately upon local VAD detection, without
      waiting for speech_end.

   *  The agent MUST process BARGE_IN within one processing cycle
      (target: < 50ms from receipt to output cessation).

3.3.2.  Agent Interrupt Behavior

   Upon receiving BARGE_IN, the agent MUST:

   1.  Cease generating new output Objects for the current turn.

   2.  Send INTERRUPT_ACK with the position where output stopped.

   3.  Close the current output Group with a cancelled Object
       (flags=0x04 in text track, stream FIN in audio track).


   4.  Transition to processing the new user input.

   The agent SHOULD NOT:

   *  Truncate output in the middle of an audio frame; it SHOULD finish
      the current audio Object first.

   *  Discard context from the interrupted response; the interrupted
      output remains in the agent's context window for the next turn.
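
   The agent-side handling can be summarized by the following
   non-normative sketch, which returns the actions in the order shown
   in Figure 9.  The action tuples, the JSON encoding of the
   INTERRUPT_ACK payload, the Object ID chosen for the cancelled text
   Object, and the numbering of the next turn are all assumptions of
   this example.

```python
import json

INTERRUPT_ACK = 0x06  # control signal type
CANCELLED = 0x04      # text Object flag


def handle_barge_in(turn_id: int, subgroup: int, last_object: int) -> list:
    """Return the ordered actions an agent takes after receiving BARGE_IN."""
    position = {"interrupted_group": turn_id,
                "interrupted_subgroup": subgroup,
                "interrupted_object": last_object}
    return [
        ("stop_generation", turn_id),
        ("control", INTERRUPT_ACK, turn_id, json.dumps(position).encode()),
        # Empty cancelled Object closes the text Subgroup (assumed ID).
        ("text_object", turn_id, subgroup, last_object + 1, CANCELLED, b""),
        ("audio_fin", turn_id, subgroup),
        ("begin_new_turn", turn_id + 1),
    ]
```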

3.3.3.  Client Interrupt Behavior

   Upon sending BARGE_IN, the client SHOULD:

   *  Immediately stop audio playback of the agent's output.

   *  Visually indicate the response was interrupted (e.g., fade text).

   *  Begin capturing and publishing user audio for the new turn.

3.4.  VAD Integration

   Speech activity detection events drive the turn state machine.  This
   document does not mandate a specific detection algorithm (traditional
   energy-based VAD, neural VAD, or other approaches) but defines the
   signaling semantics:

   *  *SPEECH_START*: Published when the implementation determines that
      the user has begun speaking.

   *  *SPEECH_END*: Published when the implementation determines that
      the user has finished speaking.

   *  *BARGE_IN*: Published when SPEECH_START occurs during
      AGENT_SPEAKING state.  This is a composite signal (implies
      SPEECH_START + interrupt request).

   VAD signals are sent on the user→agent control track.
   Implementations MAY perform VAD on the client, on the relay, or on
   the agent backend.  When VAD is performed on the client, the
   resulting signals SHOULD be sent as Datagrams for lowest latency.
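
   The signaling semantics above can be sketched as follows.  This
   non-normative example takes per-frame speech decisions from any
   detector, declares the end of speech after a run of silent frames
   (the hangover_frames parameter is an assumed silence timeout), and
   emits BARGE_IN instead of SPEECH_START when speech begins while the
   agent is speaking.  All names are hypothetical.

```python
SPEECH_START, SPEECH_END, BARGE_IN = 0x01, 0x02, 0x03


def vad_signals(speech: list, agent_speaking: list,
                hangover_frames: int = 2) -> list:
    """Return (frame_index, signal) pairs for the given per-frame inputs."""
    events = []
    in_speech = False
    silent_run = 0
    for i, (is_speech, agent_busy) in enumerate(zip(speech, agent_speaking)):
        if is_speech:
            silent_run = 0
            if not in_speech:
                in_speech = True
                # Composite signal: BARGE_IN implies SPEECH_START.
                events.append((i, BARGE_IN if agent_busy else SPEECH_START))
        elif in_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                in_speech = False
                events.append((i, SPEECH_END))
    return events
```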

3.5.  Priority Assignment

   The following priority assignments are RECOMMENDED for live agent
   sessions (lower numeric value = higher priority):


     +============================+==========+======================+
     | Track/Signal               | Priority | Rationale            |
     +============================+==========+======================+
     | Control signals (BARGE_IN) | 0x00     | Must preempt all     |
     |                            |          | other traffic        |
     +----------------------------+----------+----------------------+
     | Control signals (other)    | 0x01     | Turn management is   |
     |                            |          | time-critical        |
     +----------------------------+----------+----------------------+
     | User audio input           | 0x02     | Agent cannot process |
     |                            |          | without input        |
     +----------------------------+----------+----------------------+
     | Agent audio output         | 0x03     | Primary user-        |
     |                            |          | perceived output     |
     +----------------------------+----------+----------------------+
     | Agent text output          | 0x04     | Secondary output     |
     |                            |          | (lower bandwidth)    |
     +----------------------------+----------+----------------------+
     | Tool results               | 0x05     | Non-time-critical    |
     |                            |          | structured data      |
     +----------------------------+----------+----------------------+

                 Table 3: Recommended Priority Assignment

   Within agent output tracks, GROUP_ORDER SHOULD be set to descending
   (deliver newest group first) so that a congested relay drops stale
   turns rather than the current one.
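
   The combined effect of the priorities in Table 3 and descending
   group order can be illustrated with a small scheduling sketch.  It
   is non-normative: the dictionary keys are hypothetical labels for
   the rows of Table 3, and actual scheduling is performed by the MOQT
   implementation.

```python
import heapq

PRIORITY = {"control/barge_in": 0x00, "control": 0x01, "input/audio": 0x02,
            "output/audio": 0x03, "output/text": 0x04, "output/tool": 0x05}


def send_order(pending: list) -> list:
    """pending: (kind, group_id, label) items; returns labels in send order."""
    heap = []
    for seq, (kind, group, label) in enumerate(pending):
        # Agent output tracks use descending group order (newest first).
        group_key = -group if kind.startswith("output/") else group
        heap.append((PRIORITY[kind], group_key, seq, label))
    heapq.heapify(heap)
    return [heapq.heappop(heap)[3] for _ in range(len(heap))]
```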

4.  Track Structure and Naming

4.1.  Namespace Convention

   A live agent session uses the following namespace structure:

   Track Namespace: moqt://{authority}/agent/{session-id}/

   Where:

   *  {authority} is the domain of the agent service.

   *  {session-id} is a unique session identifier (RECOMMENDED: UUIDv7).

4.2.  Track Names

   The following track names are defined within a session namespace:


      +===============+==============+==============================+
      | Track Name    | Direction    | Content                      |
      +===============+==============+==============================+
      | input/audio   | User → Agent | User microphone audio (LOC)  |
      +---------------+--------------+------------------------------+
      | input/text    | User → Agent | User text messages           |
      +---------------+--------------+------------------------------+
      | output/audio  | Agent → User | TTS synthesized audio (LOC)  |
      +---------------+--------------+------------------------------+
      | output/text   | Agent → User | Streaming LLM text tokens    |
      +---------------+--------------+------------------------------+
      | output/tool   | Agent → User | Tool invocations and results |
      +---------------+--------------+------------------------------+
      | control/user  | User → Agent | User control signals         |
      +---------------+--------------+------------------------------+
      | control/agent | Agent → User | Agent control signals        |
      +---------------+--------------+------------------------------+

                       Table 4: Standard Track Names

   Additional tracks MAY be defined for:

   *  input/video: User camera input.

   *  output/video: Agent avatar or visual output.

   *  meta/catalog: Session catalog in MSF format [MSF].
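
   The naming convention can be sketched as follows.  This is a
   non-normative example: the helper names are hypothetical, and the
   authority "agent.example" and session ID "session-1234" used in the
   usage note are placeholders rather than values defined by this
   document.

```python
# Track name -> publishing role, from Table 4.
STANDARD_TRACKS = {
    "input/audio": "user", "input/text": "user", "control/user": "user",
    "output/audio": "agent", "output/text": "agent", "output/tool": "agent",
    "control/agent": "agent",
}


def session_namespace(authority: str, session_id: str) -> str:
    """Build the Track Namespace for one live agent session."""
    return f"moqt://{authority}/agent/{session_id}/"


def tracks_for_role(role: str, namespace: str) -> list:
    """Return (namespace, track name) pairs published by `role`."""
    return sorted((namespace, name)
                  for name, who in STANDARD_TRACKS.items() if who == role)
```

   For example, session_namespace("agent.example", "session-1234")
   yields "moqt://agent.example/agent/session-1234/", and the user
   endpoint publishes control/user, input/audio, and input/text under
   that namespace.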

4.3.  Catalog Integration

   A live agent session SHOULD publish a catalog track conforming to the
   MOQT Streaming Format [MSF].  The catalog declares:

   *  Available tracks and their codec parameters.

   *  Agent capabilities (supported input modalities, languages).

   *  Session metadata (model identifier, context window size).

   The catalog enables late-joining subscribers and relay-assisted
   discovery of session characteristics.

5.  Delivery Policies


5.1.  Datagram vs Stream Selection

    +============+===================+==========+=====================+
    | Track      | Default Transport | Fallback | Condition           |
    +============+===================+==========+=====================+
    | control/*  | Datagram          | Stream   | Signal < MTU        |
    | (BARGE_IN) |                   |          |                     |
    +------------+-------------------+----------+---------------------+
    | control/*  | Stream            | —        | Reliable delivery   |
    | (other)    |                   |          | needed              |
    +------------+-------------------+----------+---------------------+
    | input/     | Stream            | Datagram | If partial          |
    | audio      |                   |          | reliability desired |
    +------------+-------------------+----------+---------------------+
    | output/    | Stream            | Datagram | For loss-tolerant   |
    | audio      |                   |          | low-latency         |
    +------------+-------------------+----------+---------------------+
    | output/    | Stream            | —        | Must be reliable    |
    | text       |                   |          |                     |
    +------------+-------------------+----------+---------------------+
    | output/    | Stream            | —        | Must be reliable    |
    | tool       |                   |          |                     |
    +------------+-------------------+----------+---------------------+

                  Table 5: Transport Selection Guidelines
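
   Table 5 can be read as the following decision function.  The sketch
   is non-normative; the function name, the prefer_datagram switch
   (standing in for the "Condition" column of the audio rows), and the
   default max_datagram value of 1200 bytes are assumptions of this
   example.

```python
def select_transport(track: str, signal: str = "", size: int = 0,
                     max_datagram: int = 1200,
                     prefer_datagram: bool = False) -> str:
    """Return "datagram" or "stream" for an Object on `track`."""
    if track.startswith("control/"):
        if signal == "BARGE_IN":
            # Datagram by default; Stream only if the signal does not fit.
            return "datagram" if size < max_datagram else "stream"
        return "stream"  # other control signals need reliable delivery
    if track in ("input/audio", "output/audio"):
        return "datagram" if prefer_datagram else "stream"
    if track in ("output/text", "output/tool"):
        return "stream"  # must be reliable, no fallback
    raise ValueError(f"no guideline for track {track!r}")
```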

6.  Relay Considerations

6.1.  Relay Transparency

   This protocol is designed to operate through standard MOQT relays
   without relay modification.  Relays treat live agent traffic as
   normal MOQT objects with the following beneficial behaviors:

   *  *Priority-based scheduling*: Relays respect publisher priority,
      ensuring control signals and user audio are forwarded first under
      congestion.

   *  *Timeout-based expiry*: Relays discard Objects that exceed their
      delivery timeout, preventing stale audio from consuming bandwidth.

   *  *Group-order delivery*: With descending group order, relays under
      congestion naturally shed older turns.

6.2.  Caching Behavior

   Relays MAY cache agent output Objects for the duration specified by
   the MAX_CACHE_DURATION track property.  This enables:


   *  Late-joining clients to receive the current turn's output.

   *  Reconnecting clients to resume from where they left off.

   Relays SHOULD NOT cache control track Objects, which are ephemeral
   state transitions.

   Relays MUST NOT cache user audio input beyond the immediate delivery
   requirement (see Section 7.3); it is privacy-sensitive and has a
   single consumer.

6.3.  Multi-Subscriber Scenarios

   A single agent session MAY have multiple subscribers to output tracks
   (e.g., accessibility tools, monitoring, recording).  The relay
   naturally fans out agent output to all subscribers without additional
   agent-side overhead.

7.  Security Considerations

7.1.  Authentication and Authorization

   Live agent sessions MUST authenticate both the user and agent
   endpoints.  The MOQT AUTHORIZATION_TOKEN parameter (Section 10.2.2 of
   [MOQT]) SHOULD be used for per-track authorization.

   User audio input tracks contain sensitive biometric data and MUST be
   restricted to the intended agent subscriber.  Relays MUST enforce
   subscription authorization for input tracks.

7.2.  End-to-End Encryption

   For deployments where relay operators are not fully trusted, agent
   output tracks MAY use end-to-end encryption as defined in
   [SECURE-OBJECTS].  Control tracks SHOULD NOT be E2E encrypted as
   relay-level inspection may be needed for priority enforcement.

7.3.  Privacy Considerations

   *  User audio MUST NOT be cached by relays beyond the immediate
      delivery requirement.

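   *  Session IDs MUST be unpredictable to prevent session correlation
      attacks.  When UUIDv7 is used, its random component MUST come
      from a cryptographically secure random number generator.  Note
      that a UUIDv7 also embeds its creation timestamp, which is
      visible to any party that observes the ID.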

   *  Control signals (VAD events, barge-in) leak interaction timing
      metadata.  Implementations MAY add padding to control track
      Objects to mitigate traffic analysis.


7.4.  Denial of Service

   *  Barge-in signals are high-priority and processed immediately.
      Implementations MUST rate-limit barge-in signals per session
      (RECOMMENDED: maximum 10 per second) to prevent priority inversion
      attacks.

   *  Relays SHOULD enforce per-session bandwidth quotas to prevent a
      single agent session from starving other traffic.
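
   One possible way to apply the recommended barge-in limit is a
   sliding one-second window per session, as in the non-normative
   sketch below.  The BargeInLimiter class and the caller-supplied
   millisecond clock are assumptions of this example; other rate-
   limiting algorithms are equally valid.

```python
from collections import deque


class BargeInLimiter:
    """Allows at most `max_per_second` BARGE_IN signals in any 1 s window."""

    def __init__(self, max_per_second: int = 10) -> None:
        self.max_per_second = max_per_second
        self._recent = deque()  # arrival times (ms) of accepted signals

    def allow(self, now_ms: int) -> bool:
        # Forget signals that fell out of the one-second window.
        while self._recent and now_ms - self._recent[0] >= 1000:
            self._recent.popleft()
        if len(self._recent) >= self.max_per_second:
            return False
        self._recent.append(now_ms)
        return True
```

   An implementation would keep one limiter per session and ignore
   (or log) BARGE_IN signals for which allow() returns False.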

8.  IANA Considerations

8.1.  MOQT Track Property Registrations

   This document registers the following track properties in the "MOQT
   Track Properties" registry:

    +====================+=============+========+====================+
    | Property Name      | Property ID | Type   | Description        |
    +====================+=============+========+====================+
    | AGENT_SESSION_ROLE | TBD         | varint | 0=user, 1=agent    |
    +--------------------+-------------+--------+--------------------+
    | TURN_GROUP_ORDER   | TBD         | varint | Confirms           |
    |                    |             |        | Group=Turn mapping |
    +--------------------+-------------+--------+--------------------+

                  Table 6: Track Property Registrations

8.2.  Control Signal Type Registry

   IANA is requested to create a "Live Agent Control Signal Types"
   registry under the "Media over QUIC (MoQ)" group.  The registration
   procedure is Specification Required.

   Initial registrations:


                  +=======+===============+=============+
                  | Value | Signal Name   | Reference   |
                  +=======+===============+=============+
                  | 0x01  | SPEECH_START  | Section 3.4 |
                  +-------+---------------+-------------+
                  | 0x02  | SPEECH_END    | Section 3.4 |
                  +-------+---------------+-------------+
                  | 0x03  | BARGE_IN      | Section 3.3 |
                  +-------+---------------+-------------+
                  | 0x04  | TURN_STARTED  | Section 3.1 |
                  +-------+---------------+-------------+
                  | 0x05  | TURN_COMPLETE | Section 3.1 |
                  +-------+---------------+-------------+
                  | 0x06  | INTERRUPT_ACK | Section 3.3 |
                  +-------+---------------+-------------+
                  | 0x07  | THINKING      | Section 3.2 |
                  +-------+---------------+-------------+

                   Table 7: Control Signal Type Registry

   Values 0x08-0xFF are available for assignment.

8.3.  Object Payload Flags Registry

   IANA is requested to create a "Live Agent Object Flags" registry.

   Initial registrations:

         +=====+===========+=========================+===========+
         | Bit | Flag Name | Description             | Reference |
         +=====+===========+=========================+===========+
         | 0   | PARTIAL   | Object is intermediate, | This      |
         |     |           | may be superseded       | document  |
         +-----+-----------+-------------------------+-----------+
         | 1   | FINAL     | Object is definitive    | This      |
         |     |           |                         | document  |
         +-----+-----------+-------------------------+-----------+
         | 2   | CANCELLED | Object indicates        | This      |
         |     |           | interruption            | document  |
         +-----+-----------+-------------------------+-----------+

                       Table 8: Object Flags Registry

9.  References

9.1.  Normative References


   [LOC]      Zanaty, M., Nandakumar, S., and P. Thatcher, "Low Overhead
              Media Container", Work in Progress, Internet-Draft, draft-
              ietf-moq-loc-02, 15 March 2026,
              <https://datatracker.ietf.org/doc/html/draft-ietf-moq-loc-
              02>.

   [MOQT]     Nandakumar, S., Vasiliev, V., Swett, I., and A. Frindell,
              "Media over QUIC Transport", Work in Progress, Internet-
              Draft, draft-ietf-moq-transport-18, 12 May 2026,
              <https://datatracker.ietf.org/doc/html/draft-ietf-moq-
              transport-18>.

   [QUIC]     Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based
              Multiplexed and Secure Transport", RFC 9000,
              DOI 10.17487/RFC9000, May 2021,
              <https://www.rfc-editor.org/rfc/rfc9000>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/rfc/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.

9.2.  Informative References

   [A2A]      Liu, D. and S. Krishnan, "Agent Protocol over MoQ", Work
              in Progress, Internet-Draft, draft-liu-agent-protocol-
              over-moq-00, 2 March 2026,
              <https://datatracker.ietf.org/doc/html/draft-liu-agent-
              protocol-over-moq-00>.

   [MSF]      Law, W. and S. Nandakumar, "MOQT Streaming Format", Work
              in Progress, Internet-Draft, draft-ietf-moq-msf-01, 2 June
              2026, <https://datatracker.ietf.org/doc/html/draft-ietf-
              moq-msf-01>.

   [SECURE-OBJECTS]
              Jennings, C. F., Nandakumar, S., and R. Barnes, "End-to-
              End Secure Objects for Media over QUIC Transport", Work in
              Progress, Internet-Draft, draft-ietf-moq-secure-objects-
              00, 2 March 2026, <https://datatracker.ietf.org/doc/html/
              draft-ietf-moq-secure-objects-00>.

Appendix A.  Interaction Examples


A.1.  Basic Voice Conversation Turn

Time  User Device              Relay          Agent Backend
 |
 |    [User speaks: "What's the weather?"]
 |
 t0   PUBLISH input/audio Group=1 ------>-----> ASR processes
 t0   Control: SPEECH_START ----------->------>
 |
 t1   [User stops speaking]
 t1   Control: SPEECH_END ------------->------>
 |                                              LLM generates response
 t2                            <------<------- Control: TURN_STARTED
 t2                            <------<------- PUBLISH output/text
 |                                              Group=1, Subgroup=0
 |                                              Object 0: "The weather"
 |                                              Object 1: " in Hangzhou"
 |                                              Object 2: " is sunny,"
 t3                            <------<------- PUBLISH output/audio
 |                                              Group=1, Subgroup=0
 |                                              [TTS: "The weather in
 |                                               Hangzhou is sunny,"]
 |
 t4                            <------<------- Object (text, final):
 |                                              " 28°C today."
 t4                            <------<------- Control: TURN_COMPLETE

                 Figure 10: Basic Voice Turn Example

A.2.  Barge-in During Agent Response


Time  User Device              Relay          Agent Backend
 |
 |    [Agent is speaking: "The weather forecast shows..."]
 |    [Agent output: Group=1, currently at Subgroup=2]
 |
 t0   [User interrupts: "Stop, just tell me temperature"]
 t0   Control: BARGE_IN (turn=1) ----->------> received
 |
 t1                                            [stops TTS generation]
 t1                            <------<------- Control: INTERRUPT_ACK
 |                                              {interrupted: G=1,SG=2,O=5}
 t1                            <------<------- Text Object(cancelled)
 t1                            <------<------- Audio subgroup FIN
 |
 t2   PUBLISH input/audio Group=2 ---->------> ASR: "just tell me temp"
 t2   Control: SPEECH_START ---------->------>
 |
 t3   Control: SPEECH_END ------------>------>
 |                                              LLM: context includes
 |                                              interrupted response
 t4                            <------<------- Control: TURN_STARTED
 t4                            <------<------- Text Group=2: "It's 28°C."
 t4                            <------<------- Audio Group=2: [TTS]
 t5                            <------<------- Control: TURN_COMPLETE

                     Figure 11: Barge-in Example

A.3.  Concurrent Text and Audio Delivery

   Time  Subscriber View (User Device)
    |
    t0   [Subscribe to output/text AND output/audio, same Group ID]
    |
    t1   Text Object arrives: "The answer is"     → render immediately
    t2   Text Object arrives: " forty-two."       → append to display
    |
    t3   Audio Object arrives: [TTS "The answer"] → begin playback
    |    Text highlighting: "The answer" underlined (via align_seq)
    |
    t4   Audio Object arrives: [TTS "is forty"]   → continue playback
    |    Text highlighting advances: "is forty"
    |
    t5   Audio Object arrives: [TTS "-two."]      → finish playback
    |    Text highlighting: "-two."
    |
    |    [Text arrived ~200ms before audio — user saw text first,
    |     then heard it spoken, with synchronized highlighting]


               Figure 12: Cross-Track Synchronization Example

Appendix B.  Design Rationale

B.1.  Why Not a Custom Frame Layer

   This document maps directly to the native MOQT object model rather
   than introducing a custom frame layer because:

   *  MOQT Groups/Subgroups already provide the sequencing and
      boundaries needed for turns and inference steps.

   *  MOQT delivery timeouts and priorities operate at the Object level,
      which is the right granularity for inference delivery.

   *  Standard MOQT relays can handle live agent traffic without
      modification or frame parsing.

   *  Reusing the object model means existing MOQT tooling (monitoring,
      debugging, relay management) works unchanged.

B.2.  Why Group = Turn

   Alternatives considered:

   *  *Group = entire session*: Loses the ability to discard stale turns
      and prevents Group-level priority ordering.

   *  *Group = single inference step*: Too fine-grained; creates
      excessive Group metadata overhead and prevents turn-level
      operations.

   *  *Group = time window (e.g., 1 second)*: Arbitrary boundary that
      doesn't align with application semantics; complicates barge-in.

   Group = Turn provides the natural boundary for:

   *  What to discard when interrupted (the current turn).

   *  What to prioritize (the latest turn).

   *  What to cache for late-joiners (the most recent complete turn).

B.3.  Why Separate Control Track

   Embedding control signals in-band with media or text Objects was
   considered but rejected because:

   *  Control signals require different delivery characteristics from
      media and text (Datagram delivery at the highest priority for
      BARGE_IN, reliable Streams for the other signals).


   *  Relays can apply priority to entire tracks but not to individual
      Objects within a track.

   *  Subscribers may want control-only subscription (e.g., turn status
      for UI state management without receiving media).

Appendix C.  Acknowledgements

   The authors would like to thank the participants of the MoQ working
   group for their contributions to the underlying transport protocol
   that makes this work possible.

Authors' Addresses

   Yanmei Liu
   Alibaba Inc.
   Email: miaoji.lym@alibaba-inc.com

   Additional contact information:

      刘彦梅
      Alibaba Inc.


   Dapeng Liu
   Alibaba Cloud
   Email: max.ldp@alibaba-inc.com

