| Internet-Draft | Live Agent over MoQ | June 2026 |
| Liu & Liu | Expires 31 December 2026 | [Page] |
This document defines a protocol for real-time interactive communication between users and AI agents over Media over QUIC Transport (MOQT). It specifies how streaming inference outputs (ASR transcripts, LLM tokens, TTS audio) map to the MOQT object model, defines a turn-taking control protocol with barge-in support for voice interactions, and establishes track structure conventions for live agent sessions. The protocol operates as an application-layer profile on top of MOQT without modifying transport semantics.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 31 December 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Large Language Models (LLMs) and multimodal AI systems have enabled a new class of interactive applications where users communicate with AI agents in real time through voice and text. These "live agent" interactions share characteristics with both traditional media streaming and conversational protocols, but fit neatly into neither category.¶
The following application scenarios motivate the design of a dedicated protocol profile for live agent interaction over MOQT:¶
Voice AI Assistants: A user speaks naturally to an AI agent and receives spoken responses in real time. The agent performs streaming ASR on user audio, generates a response via LLM, and synthesizes speech (TTS) delivered with sub-second latency. The user may interrupt the agent mid-response (barge-in), requiring immediate cessation of agent output. This demands continuous bidirectional audio streaming, low time-to-first-audio latency, graceful interruption handling, and the ability to deliver partial text results ahead of audio for perceived responsiveness.¶
Real-Time Customer Service Agents: In live commerce or customer support deployments, an AI agent handles simultaneous voice or text interactions with customers, accessing external tools (inventory lookup, order status, payment processing) and relaying structured results alongside natural language responses. This demands reliable delivery of tool results alongside best-effort audio delivery, relay-based fan-out for scaling to thousands of concurrent sessions, and per-session isolation with independent priority and timeout policies.¶
Multimodal Scene-Aware Agents: A user points their device camera at a real-world scene (e.g., a landmark, exhibit, or street sign) while speaking to an AI agent that acts as a digital tour guide. The agent subscribes to the user's audio and video input tracks, performs visual understanding and speech recognition jointly, and publishes spoken narration, text annotations, and contextual information about the scene. This demands concurrent processing of multiple input modalities (audio + video), low-latency multimodal fusion at the agent backend, multiple independent output tracks with heterogeneous delivery requirements (reliable text vs. best-effort audio), and partial reliability where stale video frames or audio segments may be dropped without retransmission.¶
Live agent interactions impose strict latency budgets: users expect sub-second time-to-first-token and time-to-first-audio for the interaction to feel comparable to natural conversational turn-taking. HTTP-based streaming approaches (SSE, WebSocket) operate over TCP, where head-of-line blocking, connection-level flow control, and lack of stream multiplexing make it difficult to meet these latency targets — particularly when multiple output modalities (text, audio, tool results) must be delivered concurrently with independent priority and reliability requirements. Furthermore, these approaches cannot express per-object delivery timeouts or relay-assisted fan-out at the transport level, forcing application-layer workarounds that add complexity and latency. Purpose-built AI inference APIs operate in request-response or unidirectional streaming modes without support for concurrent input processing, turn management, or barge-in.¶
Media over QUIC Transport addresses these limitations at the transport layer: QUIC's stream multiplexing eliminates head-of-line blocking between modalities, MOQT's priority system ensures latency-critical signals (barge-in, user audio) are scheduled first, and delivery timeouts allow stale data to be discarded without blocking fresh output. The relay infrastructure provides scalability without per-connection state at the agent backend. However, MOQT lacks application-layer conventions for mapping AI inference semantics onto these primitives. This document fills that gap.¶
A live agent interaction has the following distinguishing properties:¶
Asymmetric streaming: User input is continuous (audio stream), while agent output is incremental and multi-modal (text tokens, synthesized audio, tool results).¶
Turn-based with interruption: Unlike media broadcast, the interaction follows a dialogue structure where either party can take or yield the floor.¶
Latency-critical incremental delivery: Users perceive agent responsiveness through time-to-first-token and time-to-first-audio, requiring sub-second delivery of partial results.¶
Heterogeneous reliability requirements: Within a single turn, partial ASR transcripts are ephemeral (can be superseded), final transcripts are authoritative, TTS audio is time-bounded, and tool results must be delivered reliably.¶
Media over QUIC Transport [MOQT] provides a publish/subscribe protocol with features well-suited to these requirements: prioritized delivery, partial reliability through delivery timeouts, group-based object organization, and relay infrastructure for scalability. However, MOQT defines no application-layer semantics for mapping inference streams to its object model, nor for managing conversational turn-taking.¶
This document specifies:¶
A mapping of streaming inference outputs to the MOQT object data model (Section 2).¶
A turn control protocol for managing dialogue state and handling barge-in interruptions (Section 3).¶
Track structure conventions and naming for live agent sessions (Section 4).¶
Delivery policies appropriate for each stream type (Section 5).¶
This document defines an application-layer profile that operates on top of MOQT [MOQT] without modifying its transport semantics. The relationship to the MoQ protocol suite is illustrated below:¶
+-------------------------------------------------------------------+
| Application Layer (Live Agent Interaction) |
| |
| Maps conversational structure (turns, steps, frames) onto the |
| MOQT object hierarchy; adds turn-taking control via signals. |
+-------------------------------------------------------------------+
| | |
v v v
+------------------+ +-----------------+ +-------------------+
| MoQ Transport | | LOC Container | | MoQ Secure |
| (MOQT) | | (Audio/Video) | | Objects (E2E) |
| - Object Model | | - Codec framing | | - Encryption |
| - Pub/Sub | | - Timing | | - Authentication |
| - Relay | | | | |
| - Priority | | | | |
| - Delivery | | | | |
+------------------+ +-----------------+ +-------------------+
|
v
+-------------------------------------------------------------------+
| QUIC / WebTransport |
| - Stream multiplexing - Datagram extension |
| - TLS 1.3 encryption - Congestion control |
| - 0-RTT resumption - Flow control |
+-------------------------------------------------------------------+
The following table summarizes how live agent domain concepts map to MOQT primitives:¶
| Domain Concept | MOQT Primitive | Semantics |
|---|---|---|
| Conversation Turn | Group | Atomic dialogue unit; GROUP_ORDER descending prioritizes latest turn |
| Inference Step | Subgroup | A sentence, audio segment, or tool call within a turn |
| Token Batch / Audio Frame | Object | Minimum delivery unit; subject to OBJECT_DELIVERY_TIMEOUT |
| Stream Modality | Track | Independent subscribe, priority, and reliability per modality |
| Barge-in Signal | Datagram | Highest priority (0x00); bypasses head-of-line blocking |
| Turn Control | Control Track | Reliable delivery of state-machine transitions |
This document:¶
USES the MOQT object model (Track, Group, Subgroup, Object) to represent conversational structure.¶
USES MOQT native mechanisms (SUBSCRIBE, priority, delivery timeouts, GROUP_ORDER) for QoS enforcement.¶
MAY USE Secure Objects [SECURE-OBJECTS] for end-to-end encryption of agent output through untrusted relays.¶
DOES NOT define new transport-layer framing or modify MOQT wire format.¶
The protocol is guided by the following architectural principles:¶
Map application semantics directly to the MOQT object hierarchy rather than introducing intermediate framing layers. This ensures MOQT relays can perform correct scheduling, timeout-based discard, and caching without understanding application-layer payload formats.¶
All protocol operations MUST work through unmodified MOQT relays. The relay sees standard Tracks, Groups, Subgroups, and Objects with associated priorities and timeouts. No relay-side payload inspection is required.¶
The protocol explicitly models the user-to-agent asymmetry: user input is continuous and latency-critical for the agent; agent output is incremental, multi-modal, and interruptible. This asymmetry is reflected in priority assignment, timeout configuration, and track structure.¶
Every protocol mechanism is evaluated against its contribution to end-to-end latency: session setup adds no round trips because it reuses the MOQT session, time-critical signals use Datagram delivery, and batching strategies bound flush latency.¶
Not all data within a turn has equal value. The protocol assigns per-track and per-subgroup delivery timeouts that allow the transport to discard stale data (old audio frames, superseded partial transcripts) while guaranteeing delivery of authoritative results (final text, tool outputs).¶
The protocol does not mandate specific codecs, model architectures, or inference pipelines. It defines structural conventions (Group=Turn, Subgroup=inference step) that apply regardless of whether the agent produces text, audio, video, or structured data.¶
This document comprises four logical components:¶
Inference Stream Delivery (Section 2): Defines how streaming outputs from ASR, LLM, and TTS pipelines map to the MOQT object data model. Covers text token batching, audio segmentation, tool result framing, and cross-track synchronization.¶
Turn Control Protocol (Section 3): Defines the conversational state machine, control signal format, barge-in handling, VAD integration, and priority assignment for managing dialogue flow.¶
Track Structure and Naming (Section 4): Defines namespace conventions, standard track names, and catalog integration for live agent sessions.¶
Delivery Policies (Section 5): Specifies per-track timeout configurations, transport selection guidelines (Datagram vs Stream), and relay caching behavior.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The following terms are used in this document:¶
Live Agent Session: A stateful interaction between a user endpoint and an AI agent backend, conducted over one or more MOQT sessions.¶
Turn: A contiguous period during which one party (user or agent) holds the conversational floor. Mapped to a MOQT Group.¶
Inference Stream: A sequence of incremental outputs from an AI model (e.g., LLM tokens, ASR transcripts, TTS audio chunks).¶
Barge-in: An event where the user begins speaking while the agent is still producing output, causing the agent to yield the floor.¶
Partial Result: An intermediate inference output that may be superseded by subsequent results (e.g., partial ASR transcript).¶
Final Result: A definitive inference output that will not be further modified.¶
This protocol is compatible with multiple deployment topologies. Two examples are illustrated below.¶
User Device MoQ Relay Agent Backend
(App/Browser) (Cache/Fan-out) (Omni-LLM)
| | |
|===== QUIC/WebTransport session =============>|
| | |
|--- Audio Track ----->|-------- fwd --------->|
|--- Video Track ----->|-------- fwd --------->|
| | |
|<-- Audio Track ------|<------ publish -------|
|<-- Text Track -------|<------ publish -------|
|<-- Tool Results -----|<------ publish -------|
| | |
|--- Control Signals ->|-------- fwd --------->|
|<-- Control Signals --|<------ publish -------|
The relay forwards user input to the agent and fans out agent output to subscribers. This topology is suited for scenarios requiring:¶
User Device Agent Backend
(App/Browser) (Omni-LLM)
| |
|===== QUIC/WebTransport session =============>|
| |
|--- Audio Track ----------------------------->|
|--- Video Track ----------------------------->|
| |
|<-- Audio Track ------------------------------|
|<-- Text Track -------------------------------|
|<-- Tool Results -----------------------------|
| |
|--- Control Signals ------------------------->|
|<-- Control Signals --------------------------|
The client connects directly to the agent backend. This topology is suited for scenarios requiring:¶
This specification operates correctly under both topologies. The application-layer semantics (track structure, turn control, object model mapping) are identical regardless of whether a MoQ relay is present:¶
With relay: the relay handles subscription management, priority-based scheduling, and delivery timeout enforcement transparently.¶
Without relay: the agent backend itself implements MOQT session handling. Priority and timeout semantics still apply to the QUIC streams between client and agent.¶
Deployments MAY combine both topologies, for example using direct connections for latency-sensitive single-user sessions while routing multi-subscriber sessions through relays.¶
This section defines how streaming inference outputs map to the MOQT object data model defined in Section 2 of [MOQT].¶
The MOQT object hierarchy consists of Track > Group > Subgroup > Object. This document assigns conversational semantics to each level:¶
| MOQT Level | Live Agent Semantic | Rationale |
|---|---|---|
| Track | Stream type (audio, text, control) | Independent subscription unit |
| Group | Conversation turn | Atomic unit of dialogue; enables turn-level operations |
| Subgroup | Inference step within a turn | Logical segment: a sentence, an audio segment, a tool call |
| Object | Atomic delivery unit | Smallest independently decodable/renderable item |
This mapping enables:¶
The agent text output track carries streaming LLM token output.¶
Each Object in the text output track carries a token batch: one or more sequential tokens generated within a single flush interval.¶
Text Output Object Payload:

+--------+--------+-------------------------------------------+
| Field  | Type   | Description                               |
+--------+--------+-------------------------------------------+
| flags  | uint8  | 0x01=partial, 0x02=final, 0x04=cancelled  |
| seq    | varint | Sequence number within subgroup           |
| count  | varint | Number of tokens in this batch            |
| tokens | UTF-8  | Concatenated token text                   |
+--------+--------+-------------------------------------------+
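The payload layout above can be sketched as a small encoder and decoder. This is an illustration only and makes two assumptions that the table does not state: that "varint" means the QUIC variable-length integer of RFC 9000, Section 16, and that the tokens field runs to the end of the Object payload because no length field is shown. The names `TextObject`, `encode_varint`, `decode_varint`, and `decode_text_object` are hypothetical.

```python
from dataclasses import dataclass

FLAG_PARTIAL, FLAG_FINAL, FLAG_CANCELLED = 0x01, 0x02, 0x04


def encode_varint(value: int) -> bytes:
    """Encode as a QUIC variable-length integer (assumed varint format)."""
    for prefix, length in enumerate((1, 2, 4, 8)):
        if 0 <= value < 1 << (8 * length - 2):
            encoded = value.to_bytes(length, "big")
            # The two most significant bits carry the length prefix.
            return bytes([encoded[0] | (prefix << 6)]) + encoded[1:]
    raise ValueError("value out of range for a varint")


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at buf[pos]."""
    length = 1 << (buf[pos] >> 6)
    if pos + length > len(buf):
        raise ValueError("truncated varint")
    raw = int.from_bytes(buf[pos:pos + length], "big")
    return raw & ((1 << (8 * length - 2)) - 1), pos + length


@dataclass
class TextObject:
    flags: int
    seq: int
    tokens: list[str]

    def encode(self) -> bytes:
        text = "".join(self.tokens).encode("utf-8")
        return (bytes([self.flags]) + encode_varint(self.seq)
                + encode_varint(len(self.tokens)) + text)


def decode_text_object(payload: bytes) -> tuple[int, int, int, str]:
    """Return (flags, seq, count, text); text is assumed to fill the rest."""
    flags = payload[0]
    seq, pos = decode_varint(payload, 1)
    count, pos = decode_varint(payload, pos)
    return flags, seq, count, payload[pos:].decode("utf-8")
```

Because the tokens are concatenated, a receiver recovers only the token count, not the individual token boundaries.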
Implementations SHOULD batch tokens to amortize per-object overhead. The following strategies are RECOMMENDED:¶
Time-based: Flush every 50ms, collecting all tokens generated in that interval into a single Object.¶
Size-based: Flush when accumulated token text reaches 128 bytes.¶
Semantic-based: Flush at sentence boundaries or punctuation marks.¶
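As a rough illustration, the three strategies can be combined in one batcher that flushes on whichever condition is met first. Combining them is a design choice of this sketch, not something the text requires. The class name `TokenBatcher`, the sentence-ending character set, and the injectable `clock` parameter are assumptions made for the example.

```python
import time

SENTENCE_END = (".", "!", "?")  # assumed punctuation set for semantic flush


class TokenBatcher:
    def __init__(self, max_delay_s=0.050, max_bytes=128, clock=time.monotonic):
        self.max_delay_s = max_delay_s
        self.max_bytes = max_bytes
        self.clock = clock
        self._tokens = []
        self._size = 0
        self._started = None  # time the first buffered token arrived

    def add(self, token: str):
        """Buffer a token; return the batch if a flush is due, else None."""
        now = self.clock()
        if self._started is None:
            self._started = now
        self._tokens.append(token)
        self._size += len(token.encode("utf-8"))
        if (self._size >= self.max_bytes                      # size-based
                or now - self._started >= self.max_delay_s    # time-based
                or token.rstrip().endswith(SENTENCE_END)):    # semantic-based
            return self.flush()
        return None

    def flush(self):
        """Return all buffered tokens and reset the batcher."""
        batch, self._tokens = self._tokens, []
        self._size, self._started = 0, None
        return batch
```

The time check here runs only when a token arrives, so a real sender would also need a timer that calls `flush()` when the model stalls for longer than the flush interval.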
An implementation MUST flush immediately when:¶
Within a Subgroup (inference step), Objects are delivered incrementally:¶
Objects with flags=0x01 (partial) represent tokens generated so far. A subscriber MAY render them immediately for real-time display but MUST be prepared for them to be superseded.¶
An Object with flags=0x02 (final) indicates the inference step is complete. The subscriber SHOULD treat the concatenation of all Objects in the Subgroup as the definitive output.¶
An Object with flags=0x04 (cancelled) indicates the inference step was interrupted (e.g., by barge-in). The subscriber SHOULD discard or visually mark the incomplete output.¶
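A subscriber-side view of one Subgroup might track these flags as follows. This sketch assumes that a later Object with the same seq replaces an earlier one (one possible reading of "superseded") and that Objects arriving after a final or cancelled Object are ignored; neither rule is stated in the text. `SubgroupView` is a hypothetical name.

```python
FLAG_PARTIAL, FLAG_FINAL, FLAG_CANCELLED = 0x01, 0x02, 0x04


class SubgroupView:
    """Accumulates the text of one inference step as Objects arrive."""

    def __init__(self):
        self.parts = {}      # seq -> text; same seq overwrites (assumption)
        self.state = "open"  # "open", "final", or "cancelled"

    def on_object(self, flags: int, seq: int, text: str) -> None:
        if self.state != "open":
            return  # step already closed; ignore stragglers (assumption)
        self.parts[seq] = text
        if flags & FLAG_CANCELLED:
            self.state = "cancelled"
        elif flags & FLAG_FINAL:
            self.state = "final"

    def text(self) -> str:
        """Concatenate the Objects received so far in seq order."""
        return "".join(self.parts[s] for s in sorted(self.parts))
```

A renderer could display `text()` while the state is "open", treat it as definitive once the state is "final", and discard or mark it when the state is "cancelled".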
A new Group is created when:¶
The agent begins responding to a new user turn.¶
The turn counter increments (see Section 3.1).¶
The Group is closed (LARGEST_OBJECT property set) when:¶
The agent audio output track carries TTS-synthesized audio. Audio payloads SHOULD use the LOC container format [LOC] for encoding.¶
Each Object carries one audio segment (typically 20-60ms of audio). The LOC header provides codec identification and timing. This document adds an optional extension header for text alignment:¶
Audio Output Object Payload:

+------------------+--------+--------------------------------------+
| Field            | Type   | Description                          |
+------------------+--------+--------------------------------------+
| loc_header       | LOC    | Standard LOC audio header            |
| audio_data       | bytes  | Encoded audio samples                |
| align_seq (opt)  | varint | Corresponding text object seq number |
| align_offset(opt)| varint | Character offset within text object  |
+------------------+--------+--------------------------------------+
The optional alignment fields enable the subscriber to synchronize text highlighting with audio playback (e.g., karaoke-style display).¶
Each Subgroup in the audio track corresponds to one utterance or sentence boundary in the agent's response. This enables:¶
The agent text track and agent audio track use the same Group ID for the same conversational turn. Within a turn:¶
Text Subgroup N corresponds to Audio Subgroup N (same sentence).¶
The align_seq field in audio Objects references the sequence number of the text Object being spoken at that point in the audio.¶
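One way a subscriber might use the alignment fields is to convert an audio Object's (align_seq, align_offset) pair into a character position within the text of the matching Subgroup. The sketch below assumes the subscriber stores the received text Objects of that Subgroup keyed by their seq value; the function name `highlight_position` is hypothetical.

```python
def highlight_position(text_by_seq: dict[int, str],
                       align_seq: int, align_offset: int) -> int:
    """Return the character offset, within the concatenated Subgroup text,
    that corresponds to the audio currently being played."""
    # Characters in all text Objects that precede the referenced one.
    preceding = sum(len(text) for seq, text in text_by_seq.items()
                    if seq < align_seq)
    return preceding + align_offset
```

For example, with text Objects `{0: "The answer is", 1: " forty-two."}`, an audio Object carrying align_seq=1 and align_offset=6 maps to position 19, so the display would highlight "The answer is forty".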
This enables a subscriber receiving both tracks to:¶
The user publishes a continuous audio input track.¶
Each Object carries a fixed-duration audio frame (typically 20ms) using the LOC container format.¶
The agent MAY publish a tool output track for structured results from tool/function calls.¶
Tool Output Object Payload:

+-----------+--------+------------------------------------------+
| Field     | Type   | Description                              |
+-----------+--------+------------------------------------------+
| flags     | uint8  | 0x01=invocation, 0x02=result, 0x04=error |
| tool_id   | varint | Tool/function identifier                 |
| call_id   | varint | Unique call instance identifier          |
| payload   | bytes  | JSON-encoded tool call or result         |
+-----------+--------+------------------------------------------+
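Tool outputs MUST be delivered reliably (no delivery timeout). Tool invocation and result Objects are always final, never partial; note that in this track's flags field the value 0x02 denotes a result, not the final flag of the text output track. The subscriber MUST NOT discard tool results due to lateness.¶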
This section defines the control protocol for managing conversational turns between the user and agent.¶
A live agent session maintains the following turn states:¶
speech_start
+--------+----------->+---------+
| IDLE | | USER |
| |<-----------+ SPEAKING|
+---+----+ speech_end +----+----+
^ |
| | (agent begins inference)
| v
| +------+------+
| turn_complete | AGENT |
+--------------+ PROCESSING |
| +------+------+
| |
| | (first output produced)
| v
| +------+------+
| turn_complete | AGENT |<---+
+--------------+ SPEAKING | | (output continues)
+------+------+----+
|
barge_in |
+--------+ |
| USER |<------+
|SPEAKING|
+--------+
State transitions:¶
IDLE → USER_SPEAKING: User audio VAD detects speech onset.¶
USER_SPEAKING → AGENT_PROCESSING: User speech ends (silence timeout or explicit end-of-turn signal).¶
AGENT_PROCESSING → AGENT_SPEAKING: Agent produces first output Object in any output track.¶
AGENT_SPEAKING → IDLE: Agent completes response (closes Group in all output tracks).¶
AGENT_SPEAKING → USER_SPEAKING: Barge-in event (user starts speaking while agent is outputting).¶
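The five listed transitions can be captured in a lookup table, as in the following sketch. The event names (`speech_start`, `speech_end`, `first_output`, `turn_complete`, `barge_in`) are labels chosen for this example, and only the transitions in the list above are modeled.

```python
from enum import Enum


class TurnState(Enum):
    IDLE = "IDLE"
    USER_SPEAKING = "USER_SPEAKING"
    AGENT_PROCESSING = "AGENT_PROCESSING"
    AGENT_SPEAKING = "AGENT_SPEAKING"


# (current state, event) -> next state, following the transition list.
TRANSITIONS = {
    (TurnState.IDLE, "speech_start"): TurnState.USER_SPEAKING,
    (TurnState.USER_SPEAKING, "speech_end"): TurnState.AGENT_PROCESSING,
    (TurnState.AGENT_PROCESSING, "first_output"): TurnState.AGENT_SPEAKING,
    (TurnState.AGENT_SPEAKING, "turn_complete"): TurnState.IDLE,
    (TurnState.AGENT_SPEAKING, "barge_in"): TurnState.USER_SPEAKING,
}


def next_state(state: TurnState, event: str) -> TurnState:
    """Apply one event; reject events that are not valid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.name}") from None
```

Keeping the transitions in a table makes it straightforward for both endpoints to reject out-of-order signals in the same way.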
Turn control signals are exchanged on a dedicated bidirectional control track pair (one per direction). Control Objects use the following format:¶
Control Object Payload:

+-----------+--------+------------------------------------------+
| Field     | Type   | Description                              |
+-----------+--------+------------------------------------------+
| signal    | varint | Signal type (see below)                  |
| turn_id   | varint | Current turn Group ID                    |
| timestamp | varint | Sender wall-clock time (ms since epoch)  |
| payload   | bytes  | Signal-specific data (may be empty)      |
+-----------+--------+------------------------------------------+

Signal Types:
  0x01 = SPEECH_START   (user → agent)
  0x02 = SPEECH_END     (user → agent)
  0x03 = BARGE_IN       (user → agent)
  0x04 = TURN_STARTED   (agent → user)
  0x05 = TURN_COMPLETE  (agent → user)
  0x06 = INTERRUPT_ACK  (agent → user)
  0x07 = THINKING       (agent → user)
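A Control Object could be serialized along the following lines. As with any sketch of this format, it assumes that "varint" is the QUIC variable-length integer (RFC 9000, Section 16) and that the payload occupies the remainder of the Object; `ControlObject` and the helper functions are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Signal(IntEnum):
    SPEECH_START = 0x01
    SPEECH_END = 0x02
    BARGE_IN = 0x03
    TURN_STARTED = 0x04
    TURN_COMPLETE = 0x05
    INTERRUPT_ACK = 0x06
    THINKING = 0x07


def encode_varint(value: int) -> bytes:
    """Encode as a QUIC variable-length integer (assumed varint format)."""
    for prefix, length in enumerate((1, 2, 4, 8)):
        if 0 <= value < 1 << (8 * length - 2):
            encoded = value.to_bytes(length, "big")
            return bytes([encoded[0] | (prefix << 6)]) + encoded[1:]
    raise ValueError("value out of range for a varint")


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at buf[pos]."""
    length = 1 << (buf[pos] >> 6)
    raw = int.from_bytes(buf[pos:pos + length], "big")
    return raw & ((1 << (8 * length - 2)) - 1), pos + length


@dataclass
class ControlObject:
    signal: Signal
    turn_id: int
    timestamp_ms: int
    payload: bytes = b""

    def encode(self) -> bytes:
        return (encode_varint(self.signal) + encode_varint(self.turn_id)
                + encode_varint(self.timestamp_ms) + self.payload)

    @classmethod
    def decode(cls, data: bytes) -> "ControlObject":
        signal, pos = decode_varint(data, 0)
        turn_id, pos = decode_varint(data, pos)
        timestamp_ms, pos = decode_varint(data, pos)
        return cls(Signal(signal), turn_id, timestamp_ms, data[pos:])
```

Under these assumptions a BARGE_IN with an empty payload and a millisecond wall-clock timestamp encodes to 10 bytes, which fits comfortably in a single datagram.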
Barge-in is the critical interaction where a user interrupts the agent's ongoing output. The protocol defines the following sequence:¶
User Device Agent Backend
| |
| [user starts speaking over agent output] |
| |
|-- Control: BARGE_IN (turn_id=N) ---------> |
| (via Datagram, highest priority) |
| |
| [agent stops TTS, notes position] |
| |
|<- Control: INTERRUPT_ACK (turn_id=N) ----- |
| payload: {interrupted_group: N, |
| interrupted_subgroup: M, |
| interrupted_object: K} |
| |
| [agent closes Group N with cancelled flag]|
| |
|<- Text Object: flags=cancelled ----------- |
|<- Audio: subgroup FIN -------------------- |
| |
| [agent begins processing new user input] |
| |
The BARGE_IN signal has the following delivery requirements:¶
MUST be sent via MOQT Datagram for minimum latency.¶
MUST be assigned the highest publisher priority (0x00).¶
SHOULD be sent immediately upon local VAD detection, without waiting for speech_end.¶
The agent MUST process BARGE_IN within one processing cycle (target: < 50ms from receipt to output cessation).¶
Upon receiving BARGE_IN, the agent MUST:¶
Cease generating new output Objects for the current turn.¶
Close the current output Group with a cancelled Object (flags=0x04 in text track, stream FIN in audio track).¶
Send INTERRUPT_ACK with the position where output stopped.¶
Transition to processing the new user input.¶
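The four required steps might look like this on the agent side. The sketch replaces real MOQT publishing with an in-memory `outbox` list, and it ignores a BARGE_IN whose turn_id does not match the turn currently being generated; that stale-signal rule, the `OutputTurn` record, and the dictionary form of the INTERRUPT_ACK payload are all assumptions of the example.

```python
from dataclasses import dataclass, field

FLAG_CANCELLED = 0x04


@dataclass
class OutputTurn:
    """Hypothetical record of where the agent's output currently stands."""
    group: int
    subgroup: int
    last_object: int
    generating: bool = True
    outbox: list = field(default_factory=list)  # stand-in for MOQT publishing


def handle_barge_in(turn: OutputTurn, turn_id: int) -> bool:
    """Apply the required barge-in steps; return True if output was stopped."""
    if turn_id != turn.group or not turn.generating:
        return False  # stale or duplicate signal (assumed handling)
    # Step 1: cease generating new output Objects for the current turn.
    turn.generating = False
    # Step 2: close the Group (cancelled text Object, FIN on the audio stream).
    turn.outbox.append(("output/text", {"flags": FLAG_CANCELLED}))
    turn.outbox.append(("output/audio", "FIN"))
    # Step 3: report the position where output stopped.
    turn.outbox.append(("control/agent", {
        "signal": "INTERRUPT_ACK",
        "interrupted_group": turn.group,
        "interrupted_subgroup": turn.subgroup,
        "interrupted_object": turn.last_object,
    }))
    # Step 4 (processing the new user input) is left to the caller.
    return True
```

The order in which the acknowledgement and the cancelled Objects are emitted is a choice made here for readability.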
The agent SHOULD NOT:¶
Speech activity detection events drive the turn state machine. This document does not mandate a specific detection algorithm (traditional energy-based VAD, neural VAD, or other approaches) but defines the signaling semantics:¶
SPEECH_START: Published when the implementation determines that the user has begun speaking.¶
SPEECH_END: Published when the implementation determines that the user has finished speaking.¶
BARGE_IN: Published when SPEECH_START occurs during AGENT_SPEAKING state. This is a composite signal (implies SPEECH_START + interrupt request).¶
VAD signals are sent on the user→agent control track. Implementations MAY perform VAD on the client, on the relay, or on the agent backend. When VAD is performed on the client, the resulting signals SHOULD be sent as Datagrams for lowest latency.¶
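The mapping from detector events to control signals can be sketched as a small function. The event labels "onset" and "offset" and the string form of the turn state are assumptions of this example; the signal values are those of the control signal types defined in this document.

```python
SPEECH_START, SPEECH_END, BARGE_IN = 0x01, 0x02, 0x03


def signal_for_vad_event(event: str, turn_state: str) -> int:
    """Choose the control signal to publish for a speech detector event."""
    if event == "onset":
        # Speech onset while the agent is speaking is a barge-in, which
        # already implies SPEECH_START, so only one signal is published.
        return BARGE_IN if turn_state == "AGENT_SPEAKING" else SPEECH_START
    if event == "offset":
        return SPEECH_END
    raise ValueError(f"unknown VAD event: {event!r}")
```

Because BARGE_IN is composite, the sender does not need to publish a separate SPEECH_START alongside it.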
The following priority assignments are RECOMMENDED for live agent sessions (lower numeric value = higher priority):¶
| Track/Signal | Priority | Rationale |
|---|---|---|
| Control signals (BARGE_IN) | 0x00 | Must preempt all other traffic |
| Control signals (other) | 0x01 | Turn management is time-critical |
| User audio input | 0x02 | Agent cannot process without input |
| Agent audio output | 0x03 | Primary user-perceived output |
| Agent text output | 0x04 | Secondary output (lower bandwidth) |
| Tool results | 0x05 | Non-time-critical structured data |
Within agent output tracks, GROUP_ORDER SHOULD be set to descending (deliver newest group first) so that relays under congestion drop stale turns rather than current ones.¶
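To illustrate how these recommendations interact, the sketch below orders a set of pending items by the recommended priority and then by descending Group ID. It is a simplified model: real MOQT scheduling also takes subscriber priority into account, and the dictionary keys are labels invented for this example.

```python
# Recommended priorities (lower value = higher priority).
PRIORITY = {
    "control/barge_in": 0x00,
    "control/other": 0x01,
    "input/audio": 0x02,
    "output/audio": 0x03,
    "output/text": 0x04,
    "output/tool": 0x05,
}


def send_order(pending: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Order (traffic_class, group_id) pairs: priority first, newest group next."""
    return sorted(pending, key=lambda item: (PRIORITY[item[0]], -item[1]))
```

With this ordering, text from the newest turn is sent before leftover text from an older turn, so whatever is shed under congestion belongs to the stale turn.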
A live agent session uses the following namespace structure:¶
Track Namespace: moqt://{authority}/agent/{session-id}/
Where:¶
The following track names are defined within a session namespace:¶
| Track Name | Direction | Content |
|---|---|---|
| input/audio | User → Agent | User microphone audio (LOC) |
| input/text | User → Agent | User text messages |
| output/audio | Agent → User | TTS synthesized audio (LOC) |
| output/text | Agent → User | Streaming LLM text tokens |
| output/tool | Agent → User | Tool invocations and results |
| control/user | User → Agent | User control signals |
| control/agent | Agent → User | Agent control signals |
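The namespace template and the track table can be combined into a small helper that lists what each endpoint publishes. This is a sketch: the function names are hypothetical, and the example authority and session identifier are placeholders.

```python
# Publisher role for each standard track name, taken from the table above.
TRACK_PUBLISHER = {
    "input/audio": "user",
    "input/text": "user",
    "output/audio": "agent",
    "output/text": "agent",
    "output/tool": "agent",
    "control/user": "user",
    "control/agent": "agent",
}


def session_namespace(authority: str, session_id: str) -> str:
    """Fill in the namespace template for one live agent session."""
    return f"moqt://{authority}/agent/{session_id}/"


def published_by(role: str) -> list[str]:
    """Track names the given role ("user" or "agent") publishes."""
    return sorted(name for name, pub in TRACK_PUBLISHER.items() if pub == role)


def subscribed_by(role: str) -> list[str]:
    """Track names the given role receives, i.e. those the peer publishes."""
    return published_by("agent" if role == "user" else "user")
```

For instance, `published_by("agent")` yields the agent control track plus the three output tracks, which is also the set a user endpoint would subscribe to.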
Additional tracks MAY be defined for:¶
A live agent session SHOULD publish a catalog track conforming to the MOQT Streaming Format [MSF]. The catalog declares:¶
Available tracks and their codec parameters.¶
Agent capabilities (supported input modalities, languages).¶
Session metadata (model identifier, context window size).¶
The catalog enables late-joining subscribers and relay-assisted discovery of session characteristics.¶
| Track | Default Transport | Fallback | Condition |
|---|---|---|---|
| control/* (BARGE_IN) | Datagram | Stream | Signal < MTU |
| control/* (other) | Stream | — | Reliable delivery needed |
| input/audio | Stream | Datagram | If partial reliability desired |
| output/audio | Stream | Datagram | For loss-tolerant low-latency |
| output/text | Stream | — | Must be reliable |
| output/tool | Stream | — | Must be reliable |
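The table above can be read as a decision function. In this sketch the 1200-byte datagram limit is an assumed conservative value, not one taken from this document, and the parameter names are hypothetical; only the tracks listed in the table are handled.

```python
def choose_transport(track: str, *, is_barge_in: bool = False, size: int = 0,
                     max_datagram_size: int = 1200,
                     loss_tolerant: bool = False) -> str:
    """Return "datagram" or "stream" for an Object on the given track."""
    if track.startswith("control/"):
        # BARGE_IN uses a datagram when it fits; other signals need reliability.
        if is_barge_in and size < max_datagram_size:
            return "datagram"
        return "stream"
    if track in ("input/audio", "output/audio"):
        # Audio defaults to streams; datagrams are the loss-tolerant fallback.
        return "datagram" if loss_tolerant else "stream"
    if track in ("output/text", "output/tool"):
        return "stream"  # must be reliable
    raise ValueError(f"no transport guidance for track {track!r}")
```

A caller would pass the encoded size of the control Object so that an oversized BARGE_IN falls back to a stream rather than being dropped.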
This protocol is designed to operate through standard MOQT relays without relay modification. Relays treat live agent traffic as normal MOQT objects with the following beneficial behaviors:¶
Priority-based scheduling: Relays respect publisher priority, ensuring control signals and user audio are forwarded first under congestion.¶
Timeout-based expiry: Relays discard Objects that exceed their delivery timeout, preventing stale audio from consuming bandwidth.¶
Group-order delivery: With descending group order, relays under congestion naturally shed older turns.¶
Relays MAY cache agent output Objects for the duration specified by the MAX_CACHE_DURATION track property. This enables:¶
Late-joining clients to receive the current turn's output.¶
Reconnecting clients to resume from where they left off.¶
Relays SHOULD NOT cache:¶
A single agent session MAY have multiple subscribers to output tracks (e.g., accessibility tools, monitoring, recording). The relay naturally fans out agent output to all subscribers without additional agent-side overhead.¶
For deployments where relay operators are not fully trusted, agent output tracks MAY use end-to-end encryption as defined in [SECURE-OBJECTS]. Control tracks SHOULD NOT be E2E encrypted as relay-level inspection may be needed for priority enforcement.¶
User audio MUST NOT be cached by relays beyond the immediate delivery requirement.¶
Session IDs MUST be cryptographically random (UUIDv7 with random component) to prevent session correlation attacks.¶
Control signals (VAD events, barge-in) leak interaction timing metadata. Implementations MAY add padding to control track Objects to mitigate traffic analysis.¶
Barge-in signals are high-priority and processed immediately. Implementations MUST rate-limit barge-in signals per session (RECOMMENDED: maximum 10 per second) to prevent priority inversion attacks.¶
Relays SHOULD enforce per-session bandwidth quotas to prevent a single agent session from starving other traffic.¶
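The per-session barge-in rate limit could be enforced with a sliding one-second window, as sketched below. The class name `BargeInLimiter` and the injectable `clock` are assumptions for illustration; one instance would be kept per session.

```python
import time
from collections import deque


class BargeInLimiter:
    """Allows at most max_per_second barge-in signals in any 1-second window."""

    def __init__(self, max_per_second: int = 10, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock
        self._accepted = deque()  # timestamps of recently accepted signals

    def allow(self) -> bool:
        now = self.clock()
        # Forget signals that have left the one-second window.
        while self._accepted and now - self._accepted[0] >= 1.0:
            self._accepted.popleft()
        if len(self._accepted) >= self.max_per_second:
            return False  # over the limit: ignore this signal
        self._accepted.append(now)
        return True
```

Signals rejected by `allow()` would simply be ignored, so a flood of BARGE_IN datagrams cannot repeatedly preempt lower-priority traffic.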
This document registers the following track properties in the "MOQT Track Properties" registry:¶
| Property Name | Property ID | Type | Description |
|---|---|---|---|
| AGENT_SESSION_ROLE | TBD | varint | 0=user, 1=agent |
| TURN_GROUP_ORDER | TBD | varint | Confirms Group=Turn mapping |
IANA is requested to create a "Live Agent Control Signal Types" registry under the "Media over QUIC (MoQ)" group. The registration procedure is Specification Required.¶
Initial registrations:¶
| Value | Signal Name | Reference |
|---|---|---|
| 0x01 | SPEECH_START | Section 3.4 |
| 0x02 | SPEECH_END | Section 3.4 |
| 0x03 | BARGE_IN | Section 3.3 |
| 0x04 | TURN_STARTED | Section 3.1 |
| 0x05 | TURN_COMPLETE | Section 3.1 |
| 0x06 | INTERRUPT_ACK | Section 3.3 |
| 0x07 | THINKING | Section 3.2 |
Values 0x08-0xFF are available for assignment.¶
IANA is requested to create a "Live Agent Object Flags" registry.¶
Initial registrations:¶
| Bit | Flag Name | Description | Reference |
|---|---|---|---|
| 0 | PARTIAL | Object is intermediate, may be superseded | This document |
| 1 | FINAL | Object is definitive | This document |
| 2 | CANCELLED | Object indicates interruption | This document |
Time  User Device              Relay            Agent Backend
  |
  |   [User speaks: "What's the weather?"]
  |
 t0   PUBLISH input/audio Group=1 ------>-----> ASR processes
 t0   Control: SPEECH_START ----------->------>
  |
 t1   [User stops speaking]
 t1   Control: SPEECH_END ------------->------>
  |                                             LLM generates response
 t2                             <------<------- Control: TURN_STARTED
 t2                             <------<------- PUBLISH output/text
  |                                             Group=1, Subgroup=0
  |                                             Object 0: "The weather"
  |                                             Object 1: " in Hangzhou"
  |                                             Object 2: " is sunny,"
 t3                             <------<------- PUBLISH output/audio
  |                                             Group=1, Subgroup=0
  |                                             [TTS: "The weather in
  |                                              Hangzhou is sunny,"]
  |
 t4                             <------<------- Object (text, final):
  |                                             " 28°C today."
 t4                             <------<------- Control: TURN_COMPLETE
Time User Device Relay Agent Backend
|
| [Agent is speaking: "The weather forecast shows..."]
| [Agent output: Group=1, currently at Subgroup=2]
|
t0 [User interrupts: "Stop, just tell me temperature"]
t0 Control: BARGE_IN (turn=1) ----->------> received
|
t1 [stops TTS generation]
t1 <------<------- Control: INTERRUPT_ACK
| {interrupted: G=1,SG=2,O=5}
t1 <------<------- Text Object(cancelled)
t1 <------<------- Audio subgroup FIN
|
t2 PUBLISH input/audio Group=2 ---->------> ASR: "just tell me temp"
t2 Control: SPEECH_START ---------->------>
|
t3 Control: SPEECH_END ------------>------>
| LLM: context includes
| interrupted response
t4 <------<------- Control: TURN_STARTED
t4 <------<------- Text Group=2: "It's 28°C."
t4 <------<------- Audio Group=2: [TTS]
t5 <------<------- Control: TURN_COMPLETE
Time  Subscriber View (User Device)
  |
 t0   [Subscribe to output/text AND output/audio, same Group ID]
  |
 t1   Text Object arrives: "The answer is" → render immediately
 t2   Text Object arrives: " forty-two." → append to display
  |
 t3   Audio Object arrives: [TTS "The answer"] → begin playback
  |     Text highlighting: "The answer" underlined (via align_seq)
  |
 t4   Audio Object arrives: [TTS "is forty"] → continue playback
  |     Text highlighting advances: "is forty"
  |
 t5   Audio Object arrives: [TTS "-two."] → finish playback
  |     Text highlighting: "-two."
  |
  |   [Text arrived ~200ms before audio; user saw text first,
  |    then heard it spoken, with synchronized highlighting]
This document maps directly to the native MOQT object model rather than introducing a custom frame layer because:¶
MOQT Groups/Subgroups already provide the sequencing and boundaries needed for turns and inference steps.¶
MOQT delivery timeouts and priorities operate at the Object level, which is the right granularity for inference delivery.¶
Standard MOQT relays can handle live agent traffic without modification or frame parsing.¶
Reusing the object model means existing MOQT tooling (monitoring, debugging, relay management) works unchanged.¶
Alternatives considered:¶
Group = entire session: Loses the ability to discard stale turns and prevents Group-level priority ordering.¶
Group = single inference step: Too fine-grained; creates excessive Group metadata overhead and prevents turn-level operations.¶
Group = time window (e.g., 1 second): Arbitrary boundary that doesn't align with application semantics; complicates barge-in.¶
Group = Turn provides the natural boundary for:¶
What to discard when interrupted (the current turn).¶
What to prioritize (the latest turn).¶
What to cache for late-joiners (the most recent complete turn).¶
Embedding control signals in-band with media or text Objects was considered but rejected because:¶
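Control signals require delivery characteristics that differ from those of media and text Objects (Datagram delivery at the highest priority for BARGE_IN, reliable Stream delivery for all other signals).¶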
Relays can apply priority to entire tracks but not to individual Objects within a track.¶
Subscribers may want control-only subscription (e.g., turn status for UI state management without receiving media).¶
The authors would like to thank the participants of the MoQ working group for their contributions to the underlying transport protocol that makes this work possible.¶