<?xml version="1.0" encoding="utf-8"?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     category="info"
     docName="draft-chen-nmrg-multi-provider-inference-api-01"
     ipr="trust200902"
     submissionType="IRTF"
     tocInclude="true"
     version="3">
  <front>
    <title abbrev="Multi-Provider Inference API">
      Multi-Provider Extensions for Agentic AI Inference APIs
    </title>
    <author fullname="Huamin Chen" initials="H." surname="Chen">
      <organization>Red Hat</organization>
      <address>
        <postal>
          <street></street>
          <city>Boston</city>
          <region>MA</region>
          <code>02210</code>
          <country>USA</country>
        </postal>
        <email>hchen@redhat.com</email>
      </address>
    </author>
    <author fullname="Luay Jalil" initials="L." surname="Jalil">
      <organization>Verizon</organization>
      <address>
        <postal>
          <street></street>
          <city>Richardson</city>
          <region>TX</region>
          <country>USA</country>
        </postal>
        <email>luay.jalil@verizon.com</email>
      </address>
    </author>
    <author fullname="Nabeel Cocker" initials="N." surname="Cocker">
      <organization>Red Hat</organization>
      <address>
        <postal>
          <street></street>
          <city>New York</city>
          <region>NY</region>
          <country>USA</country>
        </postal>
        <email>ncocker@redhat.com</email>
      </address>
    </author>
    <date year="2026" month="March" day="14"/>
    <area>Internet Research Task Force</area>
    <workgroup>Network Management Research Group</workgroup>
    <keyword>AI</keyword>
    <keyword>inference</keyword>
    <keyword>distributed systems</keyword>
    <keyword>inference API</keyword>
    <keyword>multi-provider</keyword>
    <keyword>agentic AI</keyword>
    <keyword>RBAC</keyword>
    <keyword>rate limiting</keyword>
    <abstract>
      <t>
        This document specifies extensions for multi-provider distributed AI
        inference using the widely adopted OpenAI Responses API as the reference
        interface standard. These extensions enable provider diversity, load 
        balancing, failover, and capability negotiation in distributed inference 
        environments while maintaining full backward compatibility with existing 
        implementations. The extensions do not require changes to standard API 
        usage patterns or existing client applications.
      </t>
      <t>
        By treating the OpenAI Responses API as a de facto standard interface 
        (similar to how HTTP serves as a standard protocol), these extensions 
        provide an optional enhancement layer for multi-provider orchestration, 
        intelligent routing, and distributed inference capabilities. The approach 
        preserves the familiar API interface that developers already know and use, 
        while enabling seamless integration across multiple AI inference providers 
        without vendor lock-in.
      </t>
      <t>
        This revision (-01) adds identity-based authorization, role-based
        access control (RBAC), and rate limiting extensions for secure
        multi-tenant deployments.
      </t>
    </abstract>
    <note title="Changes from -00">
      <t>
        Added three subsections to Section 6 (Extension Headers):
        authorization identity headers for JWT/OAuth integration,
        an RBAC framework for tiered model access, and rate limiting
        with requests-per-minute (RPM) and tokens-per-minute (TPM) limits.
        Minor updates to Problem Statement,
        Design Principles, Security Considerations, and IANA
        Considerations to reflect the new capabilities.
      </t>
    </note>
  </front>

  <middle>
    <section title="Introduction" numbered="true">
      <t>
        The OpenAI Responses API <xref target="OPENAI-RESPONSES-API"/> has emerged 
        as a de facto standard interface for agentic AI applications, with widespread 
        adoption across the industry. Many providers now offer compatible endpoints, 
        creating a rich ecosystem of inference services. This document treats the 
        OpenAI Responses API as a reference standard interface (analogous to how 
        HTTP serves as a standard protocol), rather than as a vendor-specific 
        implementation. However, applications that want to leverage multiple 
        providers face significant challenges in orchestrating distributed inference, 
        handling provider failures, and optimizing resource utilization across 
        heterogeneous environments.
      </t>
      <t>
        This document specifies vendor-neutral extensions that enable multi-provider 
        AI inference orchestration while maintaining the familiar API interface. 
        The extensions allow applications to leverage the best models and tools 
        from multiple providers without vendor lock-in. The approach uses "auto" 
        parameters and extension headers to enable intelligent provider selection, 
        capability mapping, and distributed inference coordination. The extensions 
        are designed as optional HTTP headers and response fields that enhance 
        the reference API with multi-provider capabilities while ensuring that 
        existing applications continue to work unchanged.
      </t>
      <t>
        The key principle is compatibility-first: any application that works 
        with the reference API interface will continue to work with these 
        extensions, while applications that choose to use the extensions gain 
        access to advanced multi-provider features like intelligent routing, 
        automatic failover, and distributed load balancing.
      </t>
      <t>
        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described in BCP
        14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they appear in all
        capitals, as shown here.
      </t>
    </section>

    <section title="Conventions and Terminology" numbered="true">
      <t>OpenAI Responses API: The reference API specification for agentic AI inference <xref target="OPENAI-RESPONSES-API"/>, widely adopted across the industry as a de facto standard.</t>
      <t>Multi-Provider Router: A service that extends the reference API with multi-provider orchestration capabilities while maintaining full compatibility.</t>
      <t>Provider Pool: A collection of compatible inference services that can be orchestrated by the multi-provider router.</t>
      <t>Multi-Vendor Compatibility: The ability to seamlessly integrate and route requests across multiple AI inference providers while maintaining a consistent interface.</t>
      <t>Extension Headers: Optional HTTP header fields that provide multi-provider functionality without affecting standard API behavior.</t>
      <t>Distributed Inference: The orchestration of AI inference requests across multiple providers to achieve better performance, reliability, and resource utilization.</t>
      <t>Transport Protocol: All API endpoints support both HTTP and HTTPS protocols. HTTPS SHOULD be used for production deployments to ensure confidentiality and integrity of inference requests and responses.</t>
    </section>

    <section title="Problem Statement" numbered="true">
      <t>
        While the OpenAI Responses API provides a widely used standard interface
        for AI inference, several challenges arise when deploying at scale
        across multiple providers:
      </t>
      <t>
        1. <strong>Provider Lock-in:</strong> Applications typically connect to a single 
        provider, creating dependency on that provider's availability, pricing, 
        and capabilities.
      </t>
      <t>
        2. <strong>Limited Failover:</strong> When a provider experiences issues,
        applications have no automatic mechanism to fail over to alternative
        providers while maintaining session continuity.
      </t>
      <t>
        3. <strong>Suboptimal Resource Utilization:</strong> Different providers excel 
        in different scenarios (cost, latency, specialized models), but 
        applications cannot easily leverage these strengths dynamically.
      </t>
      <t>
        4. <strong>Operational Complexity:</strong> Managing multiple provider 
        connections, API keys, and routing logic adds significant complexity 
        to application development and operations.
      </t>
      <t>
        5. <strong>Inconsistent Capabilities:</strong> Even when providers offer
        OpenAI-compatible APIs, they may use different model names and have
        different capabilities and limitations that applications must handle manually.
      </t>
      <t>
        6. <strong>Multi-Tenancy Requirements:</strong> Production deployments 
        require user authentication, authorization, and usage governance 
        across multiple tenants with different access levels and rate limits.
      </t>
      <t>
        These extensions address these challenges while preserving the
        simplicity and familiarity of the OpenAI Responses API that developers rely on.
      </t>
    </section>

    <section title="Design Principles" numbered="true">
      <t>
        The extensions are designed according to the following principles:
      </t>
      <t>
        1. <strong>Multi-Vendor Support:</strong> Enable seamless integration 
        across multiple AI inference providers without vendor lock-in. 
        Applications can leverage the best capabilities from different 
        providers within a unified interface.
      </t>
      <t>
        2. <strong>Opt-in Enhancement:</strong> Multi-provider features are enabled 
        only when clients explicitly request them through extension headers. 
        Default behavior remains unchanged.
      </t>
      <t>
        3. <strong>Transparent Operation:</strong> When multi-provider features are 
        enabled, the complexity of provider orchestration is hidden from the 
        client. Responses retain the standard Responses API format.
      </t>
      <t>
        4. <strong>Graceful Degradation:</strong> If multi-provider features are 
        unavailable or fail, the system falls back to standard single-provider 
        behavior.
      </t>
      <t>
        5. <strong>Standard Compliance:</strong> All extensions use standard HTTP 
        mechanisms and do not require proprietary protocols or non-standard 
        API modifications.
      </t>
      <t>
        6. <strong>Security and Multi-Tenancy:</strong> Authentication and 
        authorization integrate with standard identity frameworks (JWT, OAuth) 
        without requiring new authentication mechanisms.
      </t>
    </section>

    <section title="Auto-Selection Parameters for Vendor Neutrality" numbered="true">
      <t>
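        Several OpenAI Responses API parameters benefit from vendor-neutral
        "auto" values that enable seamless multi-provider orchestration. Apart
        from tool_choice, for which "auto" is already a standard value, these
        "auto" values are extensions defined by this document. The following
        parameters are enhanced with auto-selection capabilities: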
      </t>
      
      <section title="Vendor-Neutral Parameter Mapping" numbered="false">
        <t>
          Key Responses API parameters that require provider-specific mapping:
        </t>
        
        <table anchor="vendor-neutral-params">
          <name>Vendor-Neutral Parameters</name>
          <thead>
            <tr>
              <th>Parameter</th>
              <th>Auto Value</th>
              <th>Router Behavior</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>model</td>
              <td>"auto"</td>
              <td>Maps to optimal provider-specific model based on task</td>
            </tr>
            <tr>
              <td>tools</td>
              <td>"auto"</td>
              <td>Selects appropriate tools from provider's available toolkit</td>
            </tr>
            <tr>
              <td>tool_choice</td>
              <td>"auto"</td>
              <td>Lets provider decide when to use tools based on context</td>
            </tr>
            <tr>
              <td>reasoning</td>
              <td>"auto"</td>
              <td>Maps to reasoning-capable models or simulates reasoning for multi-vendor compatibility</td>
            </tr>
            <tr>
              <td>max_output_tokens</td>
              <td>"auto"</td>
              <td>Calculates optimal token limit based on task complexity</td>
            </tr>
            <tr>
              <td>text.format</td>
              <td>provider-adaptive</td>
              <td>Adapts format requirements to provider capabilities</td>
            </tr>
          </tbody>
        </table>
      </section>
      
      <section title="Auto-Model Selection Criteria" numbered="false">
        <t>
          When model is set to "auto", the router uses these criteria for selection:
        </t>
        <t>
          1. <strong>Task Classification:</strong> Analyzes the request to determine 
          task type (reasoning, coding, creative, analytical, etc.)
        </t>
        <t>
          2. <strong>Provider Capabilities:</strong> Matches task requirements to 
          provider strengths and available models
        </t>
        <t>
          3. <strong>Performance Requirements:</strong> Considers latency, cost, 
          and quality constraints from extension headers
        </t>
        <t>
          4. <strong>Context Awareness:</strong> Maintains conversation context 
          and provider affinity when beneficial
        </t>
        <t>
          5. <strong>Authorization Context:</strong> Considers user identity and 
          role-based access policies when selecting models and providers
        </t>
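        <t>
          These criteria can be read as a filter-then-rank procedure. The
          following Python fragment is an illustrative sketch, not a normative
          algorithm. The Candidate fields, the per-candidate quality, latency,
          and cost scores, and the function name select_model are hypothetical.
          The sketch also assumes the router already knows the task class and
          which models the user is authorized to use. Authorization and hard
          constraints filter the candidate set first. The routing strategy then
          ranks only the remaining candidates, and provider affinity breaks ties.
        </t>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    # Hypothetical per-model metadata a router might keep for each provider.
    provider: str
    model: str
    task_types: frozenset  # task classes the model is considered suited for
    quality: float         # deployment-assigned score, 0.0-1.0
    latency_ms: int        # expected latency
    cost_usd: float        # expected cost per request


def select_model(candidates: Iterable[Candidate], task: str,
                 allowed_models: set, strategy: str = "quality",
                 min_quality: float = 0.0,
                 max_latency_ms: Optional[int] = None,
                 cost_limit_usd: Optional[float] = None,
                 affinity_provider: Optional[str] = None
                 ) -> Optional[Candidate]:
    eligible = [
        c for c in candidates
        if c.model in allowed_models            # criterion 5: authorization
        and task in c.task_types                # criteria 1-2: task and capability
        and c.quality >= min_quality            # criterion 3: hard constraints
        and (max_latency_ms is None or c.latency_ms <= max_latency_ms)
        and (cost_limit_usd is None or c.cost_usd <= cost_limit_usd)
    ]
    if not eligible:
        return None  # caller decides: error, failover, or default provider
    metric = {
        "cost": lambda c: c.cost_usd,
        "latency": lambda c: c.latency_ms,
    }.get(strategy, lambda c: -c.quality)  # "quality"/"capability-first" (assumed)
    # Criterion 4: among equally ranked candidates, prefer the affine provider.
    return min(eligible, key=lambda c: (metric(c), c.provider != affinity_provider))
]]></sourcecode>
        <t>
          Because constraints are applied before ranking, a cost strategy
          combined with a high quality threshold selects the cheapest model
          that still meets the threshold.
        </t>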
      </section>
      
      <section title="Auto-Tool Selection Framework" numbered="false">
        <t>
          When tools is set to "auto", the router implements intelligent tool selection:
        </t>
        <t>
          1. <strong>Tool Category Mapping:</strong> Maps generic tool categories 
          (web-search, code-execution, image-generation) to provider-specific tools
        </t>
        <t>
          2. <strong>Capability Discovery:</strong> Dynamically discovers available 
          tools from each provider and their capabilities
        </t>
        <t>
          3. <strong>Context-Aware Selection:</strong> Chooses tools based on 
          conversation context and task requirements
        </t>
        <t>
          4. <strong>Cross-Provider Orchestration:</strong> Coordinates tool usage 
          across multiple providers when beneficial
        </t>
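        <t>
          A minimal sketch of tool category mapping follows. It assumes a
          static, hypothetical catalog whose provider IDs and tool names are
          placeholders. A real router would populate the catalog through
          capability discovery. The sketch returns the mapped tools, which a
          router could report in X-AI-Tool-Mapping. It also returns the
          categories that the chosen provider cannot serve, which are
          candidates for cross-provider orchestration.
        </t>
        <sourcecode type="python"><![CDATA[
from typing import Dict, List, Tuple

# Hypothetical catalog: provider ID -> generic tool category -> provider tool name.
TOOL_CATALOG: Dict[str, Dict[str, str]] = {
    "provider-a": {"web-search": "web_search", "code-execution": "code_interpreter"},
    "provider-b": {"web-search": "search", "image-generation": "image_gen"},
}


def map_tools(provider: str, categories: List[str],
              catalog: Dict[str, Dict[str, str]] = TOOL_CATALOG
              ) -> Tuple[Dict[str, str], List[str]]:
    """Map generic categories to one provider's tools; also return the gaps."""
    available = catalog.get(provider, {})
    mapping = {cat: available[cat] for cat in categories if cat in available}
    unsupported = [cat for cat in categories if cat not in available]
    return mapping, unsupported


def best_provider_for(categories: List[str],
                      catalog: Dict[str, Dict[str, str]] = TOOL_CATALOG) -> str:
    """Provider covering the most requested categories (first wins on ties)."""
    return max(catalog, key=lambda p: len(map_tools(p, categories, catalog)[0]))
]]></sourcecode>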
      </section>
      
      <section title="Auto-Reasoning Capability Mapping" numbered="false">
        <t>
          When reasoning is set to "auto", the router intelligently handles 
          providers with different reasoning capabilities:
        </t>
        <t>
          1. <strong>Native Reasoning Models:</strong> Routes to providers with 
          dedicated reasoning models (o1, o1-mini, etc.)
        </t>
        <t>
          2. <strong>Reasoning-Enhanced Models:</strong> Uses models optimized 
          for logical thinking and step-by-step analysis
        </t>
        <t>
          3. <strong>Simulated Reasoning:</strong> For providers without native 
          reasoning, implements reasoning through structured prompting and 
          chain-of-thought techniques
        </t>
        <t>
          4. <strong>Fallback Strategies:</strong> Gracefully degrades to 
          best-available reasoning approximation when native reasoning 
          is unavailable
        </t>
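        <t>
          The fallback strategy amounts to walking an ordered list of reasoning
          approaches. The sketch below assumes a hypothetical mapping from each
          approach to the (provider, model) pairs that are currently healthy.
          The approach names match the X-AI-Reasoning-Preference values
          ("native", "enhanced", "simulated"). The chain-of-thought scaffold
          text is a placeholder, not a recommended prompt.
        </t>
        <sourcecode type="python"><![CDATA[
from typing import Dict, List, Optional, Tuple

# Most to least capable; names match X-AI-Reasoning-Preference values.
APPROACHES = ("native", "enhanced", "simulated")


def choose_reasoning(available: Dict[str, List[Tuple[str, str]]],
                     preference: Optional[str] = None
                     ) -> Optional[Tuple[str, str, str]]:
    """Return (approach, provider, model) for the first approach with capacity."""
    order = list(APPROACHES)
    if preference in order:
        # Try the client's preferred approach first, then the default order.
        order.remove(preference)
        order.insert(0, preference)
    for approach in order:
        pairs = available.get(approach)
        if pairs:
            provider, model = pairs[0]
            return approach, provider, model
    return None  # no reasoning-capable provider is available at all


def simulate_reasoning(user_text: str) -> str:
    """Hypothetical structured-prompting wrapper for models without native reasoning."""
    scaffold = ("Solve the problem step by step, showing each step. "
                "Verify the result before stating the final answer.\n\n")
    return scaffold + user_text
]]></sourcecode>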
        
        <figure>
        <artwork><![CDATA[
# Auto-reasoning with mixed providers
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: reasoning
X-AI-Routing-Strategy: capability-first

{
  "model": "auto",
  "input": [
    {
      "role": "user",
      "content": "Solve: If 3x+7=22, what is x?"
    }
  ]
}

# Router selects provider with native reasoning capability
# (X-AI-Auto-Selection is wrapped for readability; on the wire it is
#  a single-line JSON value)
HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: openai-reasoning
X-AI-Model-Mapped: o1-preview
X-AI-Auto-Selection: {
  "reasoning_capability": "native",
  "provider_selection": {
    "primary_choice": {
      "provider": "openai-reasoning",
      "model": "o1-preview",
      "reasoning_type": "native",
      "confidence": 0.98
    },
    "alternatives_considered": [
      {
        "provider": "anthropic",
        "model": "claude-3-5-sonnet",
        "reasoning_type": "enhanced",
        "confidence": 0.85,
        "reason_not_selected": "native_available"
      },
      {
        "provider": "local",
        "model": "llama-3-8b",
        "reasoning_type": "simulated",
        "confidence": 0.65,
        "reason_not_selected": "lower_capability"
      }
    ]
  }
}

{
  "id": "resp-reasoning-001",
  "object": "response",
  "created_at": 1699123456,
  "model": "auto",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "x=5. Reasoning: 3x+7=22->3x=15->x=5"
        }
      ]
    }
  ]
}
        ]]></artwork>
        </figure>
        
        <t><strong>Fallback to Simulated Reasoning Example:</strong></t>
        <figure>
        <artwork><![CDATA[
# Similar request when no native reasoning provider is available:
# the provider pool excludes providers with native reasoning models
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: reasoning
X-AI-Provider-Pool: anthropic,cohere,local-models

{
  "model": "auto",
  "input": [
    {
      "role": "user",
      "content": "Solve: If 3x+7=22, what is x?"
    }
  ]
}

# Router falls back to simulated reasoning
# (X-AI-Auto-Selection is wrapped for readability; on the wire it is
#  a single-line JSON value)
HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: anthropic
X-AI-Model-Mapped: claude-3-5-sonnet
X-AI-Auto-Selection: {
  "reasoning_capability": "simulated",
  "fallback_strategy": {
    "native_reasoning_available": false,
    "selected_approach": "enhanced_chain_of_thought",
    "prompt_enhancement": "added_reasoning_structure",
    "confidence": 0.87
  },
  "reasoning_simulation": {
    "technique": "structured_step_by_step",
    "verification_added": true,
    "explanation_enhanced": true
  }
}

{
  "id": "resp-reasoning-002",
  "object": "response",
  "created_at": 1699123500,
  "model": "auto",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Step-by-step: 3x+7=22->3x=15->x=5"
        }
      ]
    }
  ]
}
        ]]></artwork>
        </figure>
      </section>
      
      <section title="Auto Parameters and Header Synchronization" numbered="false">
        <t>
          Auto parameters in the request body are the primary mechanism for 
          enabling multi-vendor capabilities. Extension headers provide 
          supplementary information to assist the router in making optimal 
          decisions:
        </t>
        <t>
          <strong>Primary Control:</strong> Auto parameters ("model": "auto", 
          "tools": "auto", "reasoning": "auto") trigger multi-vendor selection.
        </t>
        <t>
          <strong>Decision Assistance:</strong> Headers provide hints, constraints, 
          and preferences to guide the auto-selection process.
        </t>
        <t>
          <strong>Synchronization Rules:</strong>
        </t>
        <t>
          1. If a parameter is set to an explicit value rather than "auto"
          but headers suggest multi-provider behavior, the router SHOULD
          honor the explicit parameter value and ignore conflicting headers.
        </t>
        <t>
          2. If a parameter is "auto" but X-AI-Multi-Provider is
          "disabled", the router MUST treat the parameter as a regular
          non-auto value and route the request to a single default provider.
        </t>
        <t>
          3. If a parameter is "auto" and X-AI-Multi-Provider is
          "enabled" (or absent while other multi-provider headers are
          present), the router SHOULD use the headers as decision assistance.
        </t>
        <t>
          4. If headers express competing goals (e.g.,
          X-AI-Routing-Strategy: cost together with X-AI-Quality-Threshold: 0.95),
          the router SHOULD prioritize explicit constraints (the quality
          threshold) over optimization strategies (cost), applying the
          strategy only among candidates that satisfy the constraints.
        </t>
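        <t>
          Rules 1 through 3 can be expressed as a small decision function. The
          sketch below rests on three assumptions. The set of auto-capable
          body parameters is limited to the three named under Primary Control.
          Header names are matched case-insensitively, as HTTP requires. The
          rules do not cover a request that uses "auto" without any extension
          headers, so the sketch assumes auto-selection proceeds with no hints
          in that case. Rule 4 concerns ranking and is not modeled here. The
          function name is hypothetical.
        </t>
        <sourcecode type="python"><![CDATA[
from typing import Dict, List, Tuple

# Body parameters that trigger multi-vendor selection (Primary Control).
AUTO_PARAMS = ("model", "tools", "reasoning")


def resolve_auto_params(body: dict, headers: Dict[str, str]
                        ) -> Tuple[str, List[str], Dict[str, str]]:
    """Return (mode, auto_params, hints); mode is "single" or "multi"."""
    h = {name.lower(): value.strip() for name, value in headers.items()}
    switch = h.get("x-ai-multi-provider", "").lower()
    # Rule 1: explicit (non-"auto") values are never overridden by headers.
    auto = [p for p in AUTO_PARAMS if body.get(p) == "auto"]
    # Rule 2: the master switch wins; nothing to auto-select also means single.
    if not auto or switch == "disabled":
        return "single", [], {}
    # Rule 3: remaining X-AI-* headers act as decision assistance.
    # ASSUMPTION: "auto" with no extension headers also selects "multi".
    hints = {k: v for k, v in h.items()
             if k.startswith("x-ai-") and k != "x-ai-multi-provider"}
    return "multi", auto, hints
]]></sourcecode>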
      </section>
    </section>
    
    <section title="Extension Headers" numbered="true">
      <t>
        The multi-provider extensions are implemented through optional HTTP 
        headers that clients can include in standard OpenAI Responses API requests. 
        These headers provide hints and preferences for auto-selection and 
        multi-provider orchestration.
      </t>

      <section title="Request Headers" numbered="false">
        <t>
          The following headers can be included in requests to enable 
          multi-provider features:
        </t>
        
        <table anchor="request-headers">
          <name>Multi-Provider Assistance Headers</name>
          <thead>
            <tr>
              <th>Header</th>
              <th>Values</th>
              <th>Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>X-AI-Multi-Provider</td>
              <td>enabled | disabled</td>
              <td>Enable multi-provider orchestration (master switch)</td>
            </tr>
            <tr>
              <td>X-AI-Provider-Pool</td>
              <td>CSV of provider IDs</td>
              <td>Constrain auto-selection to specific providers</td>
            </tr>
            <tr>
              <td>X-AI-Routing-Strategy</td>
              <td>cost | latency | quality | capability-first</td>
              <td>Optimization strategy for auto-parameter decisions</td>
            </tr>
            <tr>
              <td>X-AI-Task-Hint</td>
              <td>reasoning | coding | creative | analytical | multimodal</td>
              <td>Task type hint to assist model auto-selection</td>
            </tr>
            <tr>
              <td>X-AI-Tool-Categories</td>
              <td>CSV of tool categories</td>
              <td>Preferred tool categories to assist tools auto-selection</td>
            </tr>
            <tr>
              <td>X-AI-Reasoning-Preference</td>
              <td>native | enhanced | simulated</td>
              <td>Reasoning approach preference to assist reasoning auto-selection</td>
            </tr>
            <tr>
              <td>X-AI-Quality-Threshold</td>
              <td>0.0 - 1.0</td>
              <td>Minimum quality threshold for auto-selected providers</td>
            </tr>
            <tr>
              <td>X-AI-Max-Latency</td>
              <td>milliseconds</td>
              <td>Maximum acceptable latency for auto-selected providers</td>
            </tr>
            <tr>
              <td>X-AI-Cost-Limit</td>
              <td>USD per request</td>
              <td>Maximum cost limit for auto-selected providers</td>
            </tr>
            <tr>
              <td>X-AI-Failover-Policy</td>
              <td>none | automatic | manual</td>
              <td>Failover behavior when auto-selected providers fail</td>
            </tr>
          </tbody>
        </table>
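        <t>
          A router needs to parse and validate these headers before using
          them. The sketch below assumes that a malformed or out-of-range value
          is ignored (treated as absent) rather than rejected with an error.
          That policy, the RoutingHints type, and the function names are
          illustrative choices, not requirements of this document.
        </t>
        <sourcecode type="python"><![CDATA[
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional

STRATEGIES = {"cost", "latency", "quality", "capability-first"}
TASKS = {"reasoning", "coding", "creative", "analytical", "multimodal"}
REASONING = {"native", "enhanced", "simulated"}
FAILOVER = {"none", "automatic", "manual"}


@dataclass
class RoutingHints:
    multi_provider: Optional[str] = None        # "enabled", "disabled", or None
    provider_pool: List[str] = field(default_factory=list)
    tool_categories: List[str] = field(default_factory=list)
    strategy: Optional[str] = None
    task: Optional[str] = None
    reasoning: Optional[str] = None
    failover: Optional[str] = None
    min_quality: Optional[float] = None
    max_latency_ms: Optional[int] = None
    cost_limit_usd: Optional[float] = None


def _csv(value: str) -> List[str]:
    return [item.strip() for item in value.split(",") if item.strip()]


def _enum(value: Optional[str], allowed: set) -> Optional[str]:
    if value is None:
        return None
    value = value.lower()
    return value if value in allowed else None  # unknown value: ignore


def _number(value, kind, lo, hi=None):
    if value is None:
        return None
    try:
        n = kind(value)
    except ValueError:
        return None  # e.g. "abc", or "1.5" for an integer header
    if not math.isfinite(n) or n < lo or (hi is not None and n > hi):
        return None
    return n


def parse_hints(headers: Dict[str, str]) -> RoutingHints:
    h = {k.lower(): v.strip() for k, v in headers.items()}
    return RoutingHints(
        multi_provider=_enum(h.get("x-ai-multi-provider"), {"enabled", "disabled"}),
        provider_pool=_csv(h.get("x-ai-provider-pool", "")),
        tool_categories=_csv(h.get("x-ai-tool-categories", "")),
        strategy=_enum(h.get("x-ai-routing-strategy"), STRATEGIES),
        task=_enum(h.get("x-ai-task-hint"), TASKS),
        reasoning=_enum(h.get("x-ai-reasoning-preference"), REASONING),
        failover=_enum(h.get("x-ai-failover-policy"), FAILOVER),
        min_quality=_number(h.get("x-ai-quality-threshold"), float, 0.0, 1.0),
        max_latency_ms=_number(h.get("x-ai-max-latency"), int, 0),
        cost_limit_usd=_number(h.get("x-ai-cost-limit"), float, 0.0),
    )
]]></sourcecode>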
      </section>

      <section title="Response Headers" numbered="false">
        <t>
          When multi-provider features are active, responses include additional 
          headers providing transparency into the routing decisions:
        </t>
        
        <table anchor="response-headers">
          <name>Auto-Selection Response Headers</name>
          <thead>
            <tr>
              <th>Header</th>
              <th>Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>X-AI-Provider-Used</td>
              <td>ID of the provider selected by auto-routing</td>
            </tr>
            <tr>
              <td>X-AI-Model-Mapped</td>
              <td>Provider-specific model mapped from "auto"</td>
            </tr>
            <tr>
              <td>X-AI-Auto-Selection</td>
              <td>JSON object with auto-selection decisions</td>
            </tr>
            <tr>
              <td>X-AI-Tool-Mapping</td>
              <td>JSON object showing tool category to provider mapping</td>
            </tr>
            <tr>
              <td>X-AI-Auto-Decisions</td>
              <td>JSON object with all auto-parameter resolutions</td>
            </tr>
            <tr>
              <td>X-AI-Alternatives-Considered</td>
              <td>JSON array of alternative providers/models considered</td>
            </tr>
            <tr>
              <td>X-AI-Selection-Confidence</td>
              <td>Confidence score (0.0 - 1.0) for auto-selection</td>
            </tr>
          </tbody>
        </table>
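        <t>
          HTTP field values cannot contain line breaks, so the JSON-valued
          response headers above need single-line serialization. The following
          minimal sketch uses Python's standard json module. The function name
          and the decision to escape non-ASCII characters are illustrative
          choices.
        </t>
        <sourcecode type="python"><![CDATA[
import json
from typing import Any, Dict


def auto_selection_headers(provider: str, model: str,
                           decisions: Dict[str, Any],
                           confidence: float) -> Dict[str, str]:
    """Build auto-selection response headers with single-line JSON values."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within 0.0-1.0")
    return {
        "X-AI-Provider-Used": provider,
        "X-AI-Model-Mapped": model,
        # Without indent, json.dumps emits one line and escapes newlines inside
        # strings; compact separators shorten it, ensure_ascii escapes non-ASCII.
        "X-AI-Auto-Selection": json.dumps(decisions, separators=(",", ":"),
                                          ensure_ascii=True),
        "X-AI-Selection-Confidence": f"{confidence:.2f}",
    }
]]></sourcecode>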
        
        <t><strong>Synchronization Examples:</strong></t>
        
        <t><em>Example 1: Proper Sync (Headers Assist Auto Parameters)</em></t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: reasoning
X-AI-Reasoning-Preference: native
X-AI-Quality-Threshold: 0.9

{
  "model": "auto",
  "reasoning": "auto",
  "input": [{"role": "user", "content": "Solve complex math"}]
}

# Router behavior: Uses auto parameters with header guidance
# Selects native reasoning model with quality >= 0.9
        ]]></artwork>
        </figure>
        
        <t><em>Example 2: Conflict Resolution (Explicit Parameter Wins)</em></t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: reasoning

{
  "model": "gpt-4",
  "reasoning": "auto",
  "input": [{"role": "user", "content": "Solve complex math"}]
}

# Router behavior: Honors explicit "gpt-4" model selection
# Only applies auto-reasoning since reasoning="auto"
# Ignores task hint for model selection
        ]]></artwork>
        </figure>
        
        <t><em>Example 3: Multi-Provider Disabled Override</em></t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: disabled
X-AI-Task-Hint: reasoning

{
  "model": "auto",
  "tools": "auto",
  "input": [{"role": "user", "content": "Help with coding"}]
}

# Router behavior: Treats "auto" as regular values
# Routes to single default provider
# Ignores all multi-provider headers
        ]]></artwork>
        </figure>
      </section>

      <section title="Authorization and Identity Headers" numbered="false">
        <t>
          Multi-provider routers in production deployments often operate behind
          authentication gateways that inject identity information into request
          headers. This section defines optional headers for conveying
          authenticated user identity to enable authorization-aware routing.
        </t>
        
        <table anchor="authz-headers">
          <name>Authorization Identity Headers</name>
          <thead>
            <tr>
              <th>Header</th>
              <th>Values</th>
              <th>Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>X-Authz-User-Id</td>
              <td>User identifier</td>
              <td>Authenticated user ID (e.g., JWT 'sub' claim)</td>
            </tr>
            <tr>
              <td>X-Authz-User-Groups</td>
              <td>CSV of group names</td>
              <td>User's group memberships (e.g., JWT 'groups' claim)</td>
            </tr>
            <tr>
              <td>X-Authz-User-Roles</td>
              <td>CSV of role names</td>
              <td>User's assigned roles</td>
            </tr>
          </tbody>
        </table>
        
        <t>
          These headers follow conventions used by common authentication
          frameworks including JWT <xref target="RFC7519"/>, OAuth 2.0 bearer
          tokens <xref target="RFC6750"/>, and Kubernetes-style RBAC systems. The
          specific header names MAY vary by deployment; the router SHOULD
          support configurable header mappings.
        </t>
        
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer eyJhbGci...
Content-Type: application/json
X-Authz-User-Id: alice
X-Authz-User-Groups: platform-admins,engineering
X-AI-Multi-Provider: enabled

{"model": "auto", "input": [{"role": "user",
  "content": "Analyze complex system architecture"}]}

HTTP/1.1 200 OK
X-AI-Provider-Used: premium-provider
X-AI-Model-Mapped: advanced-reasoning-model
X-AI-Authz-Applied: true
X-AI-User-Role: admin
        ]]></artwork>
        </figure>
        
        <t>
          The X-AI-Authz-Applied response header indicates whether authorization
          policies were considered in routing. The X-AI-User-Role header provides
          transparency about which role was matched for auditing purposes.
        </t>
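        <t>
          Identity headers are only trustworthy if clients cannot set them
          directly. The sketch below shows one way an authentication gateway
          might populate them. It assumes the gateway has already verified the
          JWT (signature, expiry, audience) and passes in the decoded claims.
          The "roles" claim name is an assumption that varies by identity
          provider, and the function name is hypothetical.
        </t>
        <sourcecode type="python"><![CDATA[
from typing import Any, Dict, Mapping

IDENTITY_HEADERS = {"x-authz-user-id", "x-authz-user-groups", "x-authz-user-roles"}


def _csv(values) -> str:
    # Names containing commas cannot be represented in a CSV header value; skip them.
    return ",".join(str(v) for v in values if "," not in str(v))


def inject_identity(headers: Mapping[str, str],
                    claims: Dict[str, Any]) -> Dict[str, str]:
    """Return upstream headers carrying identity from already-verified claims."""
    # Strip client-supplied identity headers (any casing) to prevent spoofing.
    out = {k: v for k, v in headers.items() if k.lower() not in IDENTITY_HEADERS}
    if claims.get("sub") is not None:
        out["X-Authz-User-Id"] = str(claims["sub"])
    groups = _csv(claims.get("groups") or [])
    if groups:
        out["X-Authz-User-Groups"] = groups
    roles = _csv(claims.get("roles") or [])  # claim name varies by identity provider
    if roles:
        out["X-Authz-User-Roles"] = roles
    return out
]]></sourcecode>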
      </section>

      <section title="RBAC Framework" numbered="false">
        <t>
          Role-based access control (RBAC) enables multi-tenant deployments where
          different users or groups receive different levels of service. The RBAC
          framework consists of:
        </t>
        <t>
          1. <strong>Role Bindings:</strong> Map users and groups to roles
          (e.g., "admin", "premium_user", "free_user").
        </t>
        <t>
          2. <strong>Routing Decisions:</strong> Associate roles with model
          access policies, provider selection rules, and capability restrictions.
        </t>
        <t>
          3. <strong>Priority Ordering:</strong> Evaluate routing decisions by
          priority to handle overlapping role assignments.
        </t>
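        <t>
          The three components can be sketched with two small record types.
          In this illustrative Python sketch, the type and function names are
          hypothetical, and a higher priority value is assumed to win. A user
          who matches no binding gets no decision, which leaves the choice
          between denial and a default tier to deployment policy. The role and
          model names mirror the example below.
        </t>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RoleBinding:
    """Component 1: maps users and groups to a role."""
    role: str
    users: frozenset = frozenset()
    groups: frozenset = frozenset()


@dataclass(frozen=True)
class RoutingDecision:
    """Component 2: what a role may use. Component 3: priority (higher wins)."""
    name: str
    role: str
    priority: int
    allowed_models: Tuple[str, ...]
    reasoning: bool


def roles_for(user: str, groups: Iterable[str],
              bindings: List[RoleBinding]) -> set:
    group_set = set(groups)
    return {b.role for b in bindings if user in b.users or group_set & b.groups}


def decide(user: str, groups: Iterable[str], bindings: List[RoleBinding],
           decisions: List[RoutingDecision]) -> Optional[RoutingDecision]:
    roles = roles_for(user, groups, bindings)
    matches = [d for d in decisions if d.role in roles]
    if not matches:
        return None  # deny or default tier: left to deployment policy
    # Overlapping roles: the highest-priority decision wins (first on ties).
    return max(matches, key=lambda d: d.priority)
]]></sourcecode>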
        
        <figure>
        <artwork><![CDATA[
# Admin user routed to premium model with reasoning
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-Authz-User-Id: alice
X-Authz-User-Groups: platform-admins
X-AI-Multi-Provider: enabled

{"model": "auto", "input": [{"role": "user",
  "content": "Analyze code for security vulnerabilities"}]}

# (X-AI-Auto-Selection is wrapped for readability; on the wire it is
#  a single-line JSON value)
HTTP/1.1 200 OK
X-AI-Provider-Used: vllm-premium
X-AI-Model-Mapped: qwen-14b-instruct
X-AI-RBAC-Role: admin
X-AI-Auto-Selection: {
  "rbac_evaluation": {
    "matched_role": "admin",
    "policy_applied": "admin_unrestricted"
  },
  "model_selection": {
    "allowed_models": ["qwen-14b-instruct", "qwen-7b-instruct"],
    "selected": "qwen-14b-instruct",
    "reasoning_enabled": true
  }
}
        ]]></artwork>
        </figure>
        
        <t>
          A free-tier user (X-Authz-User-Groups: free-tier) sending the same
          request would receive X-AI-RBAC-Role: free_user and be routed to
          qwen-7b-instruct with reasoning disabled. RBAC policies can combine
          with task classification, enabling context-aware authorization (e.g.,
          premium users get advanced models only for complex queries).
        </t>
      </section>

      <section title="Rate Limiting" numbered="false">
        <t>
          Multi-provider routers SHOULD implement rate limiting to protect
          infrastructure and ensure fair resource allocation across tenants.
        </t>
        
        <table anchor="ratelimit-headers">
          <name>Rate Limit Response Headers</name>
          <thead>
            <tr>
              <th>Header</th>
              <th>Values</th>
              <th>Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>X-RateLimit-Limit</td>
              <td>Integer</td>
              <td>Total requests allowed per time window</td>
            </tr>
            <tr>
              <td>X-RateLimit-Remaining</td>
              <td>Integer</td>
              <td>Requests remaining in the current window</td>
            </tr>
            <tr>
              <td>X-RateLimit-Reset</td>
              <td>Unix timestamp</td>
              <td>Time at which the current window resets</td>
            </tr>
            <tr>
              <td>X-RateLimit-Retry-After</td>
              <td>Seconds</td>
              <td>Time to wait before retrying (429 responses only)</td>
            </tr>
            <tr>
              <td>X-TokenLimit-Limit</td>
              <td>Integer</td>
              <td>Total tokens allowed per window (TPM)</td>
            </tr>
            <tr>
              <td>X-TokenLimit-Remaining</td>
              <td>Integer</td>
              <td>Tokens remaining in the current window</td>
            </tr>
          </tbody>
        </table>
        
        <t>
          Rate limiting can be applied at several levels: per request (RPM),
          per token (TPM), per model, and per user or group. Routers MAY
          implement rate limiting through an external Rate Limit Service (e.g.,
          Envoy RLS <xref target="ENVOY-RLS"/> via gRPC), a local in-process
          limiter using sliding window counters, or a hybrid chain of both
          in which the first limiter to deny a request rejects it
          (first-deny semantics).
        </t>
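        <t>
          The local sliding-window option and the first-deny chain can be
          sketched as follows. This sketch uses the common "sliding window
          counter" approximation, a weighted pair of fixed windows. Other
          variants, such as keeping a log of request timestamps, are equally
          valid. The names SlidingWindowCounter and check_chain are
          hypothetical. In a deployment, an external service such as Envoy
          RLS could take the place of the in-process counter.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
import time


class SlidingWindowCounter:
    """Approximate sliding window built from two fixed-window counters.

    The previous window's count is weighted by the fraction of it that
    still overlaps the sliding window ending now.
    """

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit            # requests (RPM) or tokens (TPM)
        self.window = window_seconds
        self.clock = clock
        self.index = None             # index of the current fixed window
        self.current = 0              # units used in the current window
        self.previous = 0             # units used in the previous window

    def allow(self, cost=1):
        now = self.clock()
        index = int(now // self.window)
        if index != self.index:
            # Carry the old count forward only if windows are adjacent.
            adjacent = self.index is not None and index == self.index + 1
            self.previous = self.current if adjacent else 0
            self.current = 0
            self.index = index
        overlap = 1 - (now - index * self.window) / self.window
        if self.previous * overlap + self.current + cost > self.limit:
            return False
        self.current += cost
        return True


def check_chain(limiters, cost=1):
    """First-deny chain over (name, limiter) pairs.

    Returns (allowed, name_of_denying_limiter). Limiters earlier in the
    chain have already counted the request when a later one denies;
    this sketch does not refund them.
    """
    for name, limiter in limiters:
        if not limiter.allow(cost):
            return False, name
    return True, None
]]></sourcecode>
        </figure>
        <t>
          Because check_chain passes the same cost to every limiter,
          request-based limiters (cost=1) and token-based limiters (cost set
          to the request's token count) would be kept in separate chains.
        </t>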
        
        <figure>
        <artwork><![CDATA[
# Rate limit exceeded
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1699126800
X-RateLimit-Retry-After: 42

{"error": {"message": "Rate limit exceeded",
  "type": "rate_limit_error",
  "code": "rate_limit_exceeded"}}

# Successful request with rate limit headers
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-TokenLimit-Limit: 1000000
X-TokenLimit-Remaining: 345678
X-AI-Provider-Used: vllm-premium
X-AI-RBAC-Role: premium_user
        ]]></artwork>
        </figure>
        
        <t>
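          The router SHOULD support both fail-open and fail-closed modes for
          errors in the rate limiter itself (e.g., an unreachable Rate Limit
          Service). In fail-closed mode (the default), such errors cause
          requests to be rejected, so limits cannot be bypassed during a
          limiter outage. In fail-open mode, such errors let requests
          through, prioritizing availability over enforcement.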
        </t>
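        <t>
          The two modes can be sketched as a thin wrapper around the limiter
          call. This sketch assumes the limiter is exposed as a callable that
          raises an exception on transport or service errors. LimiterUnavailable
          and admit are hypothetical names. Which exceptions count as limiter
          failures depends on the deployment.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
class LimiterUnavailable(Exception):
    """Hypothetical error raised when the limit service cannot answer."""


def admit(check, fail_open=False):
    """Apply the fail-open / fail-closed policy around a limiter call.

    check() returns True (under limit) or False (over limit), or raises
    when the limiter itself fails. Only limiter failures are affected
    by fail_open; an explicit over-limit answer is always honored.
    """
    try:
        return bool(check())
    except (LimiterUnavailable, TimeoutError, ConnectionError):
        # No decision available: fail-closed (the default) rejects,
        # fail-open lets the request through.
        return fail_open
]]></sourcecode>
        </figure>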
      </section>
    </section>

    <section title="Multi-Provider Orchestration with Responses API" numbered="true">
      <t>
        The following examples demonstrate how the extensions work with 
        OpenAI's Responses API while providing multi-provider 
        capabilities through auto-model and auto-tool selection.
      </t>

      <section title="Auto-Model Selection with Responses API" numbered="false">
        <t>
          Using the OpenAI Responses API with auto-model selection for 
          vendor-neutral multi-provider routing:
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: reasoning
X-AI-Routing-Strategy: balanced

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Solve: integral of x*sin(x^2)"
    }
  ],
  "response_format": {
    "type": "text"
  },
  "tools": "auto",
  "max_completion_tokens": 500
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: provider-anthropic
X-AI-Model-Mapped: claude-3-5-sonnet
X-AI-Auto-Selection: {
  "model_selection": {
    "requested": "auto",
    "criteria": "reasoning",
    "selected": "claude-3-5-sonnet",
    "reason": "best_math_reasoning"
  },
  "tool_selection": {
    "available_tools": ["calculator", "wolfram", "python"],
    "selected": "python",
    "reason": "symbolic_math"
  }
}

{
  "id": "resp-abc123",
  "object": "response",
  "created": 1699123456,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'll solve using substitution...",
        "tool_calls": [
          {
            "id": "call_python_123",
            "type": "function",
            "function": {
              "name": "python_calculator",
              "arguments": "{\"code\": \"import sympy as sp...\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
        ]]></artwork>
        </figure>
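        <t>
          A client can build the request above and read the router's decision
          from the response headers with a few lines of code. This is a
          sketch, not a client library: build_auto_request and
          routing_summary are hypothetical helpers. The sketch also assumes
          the X-AI-Auto-Selection value is transmitted as single-line JSON,
          because HTTP field values cannot contain raw line breaks. The
          multi-line layout in the example is only for readability.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
import json


def build_auto_request(prompt, task_hint=None, strategy=None,
                       max_tokens=None):
    """Return (headers, body) for an auto-selected request (sketch)."""
    headers = {
        "Content-Type": "application/json",
        "X-AI-Multi-Provider": "enabled",
    }
    if task_hint:
        headers["X-AI-Task-Hint"] = task_hint
    if strategy:
        headers["X-AI-Routing-Strategy"] = strategy
    body = {"model": "auto",
            "messages": [{"role": "user", "content": prompt}]}
    if max_tokens is not None:
        body["max_completion_tokens"] = max_tokens
    return headers, json.dumps(body)


def routing_summary(response_headers):
    """Extract the router's selection metadata from response headers."""
    # Header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in response_headers.items()}
    raw = h.get("x-ai-auto-selection")
    return {
        "provider": h.get("x-ai-provider-used"),
        "model": h.get("x-ai-model-mapped"),
        # Assumes the value is sent as single-line JSON.
        "selection": json.loads(raw) if raw else None,
    }
]]></sourcecode>
        </figure>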
      </section>

      <section title="Auto-Tool Selection with Provider Mapping" numbered="false">
        <t>
          The router automatically maps generic tool requests to provider-specific 
          implementations while maintaining Responses API compatibility:
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: coding
X-AI-Provider-Pool: openai,anthropic,cohere
X-AI-Tool-Categories: web-scraping,data-viz

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Create web scraper and visualize data"
    }
  ],
  "tools": "auto",
  "tool_choice": "auto",
  "response_format": {
    "type": "text"
  }
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: openai
X-AI-Model-Mapped: gpt-4-turbo
X-AI-Tool-Mapping: {
  "requested": "auto",
  "available_categories": ["web-scraping", "data-viz", "code-exec"],
  "provider_tools": {
    "openai": ["browser", "python", "dalle"],
    "anthropic": ["computer_use", "text_editor"],
    "cohere": ["web_search", "python_interpreter"]
  },
  "selected_tools": [
    {
      "category": "web-scraping",
      "provider_tool": "browser",
      "generic_name": "web_scraper"
    },
    {
      "category": "data-viz", 
      "provider_tool": "python",
      "generic_name": "data_visualizer"
    }
  ]
}

{
  "id": "resp-def456",
  "object": "response",
  "created": 1699123500,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'll create a web scraper and visualize data.",
        "tool_calls": [
          {
            "id": "call_browser_123",
            "type": "function",
            "function": {
              "name": "web_scraper",
              "arguments": "{\"url\": \"example.com\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 95,
    "total_tokens": 113
  }
}
        ]]></artwork>
        </figure>
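        <t>
          One way to implement the mapping is a per-provider catalogue from
          generic categories to native tools. The catalogue contents are
          assumptions for illustration, as are the rule of taking the first
          fully capable provider in X-AI-Provider-Pool order and the names
          map_tools and pick_provider. A real router would likely weigh tool
          coverage together with the model-selection criteria. The
          generic_name values are the names the client sees in tool_calls
          (e.g., web_scraper).
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
# Hypothetical catalogue: provider -> category -> (tool, generic name).
CATEGORY_TOOLS = {
    "openai": {
        "web-scraping": ("browser", "web_scraper"),
        "data-viz": ("python", "data_visualizer"),
    },
    "anthropic": {
        "web-scraping": ("computer_use", "web_scraper"),
    },
}


def map_tools(provider, categories):
    """Resolve generic categories to one provider's tools.

    Returns (selected, unsupported). selected mirrors the
    "selected_tools" entries of X-AI-Tool-Mapping.
    """
    catalogue = CATEGORY_TOOLS.get(provider, {})
    selected, unsupported = [], []
    for category in categories:
        if category in catalogue:
            tool, generic = catalogue[category]
            selected.append({"category": category,
                             "provider_tool": tool,
                             "generic_name": generic})
        else:
            unsupported.append(category)
    return selected, unsupported


def pick_provider(pool, categories):
    """First provider in pool order that covers every category."""
    for provider in pool:
        if not map_tools(provider, categories)[1]:
            return provider
    return None
]]></sourcecode>
        </figure>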
      </section>

      <section title="Vendor-Neutral Parameters with Auto-Selection" numbered="false">
        <t>
          With auto-selection, Responses API parameters remain vendor-neutral.
          The client sends the same request regardless of which provider is
          selected, so the router can switch providers without client changes:
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: analytical
X-AI-Tool-Categories: data-analysis,reporting
X-AI-Routing-Strategy: cost
X-AI-Quality-Threshold: 0.85

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Analyze data and create summary"
    }
  ],
  "tools": "auto",
  "tool_choice": "auto",
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "analysis_report",
      "schema": {
        "type": "object",
        "properties": {
          "summary": {"type": "string"},
          "insights": {"type": "array"}
        }
      }
    }
  },
  "max_completion_tokens": "auto"
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: cohere
X-AI-Model-Mapped: command-r-plus
X-AI-Auto-Decisions: {
  "model_selection": {
    "criteria": "cost-efficient + data-analysis",
    "alternatives": {
      "openai": {"model": "gpt-4o-mini", "cost": 0.15},
      "anthropic": {"model": "haiku", "cost": 0.25},
      "cohere": {"model": "command-r-plus", "cost": 0.08}
    },
    "selected": "cohere",
    "reason": "best_cost_above_threshold"
  },
  "tool_mapping": {
    "data-analysis": "cohere_connector",
    "reporting": "structured_output"
  },
  "token_optimization": {
    "requested": "auto",
    "calculated": 300,
    "basis": "task_complexity_analysis"
  }
}

{
  "id": "resp-ghi789",
  "object": "response",
  "created": 1699123600,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"summary\": \"Analysis...\", \"insights\": [...]}"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 285,
    "total_tokens": 435
  }
}
        ]]></artwork>
        </figure>
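        <t>
          The "best_cost_above_threshold" decision can be read as two steps:
          first discard candidates below X-AI-Quality-Threshold, then take
          the cheapest remaining candidate. The example does not show
          per-candidate quality scores, so the quality field and any values
          used with it are hypothetical. The Candidate and
          cheapest_above_threshold names are hypothetical as well.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass


@dataclass
class Candidate:
    provider: str
    model: str
    cost: float     # relative cost, as in the "alternatives" object
    quality: float  # hypothetical quality estimate in [0, 1]


def cheapest_above_threshold(candidates, threshold):
    """Lowest-cost candidate whose quality meets the threshold."""
    eligible = [c for c in candidates if c.quality >= threshold]
    if not eligible:
        return None  # caller may relax the threshold or reject
    return min(eligible, key=lambda c: c.cost)
]]></sourcecode>
        </figure>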
      </section>
    </section>

    <section title="Streaming and Auto-Selection Compatibility" numbered="true">
      <t>
        The extensions maintain full compatibility with OpenAI Responses 
        API streaming while providing auto-selection capabilities.
      </t>

      <section title="Streaming with Auto-Model Selection" numbered="false">
        <t>
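          Server-sent events (SSE) streaming works with auto-model selection
          because the router selects the provider before the stream starts.
          The selection headers are therefore returned once, with the
          response headers, rather than with each event: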
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: creative
X-AI-Routing-Strategy: latency

{
  "model": "auto",
  "messages": [{"role": "user", "content": "Write a story"}],
  "stream": true,
  "max_completion_tokens": "auto"
}

HTTP/1.1 200 OK
Content-Type: text/event-stream
X-AI-Provider-Used: anthropic
X-AI-Model-Mapped: claude-3-5-sonnet
X-AI-Auto-Selection: {"criteria": "creative", "confidence": 0.92}

data: {"id":"resp-stream1","object":"response.chunk",...}

data: {"id":"resp-stream1","object":"response.chunk",...}

data: [DONE]
        ]]></artwork>
        </figure>
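        <t>
          Because the provider is chosen before the stream starts, a client
          can read X-AI-Provider-Used and X-AI-Model-Mapped from the response
          headers and then consume events as usual. The hypothetical helper
          below parses the data: lines shown in the example and stops at the
          [DONE] sentinel. It is a simplified sketch, not a complete
          implementation of the SSE format.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
import json


def iter_sse_data(lines):
    """Yield JSON payloads from SSE "data:" lines until "[DONE]".

    Covers only the subset used in the example: one data line per
    event, events separated by blank lines. Other fields (event:, id:)
    and comment lines are skipped; multi-line data is not supported.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separator, comment or other field
        payload = line[5:]
        if payload.startswith(" "):
            payload = payload[1:]  # SSE strips one leading space
        if payload == "[DONE]":
            return
        yield json.loads(payload)
]]></sourcecode>
        </figure>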
      </section>

      <section title="Auto-Tool Selection with Responses API" numbered="false">
        <t>
          Tool calling with auto-selection maps generic tool requests to 
          provider-specific implementations seamlessly:
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: multimodal
X-AI-Tool-Categories: weather,web-search

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "What's the weather in Boston and latest news?"
    }
  ],
  "tools": "auto",
  "tool_choice": "auto"
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: openai
X-AI-Model-Mapped: gpt-4o
X-AI-Tool-Mapping: {
  "weather": "openai_weather_tool",
  "web-search": "openai_browser_tool"
}

{
  "id": "resp-func123",
  "object": "response",
  "created": 1699123700,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'll get the weather and latest news for you.",
        "tool_calls": [
          {
            "id": "call_weather_123",
            "type": "function",
            "function": {
              "name": "weather_lookup",
              "arguments": "{\"location\": \"Boston\"}"
            }
          },
          {
            "id": "call_news_456",
            "type": "function",
            "function": {
              "name": "web_search",
              "arguments": "{\"query\": \"Boston latest news\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
        ]]></artwork>
        </figure>
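        <t>
          Because the router returns generic function names, the client can
          dispatch tool calls without knowing which provider-specific tool
          was mapped. The sketch below assumes the Chat Completions-style
          message shape used in these examples, with tool results sent back
          as "tool" role messages keyed by tool_call_id. The
          dispatch_tool_calls function and the handler registry are
          hypothetical.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
import json


def dispatch_tool_calls(response, handlers):
    """Run the tool calls in the first choice; return tool messages.

    handlers maps the generic function names in tool_calls (e.g.
    weather_lookup) to local callables. Unknown names produce an
    error result instead of raising.
    """
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        handler = handlers.get(fn["name"])
        if handler is None:
            output = {"error": "unknown tool: " + fn["name"]}
        else:
            # arguments is a JSON-encoded string, not an object.
            output = handler(**json.loads(fn["arguments"]))
        results.append({"role": "tool",
                        "tool_call_id": call["id"],
                        "content": json.dumps(output)})
    return results
]]></sourcecode>
        </figure>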
      </section>
    </section>
    
    <section title="Performance-Based Auto-Selection" numbered="true">
      <t>
        The OpenAI Responses API with auto-selection enables dynamic 
        provider routing based on performance requirements and real-time 
        characteristics.
      </t>

      <section title="Latency-Optimized Auto-Selection" numbered="false">
        <t>
          Applications can specify performance requirements through extension 
          headers while using standard Responses API calls:
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: support
X-AI-Routing-Strategy: latency
X-AI-Max-Latency: 200

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Quick support response needed"
    }
  ],
  "max_completion_tokens": "auto",
  "response_format": {"type": "text"}
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: edge-provider-fast
X-AI-Model-Mapped: fast-response-model
X-AI-Auto-Selection: {
  "latency_achieved_ms": 180,
  "alternatives_rejected": [
    {"provider": "cloud", "latency_ms": 350, "reason": "slow"},
    {"provider": "premium", "latency_ms": 800, "reason": "limit"}
  ],
  "performance_tier": "edge-optimized"
}

{
  "id": "resp-support-001",
  "object": "response",
  "created": 1699123456,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I can help you right away..."
      },
      "finish_reason": "stop"
    }
  ]
}
        ]]></artwork>
        </figure>
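        <t>
          A latency-first strategy can be sketched as two steps. Providers
          are first filtered by the X-AI-Max-Latency budget, which is assumed
          here to be in milliseconds. The fastest remaining provider is then
          chosen. The latency estimates are assumed to come from the router's
          own monitoring. The function name and the "exceeds_max_latency"
          reason string are hypothetical.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
def select_for_latency(observed_ms, max_latency_ms):
    """Pick the fastest provider whose latency fits the budget.

    observed_ms maps provider -> recent latency estimate in ms (e.g. a
    moving average kept by the router). Returns (chosen, rejected);
    rejected mirrors "alternatives_rejected".
    """
    chosen, rejected = None, []
    for provider, latency in sorted(observed_ms.items(),
                                    key=lambda kv: kv[1]):
        if latency > max_latency_ms:
            rejected.append({"provider": provider,
                             "latency_ms": latency,
                             "reason": "exceeds_max_latency"})
        elif chosen is None:
            chosen = provider
    return chosen, rejected
]]></sourcecode>
        </figure>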
      </section>

      <section title="Quality vs Speed Trade-off" numbered="false">
        <t>
          Auto-selection can balance quality and performance requirements:
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: analytical
X-AI-Quality-Threshold: 0.85
X-AI-Max-Latency: 2000

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Analyze legal document for compliance"
    }
  ],
  "max_completion_tokens": "auto"
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: premium-balanced
X-AI-Model-Mapped: legal-analysis-model
X-AI-Auto-Decisions: {
  "quality_achieved": 0.92,
  "latency_ms": 1800,
  "tradeoffs": {
    "fastest": {"quality": 0.72, "rejected": "below_thresh"},
    "highest_quality": {"latency_ms": 5000, "rejected": "too_slow"}
  },
  "selection_rationale": "optimal_quality_within_latency_constraint"
}

{
  "id": "resp-legal-002",
  "object": "response",
  "created": 1699123500,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Based on my analysis of the document..."
      },
      "finish_reason": "stop"
    }
  ]
}
        ]]></artwork>
        </figure>
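        <t>
          The rationale in the example, "optimal quality within latency
          constraint", describes a constrained maximization. A minimal sketch
          follows. The tuple layout, the tie-breaking rule, and the function
          name are assumptions. The quality scores are presumed to come from
          an evaluation process that is not specified here.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
def best_quality_within_budget(candidates, min_quality, max_latency_ms):
    """Highest-quality candidate meeting both constraints.

    candidates: iterable of (name, quality, latency_ms) tuples.
    Ties on quality are broken by lower latency.
    """
    feasible = [c for c in candidates
                if c[1] >= min_quality and c[2] <= max_latency_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c[1], -c[2]))[0]
]]></sourcecode>
        </figure>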
      </section>
    </section>
    
    <section title="Multi-Turn Failover with Persistent Tracking" numbered="true">
      <t>
        The extensions maintain conversation continuity during provider failures 
        through persistent ID tracking and state preservation using standard 
        OpenAI conversation patterns.
      </t>
      
      <figure anchor="fig-failover-sequence" title="Multi-Turn Failover with OpenAI API">
        <artwork type="ascii-art"><![CDATA[
Client    Router    Provider A  Provider B  State Store
  |          |           |           |           |
  | Start    |           |           |           |
  | Conv     |           |           |           |
  |--------->|           |           |           |
  |          | Route +   |           |           |
  |          | Store     |           |           |
  |          |---------->|           |           |
  |          |           |           |        Store
  |          |           |           |      conv-001
  |          |<----------|           |           |
  | conv-001 |           |           |           |
  |<---------|           |           |           |
  |          |           |           |           |
  | Continue |           |           |           |
  |--------->|           |           |           |
  |          | Route +   |           |           |
  |          | Load      |           |           |
  |          |---------->|           |           |
  |          |           X TIMEOUT   |           |
  |          |           |           |           |
  |          | FAILOVER  |           |           |
  |          | Detect +  |           |      Retrieve
  |          | Retrieve  |           |     conv-001
  |          | State     |           |           |
  |          |           |           |           |
  |          | Route B + |           |           |
  |          | Full State|           |           |
  |          |---------------------->|           |
  |          |           |           |        Store
  |          |           |           |      conv-001
  |          |<----------------------|           |
  | conv-001 |           |           |           |
  | FAILOVER |           |           |           |
  |<---------|           |           |           |
        ]]></artwork>
      </figure>
      
      <section title="Seamless Failover Example" numbered="false">
        <t>
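          Because each request carries the full message history, the router
          can replay it to an alternate provider when the original provider
          fails, and the conversation continues without client intervention: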
        </t>
        <figure>
        <artwork><![CDATA[
# Turn 1: Initial request to Provider A
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: coding
X-AI-Failover-Policy: automatic

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Debug Python: def calc(x): return x/0"
    }
  ]
}

# Response from Provider A
HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: provider-a
X-AI-Model-Mapped: code-assistant-model
X-AI-Conversation-ID: conv-debug-001

{
  "id": "resp-debug-001",
  "object": "response",
  "created": 1699123456,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Division by zero error. Add error handling."
      },
      "finish_reason": "stop"
    }
  ]
}

# Turn 2: Follow-up (Provider A fails, auto-failover to Provider B)
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: coding
X-AI-Failover-Policy: automatic

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Debug Python: def calc(x): return x/0"
    },
    {
      "role": "assistant",
      "content": "Division by zero error. Add error handling."
    },
    {
      "role": "user",
      "content": "Show me the corrected code"
    }
  ]
}

# Auto-failover response from Provider B
HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: provider-b
X-AI-Model-Mapped: advanced-coder-model
X-AI-Failover-Occurred: true
X-AI-Auto-Selection: {
  "failover_reason": "provider_a_timeout",
  "failover_time_ms": 1200,
  "context_preserved": true,
  "conversation_continuity": "maintained"
}

{
  "id": "resp-debug-002",
  "object": "response",
  "created": 1699123500,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's the corrected code with error handling"
      },
      "finish_reason": "stop"
    }
  ]
}
        ]]></artwork>
        </figure>
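        <t>
          The failover behavior can be sketched as an ordered retry loop.
          Because the client resends the full message history, the router
          can replay the same messages to the next provider and then record
          the new turn in its state store. ProviderTimeout,
          ConversationStore, and complete_with_failover are hypothetical
          names. The in-memory store stands in for whatever persistent store
          a deployment uses.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
class ProviderTimeout(Exception):
    """Hypothetical error raised when a provider call times out."""


class ConversationStore:
    """In-memory stand-in for the State Store in the failover figure."""

    def __init__(self):
        self._history = {}

    def load(self, conv_id):
        return list(self._history.get(conv_id, []))

    def save(self, conv_id, messages):
        self._history[conv_id] = list(messages)


def complete_with_failover(providers, conv_id, messages, store):
    """Try (name, call) pairs in order, replaying the same history.

    call(messages) returns the assistant text or raises
    ProviderTimeout. Returns (reply, provider_used, failover_occurred),
    matching X-AI-Provider-Used and X-AI-Failover-Occurred.
    """
    for attempt, (name, call) in enumerate(providers):
        try:
            reply = call(messages)
        except ProviderTimeout:
            continue  # next provider gets the identical history
        turn = {"role": "assistant", "content": reply}
        store.save(conv_id, list(messages) + [turn])
        return reply, name, attempt > 0
    raise RuntimeError("all providers failed for " + conv_id)
]]></sourcecode>
        </figure>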
      </section>
    </section>
    
    <section title="Security-Aware Auto-Selection" numbered="true">
      <t>
        The extensions handle providers with different security requirements 
        and compliance levels through security-aware auto-selection.
      </t>
      
      <figure anchor="fig-security-selection" title="Security-Aware Provider Selection">
        <artwork type="ascii-art"><![CDATA[
Client      Router      Security    Provider Pool
  |           |         Evaluator       |
  | Request + |            |            |
  | Sec Reqs  |            |            |
  |---------->|            |            |
  |           | Evaluate   |            |
  |           | Security   |            |
  |           |----------->|            |
  |           |            | Check      |
  |           |            | Compliance |
  |           |            | Hi|Med|Lo  |
  |           |            |            |
  |           |<-----------|            |
  |           | Security   |            |
  |           | Matched    |            |
  |           |            |            |
  |           | Route to   |            |
  |           | Compliant  |            |
  |           |------------------------>|
  |           |            |            |
  |           |<------------------------|
  | Response +|            |            |
  | Sec Info  |            |            |
  |<----------|            |            |
        ]]></artwork>
      </figure>
      
      <section title="HIPAA-Compliant Auto-Selection" numbered="false">
        <t>
          Medical data processing with strict compliance requirements:
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: medical
X-AI-Security-Requirements: hipaa,pii
X-AI-Data-Classification: sensitive-medical

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Analyze patient symptoms for diagnosis"
    }
  ],
  "tools": "auto",
  "response_format": {"type": "text"}
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: healthcare-secure
X-AI-Model-Mapped: medical-analysis-hipaa
X-AI-Auto-Selection: {
  "security_compliance": {
    "hipaa": "certified",
    "soc2_type2": "verified",
    "encryption": "aes256_end_to_end",
    "data_residency": "us_only"
  },
  "rejected_providers": [
    {"provider": "public", "reason": "insufficient_hipaa"},
    {"provider": "intl", "reason": "data_residency"}
  ]
}

{
  "id": "resp-medical-001",
  "object": "response",
  "created": 1699123456,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Based on the symptom analysis..."
      },
      "finish_reason": "stop"
    }
  ]
}
        ]]></artwork>
        </figure>
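        <t>
          The evaluator step in the figure can be approximated in two parts.
          A set comparison checks the requirements in
          X-AI-Security-Requirements against each provider's properties, and
          a residency check follows. Several details are assumptions for
          illustration: the provider attributes, the mapping of
          sensitive-medical data to US-only residency, the reason strings,
          and the function name. In practice, compliance attributes would
          come from verified configuration rather than a static table.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
# Hypothetical provider attributes; a deployment would load these from
# configuration or verified attestations.
PROVIDERS = {
    "healthcare-secure": {"certs": {"hipaa", "pii", "soc2_type2"},
                          "region": "us"},
    "public": {"certs": {"pii"}, "region": "us"},
    "intl": {"certs": {"hipaa", "pii"}, "region": "eu"},
}


def filter_compliant(requirements, allowed_regions):
    """Split providers into compliant ones and rejections with reasons.

    In this sketch, an empty compliant list means the request should be
    rejected rather than sent to a non-compliant provider.
    """
    compliant, rejected = [], []
    for name, attrs in PROVIDERS.items():
        missing = set(requirements) - attrs["certs"]
        if missing:
            reason = "missing_" + "_".join(sorted(missing))
            rejected.append({"provider": name, "reason": reason})
        elif attrs["region"] not in allowed_regions:
            rejected.append({"provider": name,
                             "reason": "data_residency"})
        else:
            compliant.append(name)
    return compliant, rejected
]]></sourcecode>
        </figure>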
      </section>
      
      <section title="Multi-Tier Security with Data Segregation" numbered="false">
        <t>
          Financial workflow with mixed sensitivity levels using auto-selection:
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: financial
X-AI-Security-Requirements: pci-dss,sovereignty
X-AI-Data-Classification: financial-mixed

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Generate financial report"
    }
  ],
  "tools": "auto",
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "financial_report",
      "schema": {
        "type": "object",
        "properties": {
          "public_data": {"type": "object"},
          "private_data": {"type": "object"}
        }
      }
    }
  }
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: multi-tier-financial
X-AI-Model-Mapped: financial-segregation
X-AI-Auto-Selection: {
  "strategy": "segregation",
  "allocation": {
    "public": {"provider": "public", "sec": "basic"},
    "pci": {"provider": "secure", "sec": "pci"},
    "conf": {"provider": "private", "sec": "max"}
  },
  "flow_controls": {
    "cross_tier": "prohibited",
    "aggregation": "secure_comp"
  }
}

{
  "id": "resp-financial-001",
  "object": "response",
  "created": 1699123456,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"public_data\": {...}, \"private_data\": {...}}"
      },
      "finish_reason": "stop"
    }
  ]
}
        ]]></artwork>
        </figure>
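        <t>
          The segregation strategy in the example assigns each data tier to a
          provider with an adequate security level. A minimal sketch follows.
          It assumes the levels are strictly ordered as basic, pci, max, and
          it chooses the lowest sufficient level for each tier. The level
          names follow the example. The ordering, the selection rule, and the
          allocate function are assumptions.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
# Security levels from the example, assumed ordered lowest to highest.
LEVELS = ["basic", "pci", "max"]

# Hypothetical provider -> level table matching the example allocation.
PROVIDER_LEVEL = {"private": "max", "public": "basic", "secure": "pci"}


def allocate(tiers):
    """Map each data tier to the lowest-level provider that suffices.

    tiers maps tier name -> required level. Routing each tier
    separately is what keeps data from different tiers out of the same
    provider call (the "cross_tier": "prohibited" control).
    """
    rank = {level: i for i, level in enumerate(LEVELS)}
    ordered = sorted(PROVIDER_LEVEL.items(), key=lambda kv: rank[kv[1]])
    plan = {}
    for tier, needed in tiers.items():
        for provider, level in ordered:
            if rank[level] >= rank[needed]:
                plan[tier] = {"provider": provider, "sec": level}
                break
        else:
            raise ValueError("no provider satisfies tier " + tier)
    return plan
]]></sourcecode>
        </figure>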
      </section>
    </section>
    
    <section title="Advanced Failover and Performance Degradation" numbered="true">
      <t>
        The extensions support failover strategies that handle performance
        degradation and cascading failures while maintaining OpenAI API
        compatibility.
      </t>
      
      <figure anchor="fig-advanced-failover" title="Advanced Failover with Performance Monitoring">
        <artwork type="ascii-art"><![CDATA[
Client Router Monitor Prov-A  Prov-B  Prov-C
  |      |      |       |       |       |
  | Req  |      |       |       |       |
  |----->|      |       |       |       |
  |      | Mon  |       |       |       |
  |      | Start|       |       |       |
  |      |----->|       |       |       |
  |      |      | Check |       |       |
  |      |      | Health|       |       |
  |      |      |------>|       |       |
  |      |      |       |Degrade|       |
  |      |      |<------|       |       |
  |      |      |       |       |       |
  |      | Route|       |       |       |
  |      | to B |       |       |       |
  |      |--------------------->|       |
  |      |      |       |       |       |
  |      |      | Mon   |       |       |
  |      |      | Prov B|       |       |
  |      |      |-------------->|       |
  |      |      |       |    Overload   |
  |      |      |<--------------|       |
  |      |      |       |       |       |
  |      |GRACEFUL      |       |       |
  |      |DEGRADE       |       |       |
  |      |Route C       |       |       |
  |      |----------------------------->|
  |      |      |       |       |       |
  |      |<-----------------------------|
  | Resp |      |       |       |       |
  |<-----|      |       |       |       |
        ]]></artwork>
      </figure>
      
      <section title="Cascading Failover with Quality Adjustment" numbered="false">
        <t>
          Auto-selection with graceful degradation during system stress:
        </t>
        <figure>
        <artwork><![CDATA[
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: creative
X-AI-Quality-Threshold: 0.8
X-AI-Failover-Policy: cascading

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Write comprehensive product documentation"
    }
  ],
  "max_completion_tokens": "auto"
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: provider-fast
X-AI-Model-Mapped: efficient-writer-model
X-AI-Auto-Selection: {
  "failover_cascade": {
    "primary_attempt": {
      "provider": "premium",
      "status": "degraded",
      "quality_estimate": 0.95,
      "response_time_ms": 8000,
      "decision": "too_slow"
    },
    "secondary_attempt": {
      "provider": "balanced",
      "status": "overloaded",
      "queue_depth": 150,
      "decision": "capacity_exceeded"
    },
    "tertiary_selection": {
      "provider": "fast",
      "status": "available",
      "quality_estimate": 0.82,
      "response_time_ms": 1200,
      "decision": "selected_with_quality_adjustment"
    }
  },
  "quality_adjustment": {
    "preferred": 0.95,
    "threshold": 0.8,
    "achieved": 0.82,
    "mitigation": "post_processing_available"
  }
}

{
  "id": "resp-docs-001",
  "object": "response",
  "created": 1699123456,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "# Product Docs\n\nGuide..."
      },
      "finish_reason": "stop"
    }
  ]
}
        ]]></artwork>
        </figure>
      </section>
    </section>
    
    <section title="Workflow State Management and Branching" numbered="true">
      <t>
        Complex workflows can branch and merge while maintaining conversation 
        state through standard OpenAI message arrays and extension headers.
      </t>
      
      <figure anchor="fig-workflow-branching" title="Workflow Branching with State Inheritance">
        <artwork type="ascii-art"><![CDATA[
           Initial Request
                 |
                 v
          +-------------+
          | Base        |
          | Analysis    |
          | Provider A  |
          | conv-001    |
          +------+------+
                 |
                 | State: base
                 |
        +-------+--------+
        |                |
        v                v
  +-----------+    +-----------+
  | Marketing |    | Product   |
  | Prov B    |    | Prov C    |
  | conv-mkt  |    | conv-prod |
  +-----+-----+    +-----+-----+
        |                |
        | Inherits       | Inherits
        | Base State     | Base State
        |                |
        +-------+--------+
                |
                v
        +-------------+
        | Final Report|
        | Provider D  |
        | conv-final  |
        +-------------+
                |
                | Merges
                v
          Final Report
        ]]></artwork>
      </figure>
      
      <section title="Workflow Branching Example" numbered="false">
        <t>
          Multi-branch workflow with state inheritance using conversation arrays:
        </t>
        <figure>
        <artwork><![CDATA[
# Initial workflow step
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: analytical

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Analyze user behavior data for insights"
    }
  ]
}

# Base analysis response
HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: analytics-provider
X-AI-Conversation-ID: conv-behavior-001
X-AI-Workflow-Step: base-analysis

{
  "id": "resp-base-001",
  "object": "response",
  "created": 1699123456,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Analysis complete. Found 3 segments..."
      },
      "finish_reason": "stop"
    }
  ]
}

# Branch 1: Marketing insights
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: marketing
X-AI-Parent-Conversation: conv-behavior-001
X-AI-Workflow-Branch: marketing

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Analyze user behavior data for insights"
    },
    {
      "role": "assistant",
      "content": "Analysis complete. Found 3 segments..."
    },
    {
      "role": "user",
      "content": "Generate marketing recommendations"
    }
  ]
}

# Branch 2: Product insights (parallel)
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: product
X-AI-Parent-Conversation: conv-behavior-001
X-AI-Workflow-Branch: product

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Analyze user behavior data for insights"
    },
    {
      "role": "assistant",
      "content": "Analysis complete. Found 3 segments..."
    },
    {
      "role": "user",
      "content": "Generate product improvements"
    }
  ]
}

# Merge branches for final report
POST /v1/responses HTTP/1.1
Host: multi-provider.example.com
Authorization: Bearer sk-...
Content-Type: application/json
X-AI-Multi-Provider: enabled
X-AI-Task-Hint: analytical
X-AI-Parent-Conversation: conv-behavior-001
X-AI-Merge-Branches: marketing,product

{
  "model": "auto",
  "messages": [
    {
      "role": "user",
      "content": "Create executive summary"
    }
  ],
  "tools": "auto",
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "executive_summary",
      "schema": {
        "type": "object",
        "properties": {
          "marketing_insights": {"type": "array"},
          "product_recommendations": {"type": "array"},
          "combined_strategy": {"type": "string"}
        }
      }
    }
  }
}

HTTP/1.1 200 OK
Content-Type: application/json
X-AI-Provider-Used: report-generator
X-AI-Model-Mapped: executive-summary-model
X-AI-Auto-Selection: {
  "branches_merged": ["marketing", "product"],
  "context_integration": "complete",
  "workflow_completion": "success"
}

{
  "id": "resp-final-001",
  "object": "response",
  "created": 1699123600,
  "model": "auto",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"marketing_insights\": [...], \"product_recommendations\": [...], \"combined_strategy\": \"...\"}"
      },
      "finish_reason": "stop"
    }
  ]
}
        ]]></artwork>
        </figure>
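        <t>
          The router-side bookkeeping behind this workflow can be sketched 
          as follows. This is a minimal, hypothetical in-memory 
          illustration, not a normative algorithm. It assumes that the 
          router records the base-analysis history under the 
          X-AI-Conversation-ID it returned, records each branch reply under 
          the X-AI-Parent-Conversation and X-AI-Workflow-Branch values, and 
          can identify the parent conversation of a merge request. The 
          WorkflowStore class and its methods are illustrative names.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
# Hypothetical in-memory bookkeeping for workflow branches. Assumes
# the router keys branch state by X-AI-Parent-Conversation and
# X-AI-Workflow-Branch and resolves X-AI-Merge-Branches against it.
from dataclasses import dataclass, field


@dataclass
class WorkflowStore:
    # conversation id -> shared history from the base step
    bases: dict = field(default_factory=dict)
    # (conversation id, branch name) -> assistant replies on the branch
    branches: dict = field(default_factory=dict)

    def record_base(self, conv_id, messages):
        self.bases[conv_id] = list(messages)

    def record_branch(self, conv_id, branch, reply):
        # Branches may only fork from a known parent conversation.
        if conv_id not in self.bases:
            raise KeyError(f"unknown parent conversation: {conv_id}")
        self.branches.setdefault((conv_id, branch), []).append(reply)

    def merge(self, conv_id, header, messages):
        """Context for a request carrying X-AI-Merge-Branches."""
        names = [n.strip() for n in header.split(",") if n.strip()]
        context = list(self.bases[conv_id])
        for name in names:
            for reply in self.branches.get((conv_id, name), []):
                # Tag each reply with its branch so the model can tell
                # the marketing and product outputs apart.
                context.append({"role": "assistant",
                                "content": f"[{name}] {reply}"})
        return context + list(messages)
        ]]></sourcecode>
        </figure>
        <t>
          For the exchange above, a router would call record_base() after 
          the base-analysis step, record_branch() once for each branch 
          response, and merge() for the final request, so that the "Create 
          executive summary" prompt is evaluated with the base analysis and 
          both branch outputs in context. A production router would also 
          need persistence and expiry for this workflow state.
        </t>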
      </section>
    </section>
    
    <section title="Implementation Architecture" numbered="true">
      <t>
        The multi-provider extensions can be implemented as a proxy layer that 
        sits between clients and provider endpoints, or as enhanced provider 
        implementations that support multi-provider orchestration.
      </t>

      <figure anchor="fig-architecture" title="Multi-Provider Router Architecture">
        <artwork type="ascii-art"><![CDATA[
Client Applications
(Standard OpenAI API)
         |
         | HTTP/HTTPS
         v
+------------------+
| Multi-Provider   |
| Router           |
| - Header parsing |
| - Route decision |
| - Failover logic |
| - Response merge |
| - RBAC engine    |
| - Rate limiter   |
+------------------+
         |
    +----+----+----+
    |    |    |    |
    v    v    v    v
+------+ +------+ +------+ +------+
|OpenAI| |Azure | |Anthro| |Local |
|API   | |OpenAI| |Claude| |Model |
+------+ +------+ +------+ +------+
        ]]></artwork>
      </figure>

      <section title="Router Components" numbered="false">
        <t>
          The multi-provider router consists of several key components:
        </t>
        <t>
          1. <strong>Header Parser:</strong> Extracts multi-provider preferences 
          from request headers while preserving standard OpenAI API structure.
        </t>
        <t>
          2. <strong>Provider Registry:</strong> Maintains information about 
          available providers, their capabilities, current status, and 
          performance metrics.
        </t>
        <t>
          3. <strong>Routing Engine:</strong> Implements provider selection 
          algorithms based on client preferences, provider capabilities, and 
          real-time performance data.
        </t>
        <t>
          4. <strong>Request Translator:</strong> Adapts requests to 
          provider-specific requirements while maintaining OpenAI API 
          compatibility.
        </t>
        <t>
          5. <strong>Response Normalizer:</strong> Ensures all responses conform 
          to standard OpenAI API format regardless of the underlying provider.
        </t>
        <t>
          6. <strong>Failover Manager:</strong> Handles provider failures and 
          implements retry logic with alternative providers.
        </t>
        <t>
          7. <strong>RBAC Engine:</strong> Evaluates role bindings and 
          authorization policies to determine permitted models and providers 
          for each authenticated user.
        </t>
        <t>
          8. <strong>Rate Limiter:</strong> Enforces request and token rate 
          limits per user, group, and model using local or external rate 
          limiting services.
        </t>
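        <t>
          To illustrate how the Header Parser, RBAC Engine, Routing Engine, 
          and Failover Manager cooperate, the following Python sketch 
          filters and ranks providers for a single request. It is a minimal 
          sketch under stated assumptions, not a reference implementation: 
          it assumes that X-AI-Max-Latency is expressed in milliseconds, 
          X-AI-Cost-Limit is a per-request budget, X-AI-Quality-Threshold 
          is a score between 0 and 1, and X-AI-Provider-Pool is a 
          comma-separated list of provider names. The Provider record, its 
          fields, and the ranking order are hypothetical.
        </t>
        <figure>
        <sourcecode type="python"><![CDATA[
# Hypothetical routing pipeline: parse hints, apply the RBAC-permitted
# provider set, filter by constraints, rank, then fail over in order.
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    healthy: bool          # from the Provider Registry
    p50_latency_ms: float  # assumed unit: milliseconds
    est_cost_usd: float    # assumed per-request cost estimate
    quality: float         # assumed score in [0, 1]


def parse_prefs(headers):
    """Header Parser: read optional hints; absent headers mean no limit."""
    h = {k.lower(): v for k, v in headers.items()}
    pool = h.get("x-ai-provider-pool")
    return {
        "pool": [p.strip() for p in pool.split(",")] if pool else None,
        "max_latency": float(h.get("x-ai-max-latency", "inf")),
        "cost_limit": float(h.get("x-ai-cost-limit", "inf")),
        "quality": float(h.get("x-ai-quality-threshold", "0")),
    }


def rank_providers(registry, prefs, permitted):
    """Routing Engine: keep eligible providers, then rank them."""
    candidates = [
        p for p in registry
        if p.healthy
        and p.name in permitted  # set produced by the RBAC Engine
        and (prefs["pool"] is None or p.name in prefs["pool"])
        and p.p50_latency_ms <= prefs["max_latency"]
        and p.est_cost_usd <= prefs["cost_limit"]
        and p.quality >= prefs["quality"]
    ]
    # One possible strategy: highest quality first, then lowest cost.
    return sorted(candidates,
                  key=lambda p: (-p.quality, p.est_cost_usd))


def dispatch(ranked, send):
    """Failover Manager: try providers in rank order."""
    errors = []
    for provider in ranked:
        try:
            return provider.name, send(provider)
        except ConnectionError as exc:  # treated as retryable here
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
        ]]></sourcecode>
        </figure>
        <t>
          Ranking by quality and then cost is only one possible strategy; 
          a router honoring X-AI-Routing-Strategy would substitute a 
          different sort key, and the Request Translator and Response 
          Normalizer would wrap the send() callable.
        </t>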
      </section>
    </section>

    <section title="Backward Compatibility Guarantees" numbered="true">
      <t>
        The extensions provide strong backward compatibility guarantees:
      </t>
      <t>
        1. <strong>API Compatibility:</strong> All standard OpenAI API endpoints, 
        request formats, and response formats remain unchanged. Existing 
        applications work without modification.
      </t>
      <t>
        2. <strong>Default Behavior:</strong> Requests without extension headers 
        behave identically to standard OpenAI API calls, typically routing to 
        a default provider.
      </t>
      <t>
        3. <strong>Error Handling:</strong> Error responses maintain standard 
        OpenAI API error format and codes, ensuring existing error handling 
        logic continues to work.
      </t>
      <t>
        4. <strong>Authentication:</strong> Standard OpenAI API authentication 
        mechanisms (API keys, bearer tokens) are preserved and work unchanged.
      </t>
      <t>
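        5. <strong>Rate Limiting:</strong> Rate-limited requests are rejected 
        with HTTP status code 429 (Too Many Requests), as with the standard 
        OpenAI API, so existing retry logic continues to work; the 
        additional rate limit response headers are informational.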
      </t>
      <t>
        6. <strong>Optional Extensions:</strong> Authorization, RBAC, and rate 
        limiting features are optional enhancements that do not affect clients 
        unaware of these capabilities.
      </t>
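      <t>
        The default-routing and error-handling guarantees can be illustrated 
        with a minimal sketch. It assumes a hypothetical DEFAULT_PROVIDER 
        setting and a router callable supplied by the deployment. The error 
        body uses the error object shape of the OpenAI API (an "error" 
        member carrying "message", "type", "param", and "code"), so that 
        existing client error handling can parse it unchanged.
      </t>
      <figure>
      <sourcecode type="python"><![CDATA[
# Sketch: requests without extension headers take the default path,
# and errors keep the OpenAI API error object shape.
import json

DEFAULT_PROVIDER = "default"  # hypothetical deployment setting


def uses_extensions(headers):
    h = {k.lower(): v.strip().lower() for k, v in headers.items()}
    return h.get("x-ai-multi-provider") == "enabled"


def select_provider(headers, body, router):
    # Without X-AI-Multi-Provider, skip multi-provider logic entirely.
    if not uses_extensions(headers):
        return DEFAULT_PROVIDER
    return router(headers, body)


def openai_error(status, message, err_type, code=None, param=None):
    """Return (HTTP status, JSON body) in the standard error shape."""
    body = {"error": {"message": message, "type": err_type,
                      "param": param, "code": code}}
    return status, json.dumps(body)
      ]]></sourcecode>
      </figure>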
    </section>

    <section title="Security Considerations" numbered="true">
      <t>
        Multi-provider routing introduces several security considerations:
      </t>
      <t>
        <strong>Credential Management:</strong> The router must securely store 
        and use credentials for multiple providers, and must ensure that 
        client credentials are never forwarded to providers for which they 
        were not issued.
      </t>
      <t>
        <strong>Data Privacy:</strong> Request data may be processed by 
        different providers with varying privacy policies. The router should 
        provide mechanisms to exclude providers from handling a request 
        based on the sensitivity of the data it contains.
      </t>
      <t>
        <strong>Audit Logging:</strong> Multi-provider routing decisions, 
        including the selected provider, the mapped model, and whether 
        failover occurred, should be logged for security auditing and 
        compliance purposes.
      </t>
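      <t>
        A routing decision could be captured as a single structured log 
        record, as in the following sketch. The field names are hypothetical 
        and simply mirror the X-AI-Provider-Used, X-AI-Model-Mapped, 
        X-AI-Failover-Occurred, and X-AI-RBAC-Role response headers; a 
        deployment would choose its own schema and retention policy, and may 
        need to pseudonymize user identifiers.
      </t>
      <figure>
      <sourcecode type="python"><![CDATA[
# Sketch: one JSON audit record per routing decision (field names are
# hypothetical and mirror the transparency response headers).
import json
from datetime import datetime, timezone


def audit_record(user_id, requested_model, provider, mapped_model,
                 failover, rbac_role=None):
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "requested_model": requested_model,
        "provider_used": provider,         # X-AI-Provider-Used
        "model_mapped": mapped_model,      # X-AI-Model-Mapped
        "failover_occurred": failover,     # X-AI-Failover-Occurred
        "rbac_role": rbac_role,            # X-AI-RBAC-Role
    }, sort_keys=True)
      ]]></sourcecode>
      </figure>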
      <t>
        <strong>Provider Trust:</strong> The router must communicate with all 
        providers over secure channels, such as TLS, and must validate each 
        provider's certificate.
      </t>
      <t>
        <strong>Identity Header Security:</strong> Identity headers 
        (X-Authz-User-Id, X-Authz-User-Groups, and X-Authz-User-Roles) 
        MUST only be accepted from trusted authentication gateways. To 
        prevent authentication bypass, the router SHOULD strip these 
        headers from client requests and trust them only when they are 
        injected by the authentication layer.
      </t>
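      <t>
        A minimal sketch of this stripping rule follows. For illustration 
        only, it assumes that the trusted authentication gateway is 
        recognized by its network address (the hypothetical 
        TRUSTED_GATEWAYS set, which uses a documentation address); mutually 
        authenticated TLS between gateway and router is a stronger way to 
        establish the same trust.
      </t>
      <figure>
      <sourcecode type="python"><![CDATA[
# Sketch: drop client-supplied identity headers unless the request
# arrived from a trusted authentication gateway (hypothetical check).
IDENTITY_HEADERS = {"x-authz-user-id", "x-authz-user-groups",
                    "x-authz-user-roles"}
TRUSTED_GATEWAYS = {"192.0.2.10"}  # hypothetical gateway address


def sanitize_identity(headers, peer_addr):
    """Keep X-Authz-* headers only when injected by a trusted gateway."""
    if peer_addr in TRUSTED_GATEWAYS:
        return dict(headers)
    # Header names are case-insensitive, so compare in lowercase.
    return {k: v for k, v in headers.items()
            if k.lower() not in IDENTITY_HEADERS}
      ]]></sourcecode>
      </figure>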
      <t>
        <strong>RBAC Policy Security:</strong> Role binding configurations 
        should be protected with appropriate access controls. Misconfigured 
        RBAC policies could grant unauthorized access to premium models or 
        providers.
      </t>
      <t>
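        <strong>Rate Limit Bypass Prevention:</strong> In fail-closed mode 
        (recommended for production), requests SHOULD be rejected when the 
        rate limiter itself fails (for example, when an external rate limit 
        service is unreachable), so that limits cannot be bypassed. 
        Fail-open mode, which admits such requests, should only be used 
        when availability requirements outweigh rate limit enforcement.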
      </t>
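      <t>
        The fail-closed and fail-open modes can be sketched with a local 
        token bucket. This is an illustrative Python sketch, not the 
        algorithm of any particular rate limiting service: the TokenBucket 
        class, its parameters, and the admit() helper are hypothetical, and 
        a client for an external service such as one implementing 
        <xref target="ENVOY-RLS"/> could stand in for the local bucket as 
        the check callable.
      </t>
      <figure>
      <sourcecode type="python"><![CDATA[
# Sketch: token bucket plus a wrapper that applies fail-closed or
# fail-open behavior when the limiter check itself raises an error.
import time


class TokenBucket:
    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate_per_s, burst, clock
        self.tokens, self.last = float(burst), clock()

    def take(self, cost=1.0):
        # Refill in proportion to elapsed time, capped at the burst size.
        now = self.clock()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def admit(check, cost=1.0, fail_closed=True):
    """Run the limiter check; on limiter failure, apply the mode."""
    try:
        return check(cost)
    except Exception:  # e.g., external rate limit service unreachable
        return not fail_closed
      ]]></sourcecode>
      </figure>
      <t>
        A request budget would call admit() with a cost of 1 and a refill 
        rate of the RPM limit divided by 60; a token budget would pass the 
        estimated token count of the request as the cost.
      </t>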
    </section>

    <section title="IANA Considerations" numbered="true">
      <t>
        This document requests registration of the following HTTP field 
        names in the "Hypertext Transfer Protocol (HTTP) Field Name 
        Registry":
      </t>
      <t>
        Request Headers (Decision Assistance):<br/>
        - X-AI-Multi-Provider<br/>
        - X-AI-Provider-Pool<br/>
        - X-AI-Routing-Strategy<br/>
        - X-AI-Task-Hint<br/>
        - X-AI-Tool-Categories<br/>
        - X-AI-Reasoning-Preference<br/>
        - X-AI-Quality-Threshold<br/>
        - X-AI-Max-Latency<br/>
        - X-AI-Cost-Limit<br/>
        - X-AI-Failover-Policy<br/><br/>
        Request Headers (Authorization):<br/>
        - X-Authz-User-Id<br/>
        - X-Authz-User-Groups<br/>
        - X-Authz-User-Roles<br/><br/>
        Response Headers (Transparency):<br/>
        - X-AI-Provider-Used<br/>
        - X-AI-Model-Mapped<br/>
        - X-AI-Auto-Selection<br/>
        - X-AI-Tool-Mapping<br/>
        - X-AI-Auto-Decisions<br/>
        - X-AI-Failover-Occurred<br/>
        - X-AI-Selection-Confidence<br/>
        - X-AI-Authz-Applied<br/>
        - X-AI-User-Role<br/>
        - X-AI-RBAC-Role<br/><br/>
        Response Headers (Rate Limiting):<br/>
        - X-RateLimit-Limit<br/>
        - X-RateLimit-Remaining<br/>
        - X-RateLimit-Reset<br/>
        - X-RateLimit-Retry-After<br/>
        - X-TokenLimit-Limit<br/>
        - X-TokenLimit-Remaining
      </t>
    </section>

  </middle>

  <back>
    <references title="Normative References">
      <reference anchor="RFC2119">
        <front>
          <title>Key words for use in RFCs to Indicate Requirement Levels</title>
          <author initials="S." surname="Bradner"/>
          <date year="1997" month="March"/>
        </front>
        <seriesInfo name="RFC" value="2119"/>
      </reference>
      <reference anchor="RFC8174">
        <front>
          <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
          <author initials="B." surname="Leiba"/>
          <date year="2017" month="May"/>
        </front>
        <seriesInfo name="RFC" value="8174"/>
      </reference>
    </references>

    <references title="Informative References">
      <reference anchor="OPENAI-RESPONSES-API" target="https://platform.openai.com/docs/api-reference/responses/create">
        <front>
          <title>OpenAI Responses API Specification</title>
          <author>
            <organization>OpenAI</organization>
          </author>
          <date year="2025"/>
        </front>
      </reference>
      <reference anchor="RFC6750">
        <front>
          <title>The OAuth 2.0 Authorization Framework: Bearer Token Usage</title>
          <author initials="M." surname="Jones"/>
          <author initials="D." surname="Hardt"/>
          <date year="2012" month="October"/>
        </front>
        <seriesInfo name="RFC" value="6750"/>
      </reference>
      <reference anchor="RFC7519">
        <front>
          <title>JSON Web Token (JWT)</title>
          <author initials="M." surname="Jones"/>
          <author initials="J." surname="Bradley"/>
          <author initials="N." surname="Sakimura"/>
          <date year="2015" month="May"/>
        </front>
        <seriesInfo name="RFC" value="7519"/>
      </reference>
      <reference anchor="ENVOY-RLS" target="https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ratelimit/v3/rls.proto">
        <front>
          <title>Envoy Rate Limit Service</title>
          <author>
            <organization>Envoy Proxy</organization>
          </author>
          <date year="2025"/>
        </front>
      </reference>
    </references>

    <section title="Acknowledgments" numbered="false">
      <t>
        The authors thank the OpenAI team for creating the foundational API 
        standard that enables this ecosystem, and the broader AI community 
        for adopting OpenAI-compatible interfaces that make multi-provider 
        orchestration possible. Thanks to the Envoy community for the Rate
        Limit Service specification and the Kubernetes community for RBAC
        design patterns that informed the authorization framework.
      </t>
    </section>
    
    <section anchor="appendix-examples" title="Implementation Examples" numbered="false">
      <t>
        This document includes implementation examples throughout the main 
        sections, demonstrating:
      </t>
      <t>
        - Auto-model selection with vendor-neutral routing (Section 4)<br/>
        - Auto-tool selection and provider mapping (Section 4)<br/>
        - Performance-based routing with latency and quality constraints (Section 6)<br/>
        - Security-aware provider selection for compliance (Section 7)<br/>
        - Multi-turn failover with persistent state tracking (Section 5)<br/>
        - Workflow branching and state inheritance patterns (Section 8)<br/>
        - Identity-based authorization with JWT integration (Section 6)<br/>
        - RBAC-aware routing for multi-tenant deployments (Section 6)<br/>
        - Rate limiting with RPM/TPM budgets (Section 6)
      </t>
      <t>
        Each example shows HTTP request and response messages that use the 
        standard OpenAI Responses API format together with the optional 
        multi-provider extension headers. The examples are intended to be 
        easy to adapt for rapid prototyping and to serve as a starting point 
        for production deployments.
      </t>
    </section>
  </back>
</rfc>
