<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="rfc7991bis.rnc"?>
<!DOCTYPE rfc [
<!ENTITY docname "draft-swhited-ogg-skeleton-00">
]>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category="info" docName="&docname;" ipr="trust200902" submissionType="independent" xml:lang="en" version="3">
  <front>
    <title abbrev="skeleton">Ogg Skeleton</title>
    <seriesInfo name="Internet-Draft" value="&docname;"/>
    <author fullname="Sam Whited" initials="ssw" role="editor" surname="Whited">
      <address>
        <email>sam@samwhited.com</email>
        <uri>https://blog.samwhited.com</uri>
      </address>
    </author>
    <date year="2026" month="3" day="22"/>
    <area>General</area>
    <workgroup>Internet Engineering Task Force</workgroup>
    <keyword>audio</keyword>
    <keyword>ogg</keyword>
    <abstract>
      <t>
        Ogg Skeleton defines a logical bitstream that provides structuring
        information for multitrack Ogg files.
        It provides clues for synchronization and content negotiation
        including language selection.
        It also provides keypoint indices for optimal seeking over
        high-latency connections or in time-critical scenarios.
      </t>
    </abstract>
  </front>
  <middle>
    <section>
      <name>Introduction</name>
      <t>
        Ogg <xref target="RFC3533"/> is a generic container format, enabling
        interleaving of several tracks of frame-wise encoded content in a
        time-multiplexed manner.
        As an example, an Ogg physical bitstream could encapsulate
        several tracks of video encoded in <xref target="Theora"/> and multiple
        tracks of audio encoded in Opus <xref target="RFC6716"/> or FLAC
        <xref target="RFC9639"/> at the same time.
        A player that decodes such a bitstream could then play one video channel
        as the main video playback, alpha-blend another one on top of it (e.g. a
        caption track), play a main Opus audio track together with several FLAC
        audio tracks simultaneously (e.g. as sound effects), and provide a
        choice of Opus channels providing commentary in different languages.
        Such a file is generally possible to create with Ogg, it is
        however not possible to generically parse such a file, seek on
        it, understand what codecs are contained in such a file, and
        dynamically handle and play back such content.
      </t>
      <t>
        Ogg does not know anything about the content it carries and leaves it to
        the media mapping of each codec to declare and describe itself.
        There is no meta information available at the Ogg level about
        the content tracks encapsulated within an Ogg physical
        bitstream.
        This is particularly a problem if you don't have all the decoder
        libraries available and just want to parse an Ogg file to find
        out what type of data it encapsulates, or want to seek to a temporal
        offset without having to decode the data (such as on a Web server that
        serves media).
      </t>
      <t>
        Ogg Skeleton is designed to overcome these problems.
        Ogg Skeleton is a logical bitstream within an Ogg stream that
        contains information about the other encapsulated logical
        bitstreams.
        For each logical bitstream it provides information such as its
        media type, and explains the way that Ogg pages are mapped to time.
      </t>
      <t>
        Seeking in an Ogg file is typically implemented as a bisection search
        for the seek target timestamp.
        However when seeking over a high latency connection, such as the
        internet, such searches can be slow.
        Some bitstreams, such as Theora video streams, have keyframes.
        In order to seek to a given temporal offset in a Theora stream, you must
        first perform a bisection search to find the target Theora
        frame, determine its keyframe, and then perform another
        bisection search to locate that keyframe and decode forwards to
        the temoporal offset.
        This can be very slow.
        Ogg Skeleton provides an index of keyframes, or an index of periodic
        samples on streams without the concept of a keyframe, enabling seeking
        over high-latency connections optimally with "one hop".
      </t>
      <t>
        Ogg Skeleton is also designed to allow the creation of substreams from
        Ogg physical bitstreams that retain the original timing information.
        For example, when cutting out the segment between the 7th and
        the 59th second of an Ogg file, it is ideal to continue to
        start this cut out file with a playback time of 7 seconds and
        not of 0.
        This is of particular interest if you're streaming this file
        from a web server after a query for a temporal subpart such as
        in http://example.com/video.ogv?t=7-59 .
      </t>
      <section anchor="requirements">
        <name>Requirements Language</name>
        <t>
          The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>",
          "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>",
          "<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>",
          "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>",
          "<bcp14>NOT RECOMMENDED</bcp14>", "<bcp14>MAY</bcp14>", and
          "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
          described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/>
          when, and only when, they appear in all capitals, as shown here.
        </t>
      </section>
    </section>
    <section>
      <name>The skeleton logical bitstream</name>
      <section>
        <name>Ident Header</name>
        <t>
        The skeleton logical bitstream starts with an ident header
        containing information for the complete Ogg physical bitstream.
        The ident header has the following format:
      </t>
        <figure>
          <name>Ident header packet layout</name>
          <artwork><![CDATA[
0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier 'fishead\0'                                        | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Version major                 | Version minor                 | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Presentationtime numerator                                    | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Presentationtime denominator                                  | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Basetime numerator                                            | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Basetime denominator                                          | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| UTC                                                           | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Segment length in bytes                                       | 64-67
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 68-71
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Content byte offset                                           | 72-75
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 76-79
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          ]]></artwork>
        </figure>
        <t>
        Fields with more than one byte length <bcp14>MUST</bcp14> be encoded
        Least Significant Byte (LSB) first byte order.
      </t>
        <t>
        The fields in the skeleton ident header have the following meaning:
      </t>
        <dl>
          <dt>Identifier</dt>
          <dd>
            <t>
            An 8 byte field that identifies this bitstream as Ogg
            Skeleton. It contains the magic numbers:
          </t>
            <ul>
              <li>0x66 'f'</li>
              <li>0x69 'i'</li>
              <li>0x73 's'</li>
              <li>0x68 'h'</li>
              <li>0x65 'e'</li>
              <li>0x61 'a'</li>
              <li>0x64 'd'</li>
              <li>0x00 '\0'</li>
            </ul>
          </dd>
          <dt>Version major</dt>
          <dd>
          2 byte unsigned integer signifying the major version number of
          the skeleton bitstream.
          This document specifies the major version 4.
        </dd>
          <dt>Version minor</dt>
          <dd>
          2 byte unsigned integer signifying the minor version number of the
          skeleton bitstream.
          This document specifies the minor version 0.
        </dd>
          <dt>Presentationtime numerator &amp; denominator</dt>
          <dd>
          8 byte unsigned integer each. Together they represent the
          time at which to start presenting the Ogg physical bitstream
          given as a rational number. The denominator represents the
          temporal resolution at which the presentationtime is given.
          E.g. 5/1000 results in a presentationtime of 0.005 sec.
          This enables a very high temporal resolution without having
          to store floating point numbers. In a newly created physical
          bitstream presentationtime and basetime are the same. When
          remultiplexing a subpart of the stream, this number
          <bcp14>MUST</bcp14> be adapted to the requested start time offset of
          the newly created stream. Presentationtime <bcp14>MUST</bcp14> be
          larger than or equal to zero.
        </dd>
          <dt>Basetime numerator &amp; denominator</dt>
          <dd>
          8 byte signed integer each. Together they represent the
          basetime of the Ogg physical bitstream given as a rational
          number like the presentationtime. This number is fixed once
          the physical bitstream is created and provides a mapping to
          time for the beginning of the physical bitstream when it
          starts with a granule position of 0.
        </dd>
          <dt>UTC</dt>
          <dd>
          A 20 byte string containing a UTC time in the form of
          YYYYMMDDTHHMMSS.sssZ <xref target="ISO.8601.1988"/>.
          It associates a calendar date and a wall-clock time with the
          basetime. If unused the UTC field <bcp14>MUST</bcp14> be a sequence of
          20 NUL bytes, making this ident packet and thus the BoS page of the
          skeleton bitstream constant length.
        </dd>
        </dl>
        <t>
        The possible temporal resolution of the presentation and basetime is on
        the order of 2<sup>-64</sup>.
        For example, the time formats in use for media that are
        described in this document range from 1/24 to 1/60 for the
        different <xref target="SMPTE"/> formats.
        This resolution is enough for any one of these. It is also
        expected to accommodate any future needs of time resolution for
        any other time format and time-continuously sampled data.
      </t>
        <t>
        A denominator of 0 in either presentationtime or basetime indicates that
        the respective time is 0 regardless of the value of the numerator.
      </t>
      </section>
      <section>
        <name>Secondary Header</name>
        <t>
        The skeleton secondary headers are a sequence of packets that each
        contain information about one of the other logical bitstreams contained
        within the Ogg stream.
        A skeleton secondary header packet has the following format:
      </t>
        <figure>
          <name>Secondary header packet layout</name>
          <artwork><![CDATA[
0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier 'fisbone\0'                                        | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Offset to message header fields                               | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Serial number                                                 | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Number of header packets                                      | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Granulerate numerator                                         | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Granulerate denominator                                       | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Basegranule                                                   | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Preroll                                                       | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Granuleshift  | Padding/future use                            | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Message header fields ...                                     | 52-
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          ]]></artwork>
        </figure>
        <t>
        Fields that are longer than one byte <bcp14>MUST</bcp14> be encoded with
        LSB first byte order.
      </t>
        <t>
        The fields in a skeleton secondary header packet have the
        following meaning:
      </t>
        <dl>
          <dt>Identifier</dt>
          <dd>
            <t>
            An 8 byte field that identifies this packet as a skeleton
            secondary header for identifying other logical bitstreams. It
            contains the magic numbers:
          </t>
            <ul>
              <li>0x66 'f'</li>
              <li>0x69 'i'</li>
              <li>0x73 's'</li>
              <li>0x62 'b'</li>
              <li>0x6f 'o'</li>
              <li>0x6e 'n'</li>
              <li>0x65 'e'</li>
              <li>0x00 '\0'</li>
            </ul>
          </dd>
          <dt>Offset to message header fields</dt>
          <dd>
          4 byte unsigned integer that contains the number of bytes used in this
          packet before the message header fields.
          For the version of the skeleton bitstream described in this document
          this number is fixed to 44.
          This field accommodates future changes to the skeleton bitstream
          allowing implementations to parse message header fields even if
          additional fields have been added in a future revision of Skeleton.
        </dd>
          <dt>Serial number</dt>
          <dd>
          4 byte signed integer containing the bitstream serial number of the
          logical bitstream described by this secondary header packet.
        </dd>
          <dt>Number of header packets</dt>
          <dd>
          4 byte unsigned integer that contains the number of header
          packets of the referenced logical bitstream consisting of
          the BoS page and the secondary header pages that are included before
          the Skeleton EoS page.
        </dd>
          <dt>Granulerate numerator &amp; denominator</dt>
          <dd>
          8 byte signed integer each. They represent the temporal
          resolution of the logical bitstream in Hz given as a
          rational number in the same way as the basetime field.
        </dd>
          <dt>Basegranule</dt>
          <dd>
          8 byte signed integer that represents the granule number with which
          this logical bitstream starts, which is originally 0, but will be a
          positive offset when only a subpart of the stream is requested.
        </dd>
          <dt>Preroll</dt>
          <dd>
          4 byte unsigned integer that contains the number of packets
          to pre-roll in order to decode a current packet correctly.
          This is for example the case with Ogg Vorbis, which requires
          a pre-roll of 2 packets.
        </dd>
          <dt>Granuleshift</dt>
          <dd>
          A 1 byte unsigned integer describing whether to partition the
          <tt>granule_position</tt>
          (<xref target="RFC3533" sectionFormat="comma" section="6"/>) into two
          for that logical bitstream, and how many of the lower bits to use for
          the partitioning.
          The upper bits signify a time-continuous granule position for an
          independently decodable and presentable data granule. The
          lower bits are generally used to specify the relative offset
          of dependent packets, such as predicted frames of a video.
          Hence these can be addressed, though not decoded without
          tracing back to the last fully decodable data granule. This
          is the case with Ogg Theora; the general procedure is given
          in <xref target="granule_mapping"/>.
        </dd>
          <dt>Padding/future use</dt>
          <dd>
          3 bytes of padding data that <bcp14>MUST</bcp14> be set to zero but
          may be used in future versions of Skeleton.
        </dd>
          <dt>Message header fields</dt>
          <dd>
          Header fields following the generic Internet Message Format defined in
          <xref target="RFC2822"/>.
          Each header field consists of a name followed by a colon
          (0x3a ":") and the field value.
          Field names are case-insensitive.
          The field value <bcp14>MAY</bcp14> be preceded by any
          amount of white space, though a single space (SP, ASCII value 32) is
          <bcp14>RECOMMENDED</bcp14>.
          Multi-line header fields as described in
          <xref target="RFC2822" sectionFormat="of" section="2.2.3"/>
          <bcp14>MUST</bcp14> be supported.
        </dd>
        </dl>
        <section>
          <name>Message Header Fields</name>
          <t>
          The following message headers are <bcp14>REQUIRED</bcp14>:
        </t>
          <dl>
            <dt>Content-type</dt>
            <dd>
            Mime type of the content encoded in this stream, e.g. audio/opus,
            video/theora, etc.
          </dd>
            <dt>Role</dt>
            <dd>
            Describes the function of this track.
            Common examples are "video/main", "audio/main", "text/caption".
            It is <bcp14>RECOMMENDED</bcp14> to stick to the existing role names
            defined in
            <xref target="SkeletonHeaders" relative="#Role" section="Role"/>.
          </dd>
            <dt>Name</dt>
            <dd>
            A unique free text string which can be used to directly address or
            identify the track and which may be shown in user interfaces that
            list or allow for selection of tracks.
          </dd>
          </dl>
          <t>
          For more message headers, see <xref target="SkeletonHeaders"/>.
        </t>
          <t>
          As per <xref target="RFC2277"/>, message header fields are considered
          protocol data, i.e. they are not expected to have human readable text,
          and they <bcp14>MUST</bcp14> be entirely encoded in UTF-8.
          In addition, the mandatory header fields <bcp14>MUST</bcp14> be
          encoded in ASCII.
        </t>
        </section>
      </section>
      <section>
        <name>Keypoint Index Packets</name>
        <t>
        Before the Skeleton EoS page in the segment header pages come the
        keypoint index packets.
        Index packets are <bcp14>OPTIONAL</bcp14> in a valid Skeleton bitstream.
        If index packets are included, there <bcp14>SHOULD</bcp14> be at least
        one keypoint packet for each content logical bitstream in the Ogg
        stream.
      </t>
        <t>
        In order to save space, the offsets and timestamps in keypoint packets
        are stored as deltas, and then variable byte-encoded.
        The offset and timestamp deltas store the difference between the
        keypoint's offset and timestamp from the previous keypoint's
        offset and timestamp.
        To calculate the page offset of a keypoint, calculate the sum of the
        offset deltas of up to and including the keypoint in question.
        The variable byte encoded integers are encoded using 7 bits per byte to
        store the integer's bits, and the high bit is set in the last byte used
        to encode the integer.
        The bits and bytes are in LSB first byte order.
        For example, the integer 7843, or 0001 1110 1010 0011 in binary, would
        be stored as two bytes: 0xBD 0x23, or 1011 1101 0010 0011 in binary.
      </t>
        <t>
        Each index packet contains the following fields:
      </t>
        <figure>
          <name>Keypoint index packet layout</name>
          <artwork><![CDATA[
0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier 'index\0'                                          | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              | Serial number                  | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              | Number of keypoints            | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              | Timestamp denominator          | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              | First sample time numerator    | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              | Last sample end time numerator | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              | Keypoints...                   | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          ]]></artwork>
        </figure>
        <t>
        The fields in an index packet have the following meaning:
      </t>
        <dl>
          <dt>Identifier</dt>
          <dd>
            <t>
            An 6 byte field that identifies this page as an index packet.
            It contains the magic numbers:
          </t>
            <ul>
              <li>0x69 'i'</li>
              <li>0x6e 'n'</li>
              <li>0x64 'd'</li>
              <li>0x65 'e'</li>
              <li>0x78 'x'</li>
              <li>0x00 '\0'</li>
            </ul>
          </dd>
          <dt>Serial number</dt>
          <dd>
          4 byte signed integer containing the bitstream serial number
          of the Ogg logical bitstream described by this index packet.
        </dd>
          <dt>Number of keypoints</dt>
          <dd>
          8 byte unsigned integer containing the number of keypoints included at
          the end of this index packet.
          This value <bcp14>MAY</bcp14> be zero.
        </dd>
          <dt>Timestamp denominator</dt>
          <dd>
          8 byte signed integer representing the presentation time denominator
          for this stream.
          All timestamps, including keypoint timestamps and first and last
          sample timestamps are fractions of seconds over this denominator.
          The timestamp denominator <bcp14>MUST NOT</bcp14> be 0.
        </dd>
          <dt>First sample time numerator</dt>
          <dd>
          8 byte signed integer representing the numerator for the presentation
          time of the first sample in the track.
        </dd>
          <dt>Last sample end time numerator</dt>
          <dd>
          8 byte signed integer representing the numerator for the presentation
          end time of the last sample in the track.
        </dd>
          <dt>Keypoints</dt>
          <dd>
            <t>
              'Number of keypoints' key points, starting with the first keypoint
              at offset 42.
            </t>
          </dd>
        </dl>
        <section>
          <name>Keypoints</name>
          <t>
            Each keypoint comprises the following fields:
          </t>
          <ol>
            <li>
              The keypoint's page's byte offset delta, as a variable byte
              encoded integer.
              This is the number of bytes that this keypoint is
              after the preceeding keypoint's offset, or from the start of the
              segment if this is the first keypoint.
              The keypoint's page start is therefore the sum of
              the byte-offset-deltas of all the keypoints which
              come before it.
            </li>
            <li>
              The presentation time numerator delta of the first keypoint which
              starts on the page at the keypoint's offset, as a variable byte
              encoded integer.
              This is the difference from the previous keypoint's timestamp
              numerator.
              The keypoint's timestamp numerator is therefore the sum of all the
              timestamp numerator deltas up to and including this keypoint's.
              Divide the timestamp numerator sum by the timestamp denominator to
              determine the presentation time of the keyframe in seconds
            </li>
          </ol>
          <t>
          Keypoint's <bcp14>MUST</bcp14> be stored in increasing order by
          offset (and thus by presentation time).
        </t>
          <t>
          The byte offsets stored in keypoints are relative to the start of the
          Ogg bitstream segment.
          For a physical Ogg bitstream made up of two chained Oggs,
          the offsets in the second Ogg segment's bitstream's index
          are relative to the beginning of the second Ogg in the
          chain, not the first.
          If a physical Ogg bitstream is made up of chained Oggs, the presence
          of an index in one segment does not imply that there will be an index
          in any other segment.
        </t>
          <t>
          The first-sample-time and last-sample-time are rational numbers, in
          seconds.
          If the denominator is 0 for the first-sample-time or the
          last-sample-time, then that value was unable to be determined at
          indexing time, and is unknown.
        </t>
          <t>
          The exact number of keypoints used to construct the index is up to the
          indexer, but it is <bcp14>RECOMMENDED</bcp14> to include at most one
          keypoint per every 64KB of data, or every 1000ms, whichever is least
          frequent.
          For codecs which use key frames, indexers <bcp14>SHOULD</bcp14> create
          one keypoint pointing to each key frame.
        </t>
        </section>
      </section>
    </section>
    <section>
      <name>Ogg Media Mapping</name>
      <t>
        Because Ogg may be used to encapsulate any any type of time-continuous
        data stream it does not place requirements on the order of data streams
        in the file.
        An Ogg physical bitstream with a Skeleton track, however,
        <bcp14>MUST</bcp14> have the following page order:
      </t>
      <ul>
        <li>Skeleton Beginning-of-Stream (BoS) page.</li>
        <li>BoS pages of the other logical bitstreams.</li>
        <li>
          Secondary header pages of all logical bitstreams (including Skeleton
          secondary headers).
        </li>
        <li>Skeleton keypoint index packets (if present)</li>
        <li>Skeleton End-of-Stream (EoS) page.</li>
        <li>
          Data and EoS pages of logical bitstreams, excluding skeleton,
          multiplexed in a time-synchronous fashion.
        </li>
      </ul>
      <t>
        In addition, Skeleton has the following further restrictions:
      </t>
      <ul>
        <li>
          An Ogg segment <bcp14>MUST NOT</bcp14> contain more than one Skeleton
          logical bitstream.
        </li>
        <li>
          The skeleton BoS page <bcp14>MUST</bcp14> only contain the Skeleton
          Ident (fishead) header as a single packet.
        </li>
        <li>
          The secondary header pages of a Skeleton logical bitstream consist
          only of fisbone header packets.
        </li>
        <li>
          The Skeleton stream <bcp14>MUST NOT</bcp14> contain any content
          pages or data packets other than keypoint index packets.
        </li>
        <li>
          The skeleton EoS page <bcp14>MUST</bcp14> contain a single packet
          of length zero.
        </li>
        <li>
          The skeleton EoS page <bcp14>MUST</bcp14> appear before any data pages
          for any other logical bitstream in the Ogg bitstream.
        </li>
        <li>
          The skeleton EoS page <bcp14>MUST</bcp14> end the skeleton logical
          bitstream.
          If an Ogg stream parser reaches the skeleton EoS page, it knows that
          it has received all the BoS and secondary header pages and can start
          setting up its decoding or parsing environment.
        </li>
      </ul>
    </section>
    <section>
      <name>Time Handling</name>
      <t>
        With time-continuous data inside Ogg, one needs to handle data
        at four different levels:
      </t>
      <ul>
        <li>at the byte level: upon seeking,</li>
        <li>at the packet level: upon encapsulating,</li>
        <li>at the granule level: upon recomposing, and</li>
        <li>at the time level: upon displaying and addressing.</li>
      </ul>
      <t>
        This section explains how they all fit together.
      </t>
      <section>
        <name>Conceptual Overview</name>
        <t>
          Ogg bitstreams represent a single timeline with multiple content
          tracks.
          All of these tracks relate to the same timeline which starts at a
          certain time point and ends when the last bitstream ends.
        </t>
        <t>
          An example bitstream can be seen in the following figure.
          It consists of an Ogg bitstream that contains 4 media bitstreams.
          The picture is a conceptual representation of the time intervals
          covered by the different logical bitstreams and the Ogg pages used to
          encapsulate the data.
          In the flat representation these are multiplexed such that the data
          packets of each of these bitstreams occur at the correct time.
        </t>
        <figure anchor="layout_example">
          <name>Ogg bitstream layout example</name>
          <artwork><![CDATA[

                             t_url
                               |
t_0                            v                                      t_n
|------------------------------------------------------------------->|
----------------------------------------------
|  |  |  |  |  |  |  |  |  |  |//|  |  |  |  |
----------------------------------------------
audio bitstream 1
        -------------------------------------------------------------
        |     |     |     |/////|     |     |     |     |     |     |
        -------------------------------------------------------------
        video bitstream 1
                 ----------------------------------------------------
                 |  |  |  |  |//|  |  |  |  |  |  |  |  |  |  |  |  |
                 ----------------------------------------------------
                 audio bitstream 2
                        -------------------------------
                        |     |/////|     |     |     |
                        -------------------------------
                        video bitstream 2
            ]]></artwork>
        </figure>
        <t>
          The time point at which an Ogg bitstream starts
          (<tt>t<sub>0</sub></tt> in <xref target="layout_example"/>) is called
          the "basetime" and represents the time in seconds associated with the
          granule position of 0 on all logical bitstreams.
          Typically, a newly created Ogg file starts all its logical bitstreams
          at granule position 0, and a typical extract of an Ogg bitstream, such
          as the one starting at <tt>t<sub>url</sub></tt> in
          <xref target="layout_example"/>, starts each of its logical bitstreams
          at different granule positions.
          These granule positions are stored in the "basegranule" field of the
          skeleton secondary header packets.
        </t>
        <t>
          The "basetime" of an Ogg bitstream may be 0, but it can also be any
          positive time. For example, in professional video production,
          the first frame of video of a program normally refers to a
          <xref target="SMPTE">SMPTE basetime</xref> of 01:00:00:00, not
          00:00:00:00.
          Associating such a practice to a digital video resource requires
          a way to store that basetime with the resource and interpreting it
          correctly when addressing offsets such as <tt>t<sub>url</sub></tt>.
          Skeleton provides such a mapping through the basetime field in the
          skeleton ident header.
        </t>
        <t>
          Also associated with the basetime is an <xref target="ISO.8601.1988"/>
          calendar date and wall-clock time (a "UTC base") which represent a
          real-world time giving some meaningful calendar date association to
          the content such as the creation time or the first presentation time.
          The UTC base is specified in the "UTC" field of the Skeleton ident
          header.
        </t>
      </section>
      <section anchor="granule_mapping">
        <name>Mapping a granule position to a time position</name>
        <t>
          Each one of the encapsulated data bitstreams have their own temporal
          resolution at which they provide data to cover the given timeline.
          This temporal resolution is usually given through the
          sampling rate of the particular bitstream.
          For example, a raw audio bitstream at CD quality is sampled
          with a sampling rate of 44100 Hz.
          A video bitstream may be sampled with a frame rate of 25
          frames per second.
        </t>
        <t>
          This temporal resolution is called the "granulerate".
          A granule is a data element that is based on a regular data rate
          specific to the content type, such as the frame rate for video or
          the sampling rate for audio.
          It even exists for bitstreams that are not sampled at a regular
          rate—then it is the highest resolution of any of the used
          sampling rates.
          The granulerate is specified in the skeleton secondary header packets
          for each logical bitstream.
        </t>
        <t>
          Each one of the bitstreams insert data into the Ogg bitstream
          through packets which have an associated temporal duration based on
          the encoder packaging.
          Packets are packaged into Ogg pages, which have a granule position
          associated with them
          (<xref target="RFC3533" sectionFormat="comma" section="6"/>).
          Not taking the special case of a granuleshift into account,
          the granule position specifies the number of granules that have been
          encapsulated since the implicit start of the original bitstream until
          and including the given Ogg page.
        </t>
        <t>
          The granule position together with the granulerate and granuleshift
          information of the skeleton secondary header packets for the
          particular logical bitstream are used for the calculation of
          the time position for which a data packet of the logical
          bitstream completes data.
          A granule position of -1 indicates a special case and
          <bcp14>MUST NOT</bcp14> be used for calculation of a mapping to time.
        </t>
        <t>
          In principle, the granule position of an Ogg page divided by the
          granulerate of this page's logical bitstream provides the time
          position that is reached in that bitstream after decoding all data
          packets finished on this page. However, the granule position field
          in an Ogg page allows for a more finely-grained description of
          the temporal position. The following image explains the composition
          of the granule position field in an Ogg page:
        </t>
        <figure>
          <name>Granule position field layout</name>
          <artwork><![CDATA[
granule_position
------------------------------------------------
|  keyindex               |  keyoffset         |
------------------------------------------------
            ]]></artwork>
        </figure>
        <t>
          The granuleshift field of the skeleton secondary header packets
          describes how many of the <tt>granule_position</tt>'s 64 bits are
          being used for the keyoffset.
          The keyoffset part of the <tt>granule_position</tt> is commonly used
          when the logical bitstream consists of packets that can only be fully
          decoded when referring back to a previous packet.
          For example, video streams often consist of inter and intra coded
          frames, where the intra frames are fully decodable and the inter
          frames are intermediate frames that require backtracking to the
          last inter frame for accurate decoding. Another example is a
          logical bitstream that is mapped as instantaneous information (i.e.
          their granuleposition represents the start time and the end time of
          the packet data), but actually has a duration associated to it, which
          is provided through a subsequent packet. The keyindex part of the
          granule_position is then used to provide the temporal position of the
          reference packet and the keyoffset part provides a counter
          for the data in between.
        </t>
        <t>
          The calculation of the temporal position of an Ogg page using
          Skeleton is thus specified through the following formula:
        </t>
        <figure>
          <name>Ogg page to temporal position formula</name>
          <sourcecode><![CDATA[
t_page = basetime + ((keyindex + keyoffset) / granulerate)
            ]]></sourcecode>
        </figure>
        <t>
          The basetime provides the time offset used at the beginning of the
          logical bitstream for the first data packet and thus
          <bcp14>MUST</bcp14> be added for a correct calculation of the temporal
          position.
        </t>
        <t>
          As an example regard an audio bitstream that has a granulerate
          of 44100 (i.e. 44100 samples per 1 sec), a granuleshift of 0,
          and starts at 4 sec. When reaching a granule_position of 88200, this
          maps to a time position of 6 seconds:
        </t>
        <figure>
          <sourcecode><![CDATA[
t_page = 4 + ((88200 + 0) / 44100) = 6
            ]]></sourcecode>
        </figure>
        <t>
          This signifies that the bitstream has reached the second sec of the
          audio bitstream after the end of decoding this page's packets, but
          maps to 6 seconds because of the basetime.
        </t>
        <t>
          As another example consider a video bitstream that has a granulerate
          of 25 (i.e. 25 frames per 1 second), a granuleshift of 3 (because
          it encodes - say - 7 partial frames between each fully encoded frame),
          and starts at 0 sec. When reaching a granule_position of 997, i.e.
          a keyindex of 62 and a keyshift of 5, this maps to a fully decodable
          time position of 2.68 seconds:
        </t>
        <figure>
          <sourcecode><![CDATA[
t_page = 0 + ((62 + 5) / 25) = 2.68 sec
            ]]></sourcecode>
        </figure>
        <t>
          The granulerate of a time-instantaneous bitstream can be chosen
          arbitrarily by the bitstream multiplexer. Per default, a
          granulerate of 1000 is used, which is the resolution of npt.
          The resolution of all the time schemes is given as:
        </t>
        <dl>
          <dt>npt</dt>
          <dd>1000 (milliseconds)</dd>
          <dt>smpte-24</dt>
          <dd>24 (24 fps)</dd>
          <dt>smpte-24-drop</dt>
          <dd>24/1.001 = 23.976 (approx. as per SMPTE)</dd>
          <dt>smpte-25</dt>
          <dd>25</dd>
          <dt>smpte-30</dt>
          <dd>30</dd>
          <dt>smpte-30-drop</dt>
          <dd>30/1.001 = 29.970 (approx. as per SMPTE)</dd>
          <dt>smpte-50</dt>
          <dd>50</dd>
          <dt>smpte-60</dt>
          <dd>60</dd>
          <dt>smpte-60-drop</dt>
          <dd>60/1.001 = 59.940 (approx. as per SMPTE)</dd>
        </dl>
        <t>
          The granule position of the page finishing data of a
          time-instantaneous bitstream packet <bcp14>MUST</bcp14> signify the
          start time of that packet. For example, a CMML bitstream with a
          granulerate of 1000, a basetime of 0, and a clip that lasts from
          npt=12.020 till npt=15.0 will get a granule_position of 12020. In
          contrast, the granule_position of the page finishing data of e.g.
          an audio bitstream with granulerate 44100, basetime 0 and containing 
          data from npt=12.020 to npt=15.0 will be 661500.
        </t>
        <t>
          A note about field overflows: an overflow of the granule
          position field can destroy the temporal integrity of the Ogg
          physical bitstream. In this case, a multiplexer <bcp14>MUST</bcp14>
          end the Ogg physical bitstream and restart a new one resetting the
          counter to 0 and adjusting the basetime appropriately. This is also
          called sequential multiplexing in Ogg. The same measure MUST be taken
          in case of an overflow of the page_sequence_number on one of
          the logical bitstreams.
        </t>
      </section>
      <section>
        <name>Seeking into the bitstream</name>
        <t>
          Seeking to a time offset inside an Ogg logical bitstream is
          a fundamental activity frequently performed on media data. Time 
          inside an Ogg with a Skeleton track is specified as a temporal offset
          from the "beginning" of the stream, making use of the basetime
          field. Time offsets can also be specified as calendar dates and
          times. The UTC base is then used as a basis for offsetting.
        </t>
        <t>
          The basetime allows to correctly map a temporal offset point such as
          a temporal URI to a byte position in the stream. In the above figure
          take t_uri=npt:14.0 as the temporal offset addressed on a stream with
          t_0=npt:5.0 as the basetime—this requires a stream offsetting of only
          9 sec to the appropriate granule position in each of the bitstreams,
          in the figure marked through patterned pages.
        </t>
        <t>
          The seeking action is performed on the interleaved bitstream, in
          which the data packets occur in a temporally consecutive order based
          on the time at which their data ends. These times are represented in
          the granule positions of the Ogg pages, which are only allowed to
          monotonically increase within one logical bitstream. This
          implies that when having found an Ogg page with a granule position
          that maps to a given seek time (i.e. covers the time or ends at it),
          the seek has found the right location. This applies over all logical
          bitstreams.
          In the above example, this means that the byte position of the first
          occurring page of the patterned pages has been found.
        </t>
        <t>
          There is a complication to the seeking: some logical bitstreams have
          backwards dependencies in their data packets and these have to be
          taken into account for seeking. For example, a logical
          bitstream may require several of its previous packets to
          allow a correct and complete decoding of the actual packet
          that occurs at the seektime. This is the case for Theora
          which requires to go back to the previous keyframe when decoding
          from a time offset. It is also the case for Vorbis which requires the
          previous 2 packets for accurate setup of the frequency transform -
          Speex needs approximately 2 packets for similar reasons.
          Even instantaneous bitstreams may require to go back to a
          previous packet to recover the last state information.
        </t>
        <t>
          Therefore, once seeking has located the correct byte position that
          refers to the given temporal offset, it <bcp14>MUST</bcp14> seek back.
          For logical bitstreams that have a non-zero "granuleshift"
          in the skeleton, it <bcp14>MUST</bcp14> seek back to the Ogg page that
          has a "keyindex" granule position. For logical bitstreams that have
          a non-zero "preroll" in the skeleton, it <bcp14>MUST</bcp14> seek back
          that many packets. The earliest byte position that satisfies
          all these requirements is the correct seek position.
        </t>
        <t>
          A player that presents from an offset <bcp14>MUST</bcp14> take into
          account that the bitstream may contain some packets that are
          only there to allow accurate decoding of the seek time. When
          the backwards dependencies were resolved for a specific
          logical bitstream, several non-relevant Ogg pages of may
          also have ended up in the intermediate. These have to be
          skipped by a player. The time that a player <bcp14>MUST</bcp14> start
          presenting from is given in the "presentationtime" in the
          skeleton ident header.
        </t>
      </section>
      <section>
        <name>Remultiplexing an Ogg Bitstream</name>
        <t>
          Ogg with a Skeleton track allows for the creation of mashups of
          a file without actual decoding and re-encoding. A mashup in the sense
          used here is when a subpart of a Ogg physical bitstream is required,
          such as a temporal sub-interval from the whole file. Skeleton allows
          the creation of the mashup bitstream through recomposition and 
          remultiplexing. There are several
          aims for performing the remultiplexing with as little effort and 
          therefore as little delay as possible:
        </t>
        <ul>
          <li>
            no decoding of the logical bitstreams is performed.
          </li>
          <li>
            no changes to the pages, in particular to the granule
            positions are made.
          </li>
          <li>
            changes occur only to the control section.
          </li>
        </ul>
        <t>
          The fields of the skeleton track allow achievement of all these aims.
          Remultiplexing is essentially achieved by seeking to the position as
          described above and then including from each logical bitstream only
          the relevant Ogg pages into the new stream.
          Changes to fields in the bitstream are restricted to the
          control section:
        </t>
        <ul>
          <li>
            the "presentationtime" <bcp14>MUST</bcp14> be adjusted to the
            requested start time
          </li>
          <li>
            the "startgranule" for each logical bitstream <bcp14>MUST</bcp14> be
            adjusted to the granule position at which each logical
            bitstream starts. This is not the first granule position
            of the Ogg pages included into the bitstream, but rather
            the last one that did not get included, as it represents
            the start time of the bitstream.
          </li>
        </ul>
        <t>
          Everything else, and in particular the Ogg pages, stay the same. This
          is important to allow caching of such files as is required for Web
          proxies.
        </t>
      </section>
    </section>
    <section anchor="IANA">
      <name>IANA Considerations</name>
      <t><cref>
          TODO: define an Ogg Skeleton Header Registry and register the stuff
          from the <xref target="SkeletonHeaders"/> wiki page?
        </cref>
        This memo includes no request to IANA.
      </t>
    </section>
    <section anchor="Security">
      <name>Security Considerations</name>
      <t>
        Ogg format bitstreams contain several multiplexed binary and
        non-binary data bitstream. There is no generic encryption or
        signing mechanism provided for the complete bitstream or anyone of
        its parts. As the format of the encapsulated media bitstreams is
        not prescribed and is identified through the "Content-type"
        Message header field in that bitstream's skeleton secondary
        header packet, it is possible to encrypt or sign that media
        bitstream and then mark it accordingly with a MIME type that
        signifies the encryption. It is up to the applications that use
        this bitstream to provide an appropriate codec to handle such
        bitstreams.
      </t>
      <t>
        As Ogg format bitstreams generally contain arbitrary bitstreams,
        it is possible to include executable content in them.
        This can be an issue with applications that decode these
        bitstreams, especially when they are used in a network scenario.
        Such applications <bcp14>MUST</bcp14> ensure correct handling of
        manipulated bitstreams, of buffer overflow and the like.
      </t>
    </section>
  </middle>
  <back>
    <references>
      <name>Normative References</name>
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2822.xml"/>
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3533.xml"/>
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml2/reference.ISO.8601.1988.xml"/>
    </references>
    <references>
      <name>Informative References</name>
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2277.xml"/>
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6716.xml"/>
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9639.xml"/>
      <reference anchor="SkeletonHeaders" target="https://wiki.xiph.org/SkeletonHeaders">
        <front>
          <title>Skeleton Headers</title>
          <author>
            <organization>Xiph.Org</organization>
          </author>
          <date month="March" year="2026"/>
        </front>
      </reference>
      <reference anchor="SMPTE">
        <front>
          <title>SMPTE STANDARD for Television, Audio and Film - Time and Control Code</title>
          <author>
            <organization abbrev="SMPTE">The Society of Motion Picture and Television Engineers</organization>
            <address>
              <postal>
                <street>595 W. Hartsdale Ave.</street>
                <city>White Plains</city>
                <region>NY</region>
                <code>10607</code>
                <country>USA</country>
              </postal>
              <email>smpte@smpte.org</email>
            </address>
          </author>
          <date month="September" year="1999"/>
        </front>
        <seriesInfo name="ANSI" value="12M-1999"/>
      </reference>
      <reference anchor="Theora" target="https://xiph.org/theora/doc/Theora.pdf">
        <front>
          <title>Theora Specification</title>
          <author>
            <organization>Xiph.Org</organization>
          </author>
          <date year="2017" month="June" day="03"/>
        </front>
      </reference>
    </references>
    <section anchor="Acknowledgements" numbered="false">
      <name>Acknowledgements</name>
      <t>
        Thanks to Silvia Pfeiffer, and Conrad D. Parker for their work
        on an earlier attempt to specify the Skeleton bitstream which
        was consulted (and stolen from) heavily while writing this
        document.
        Thanks also to Christopher Montgomery and Andre Pang who are thanked in
        the earlier draft mentioned previously and (probably) deserve to be
        thanked again.
        Also to the contributors to the Ogg Skeleton description on the Xiph.Org
        wiki.
      </t>
    </section>
  </back>
</rfc>
