Internet-Draft Scalable Quality Extension June 2026
Valin Expires 14 December 2026
Workgroup: mlcodec
Internet-Draft: draft-ietf-mlcodec-opus-scalable-quality-extension-02
Updates: 6716 (if approved)
Published: June 2026
Intended Status: Standards Track
Expires: 14 December 2026
Author: JM. Valin (Google)

Scalable Quality Extension for the Opus Codec (Opus HD)

Abstract

This document updates [RFC6716] to add support for a scalable quality layer.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 14 December 2026.

1. Introduction

This document updates [RFC6716] to add support for a scalable quality extension layer. Implementations conforming to this document will be referred to as Opus HD.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Scalable Quality Extension

The Opus codec was designed to operate at sampling frequencies up to 48 kHz, with an audio bandwidth up to 20 kHz. The CELT mode that is used for high bitrate coding uses vector quantization with a mostly implicit bit allocation system that is dictated by the bitstream definition. Opus can allocate up to 8 bits per MDCT bin in some of the bands.

While the Opus capabilities listed above are sufficient to achieve perceptually transparent audio coding, there is a use for codecs that scale beyond those specifications. That includes the current market for 24-bit/96 kHz codecs, but also any application where the intended recipient is not (only) a human being, e.g., ultrasonic applications.

This document proposes a scalable quality extension layer that both increases the resolution of existing Opus quantizers below 20 kHz, and defines a way of coding audio above 20 kHz, with a sampling rate of 96 kHz. The extension is designed to be forward and backward compatible with [RFC6716]. All extra bits use the Opus extension mechanism defined in [opus-extension] and a 96 kHz decoder is designed to decode a regular 48 kHz [RFC6716] stream and vice versa.

The code implementing this draft is available on the main branch of the Opus repository at https://gitlab.xiph.org/xiph/opus/ and requires building with --enable-qext (or defining ENABLE_QEXT).

2.1. Extended resolution

To reduce the coding error, we need to increase the resolution for 3 different quantizers: the fine energy quantizer (scalar), the band pyramid vector quantizer (PVQ), and the band splitting angle quantizer. We also introduce a new cubic quantizer that scales to higher bit depths than PVQ. To preserve compatibility, all of the bits extending the Opus resolution are stored in the extension payload.

As bands are split into sub-vectors (during stereo coupling or band partitioning), the extension bit budget is split equally between the two partitions. Because the base Opus layer already accounts for the relative magnitudes of the sub-vectors in its own allocation offset, a simple equal split is sufficient to maintain a uniform refinement depth across the two halves.

For each sub-vector of dimension N, the refinement bit depth is derived by dividing its budget by the degrees of freedom (N-1 for PVQ, or 2*N-1 for the angle quantizer). In both cases, the depth is clamped between 2 and 14 bits (or 0 if the budget is insufficient). For PVQ sub-vectors, the depth is also reduced if the remaining payload budget is too small to fit the worst-case refinement.

2.1.1. Fine energy quantizer

For each band we can increase the resolution of the fine energy quantizer by adding extra bits. The extra bits are added in the same way as the regular fine energy quantizer adds resolution on top of the coarse energy quantizer.

2.1.2. PVQ

From a size-K PVQ codebook in N dimensions we can create an extended codebook of size u*K, where u is always odd and selected as 2^b-1, where b is the extra depth. Let y_i be the (integer) value for dimension i of the size-K codebook and z_i be the corresponding value for the size-u*K codebook. We define a refinement r_i = z_i - u*y_i where |r_i| < u. Only the refinement r_i needs to be coded since the regular Opus bitstream already includes y_i.

In the N=2 special case, only the single refinement r_0 is coded as a uniform integer in [-(u-1)/2, (u-1)/2]. If the base layer has y_1 != 0, the refined components are reconstructed as: z_0 = y_0 * u + r_0 * sgn(y_1) and z_1 = y_1 * u - r_0 * sgn(y_0). If the base layer has y_1 = 0, the sign of the refined component z_1 is implicit in r_0: z_0 = y_0 * u - |r_0| * sgn(y_0) and z_1 = -r_0 * sgn(y_0). No additional sign bits are coded for N=2.
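As an informal illustration of the N=2 reconstruction rules above, the following Python sketch rebuilds (z_0, z_1) from the base-layer pulses and the single refinement r_0. The function names are hypothetical and are not taken from the reference implementation. The sketch deliberately rejects y_0 = 0, because the text above does not say how sgn(0) is to be evaluated.

```python
def sgn(x: int) -> int:
    """Sign of a nonzero integer (+1 or -1)."""
    return 1 if x > 0 else -1


def refine_n2(y0: int, y1: int, r0: int, b: int) -> tuple:
    """Rebuild (z_0, z_1) from base-layer pulses (y_0, y_1) and r_0.

    b is the extra depth, so u = 2^b - 1 (always odd).
    """
    u = (1 << b) - 1
    half = (u - 1) // 2
    if y0 == 0:
        # Assumption of this sketch: the y_0 = 0 case is left out.
        raise ValueError("y_0 = 0 is not covered by this sketch")
    if not -half <= r0 <= half:
        raise ValueError("r_0 must lie in [-(u-1)/2, (u-1)/2]")
    if y1 != 0:
        z0 = y0 * u + r0 * sgn(y1)
        z1 = y1 * u - r0 * sgn(y0)
    else:
        # The sign of z_1 is implicit in the sign of r_0.
        z0 = y0 * u - abs(r0) * sgn(y0)
        z1 = -r0 * sgn(y0)
    return z0, z1
```

In each covered case the refined vector keeps |z_0| + |z_1| = u*K, where K = |y_0| + |y_1|, which is what allows the decoder to stay on the extended codebook without extra sign bits.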

For N>2, even though all |r_i| < u are allowed, smaller values of r_i are more likely, so we benefit from entropy coding r_i. We assume that the likelihood of |r_i| < (u+1)/2 is 7/8 and use that probability for decoding a "large" flag. If large=0, we decode b bits and subtract (u-1)/2 to get r_i. The maximum value 2^b-1 maps to r_i = (u+1)/2 and aliases with a large=1 case. If large=1, we decode a sign bit, followed by an integer with b-1 bits to which we add (u+1)/2 and apply the sign.

To prevent exceeding the bit budget, the encoder and decoder calculate the worst-case size of the refinement, which is (N-1)*(b+3)+1 bits. If the remaining bit budget in the extension payload is less than or equal to this worst-case bound, the probability for the "large" flag falls back to 1/2 (uncompressed uniform coding) to guarantee that the budget is respected.

The last residual value r_{N-1} does not need to be coded since its value can be inferred from the other values and the knowledge that the sum of the absolute values is u*K. The only exception for N > 2 is when y_{N-1}=0, in which case a single sign bit is coded, but the magnitude is still inferred.
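The mapping from decoded symbols to r_i for N>2 can be sketched as follows. This is only an illustration: the entropy decoder itself is left out, and the function and parameter names (refinement_value, raw, negative, and so on) are hypothetical rather than taken from the reference code.

```python
from fractions import Fraction


def refinement_value(b: int, large: bool, raw: int, negative: bool = False) -> int:
    """Map already-decoded symbols to r_i for extra depth b (u = 2^b - 1)."""
    u = (1 << b) - 1
    if not large:
        # raw is a b-bit integer in [0, 2^b - 1]
        return raw - (u - 1) // 2
    # raw is a (b-1)-bit integer in [0, 2^(b-1) - 1]
    magnitude = raw + (u + 1) // 2
    return -magnitude if negative else magnitude


def worst_case_bits(n: int, b: int) -> int:
    """Worst-case refinement size for a sub-vector of dimension n."""
    return (n - 1) * (b + 3) + 1


def small_probability(remaining_bits: int, n: int, b: int) -> Fraction:
    """Probability assigned to large=0, with the 1/2 fallback."""
    if remaining_bits <= worst_case_bits(n, b):
        return Fraction(1, 2)
    return Fraction(7, 8)
```

For b=3 (u=7), a large=0 symbol with raw=7 and a large=1 symbol with raw=0 and a positive sign both give r_i = 4, which is the aliasing mentioned above.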

2.1.3. Angle quantizer

When using mid-side stereo or when splitting a band, we code an angle representing the atan of two sub-vectors' magnitude ratio. The standard Opus encoder can code angles with up to 8 bits. In a similar way to how we define the PVQ refinement, we pick u = 2^b-1 where u is the number of (equidistant) extra quantization levels to be added between each of the original levels. We code a uint symbol between 0 and u-1, where 0 is almost mid-point to the previous (lower) quantization level, u-1 is almost mid-point to the next (higher) level, and (u-1)/2 lines up exactly with the quantization level originally selected by the standard Opus layer.
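One way to read the angle refinement is as an index on a grid that is u times finer than the base-layer grid. The sketch below follows that reading. The name refined_angle_index and the choice of returning a fine-grid index are assumptions made for illustration and may not match how the reference decoder represents the angle.

```python
def refined_angle_index(itheta: int, symbol: int, b: int) -> int:
    """Return the angle index on a grid u times finer than the base grid.

    itheta is the base-layer angle index and symbol is the decoded
    refinement in [0, u-1], with u = 2^b - 1.
    """
    u = (1 << b) - 1
    if not 0 <= symbol < u:
        raise ValueError("symbol must lie in [0, u-1]")
    # symbol == (u-1)/2 lands exactly on the base-layer level.
    return itheta * u + symbol - (u - 1) // 2
```

Under this reading, the highest symbol for base level k and the lowest symbol for base level k+1 are adjacent on the fine grid, so the refined levels tile the range without overlap.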

2.1.4. Cubic quantizer

The existing Opus PVQ only scales up to 32-bit codebooks. For cases where there is no PVQ in the base Opus layer, we define a new cubic quantizer. Whereas the PVQ codebook is defined as a reflected simplex warped onto the unit sphere, the cubic quantizer warps an N-dimensional cubic shell to the same unit sphere. Cubic codewords specify which face of the cube the vector lies on by coding the dimension of the largest component as a uniform range-coded integer from 0 to N-1, followed by a 1-bit sign of that component. The face of an N-dimensional hyper-cube shell is a full (N-1)-dimensional cube and can be coded with N-1 scalar values from 0 to Q-1. To minimize complexity, each scalar is encoded directly using b bits, consuming exactly (N-1)*b bits in the bitstream. We use even Q (Q=2^b) for non-transient bands (B==1) and odd Q (Q=2^b-1) for transient bands (B>1).

For transient bands, the maximum possible bitwise value (2^b-1) is simply left unused by the encoding. If a decoder encounters this unused value, it MUST process it normally without applying any clamping. In the cubic synthesis process, this evaluates to a coordinate slightly outside the bounds of the hyper-cube face. Because the entire reconstructed vector is subsequently L_2-normalized to the target band energy, this evaluates to a safe, bounded position on the unit sphere. Processing the value blindly prevents the need for branching logic in the decoder's critical path.
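The sizes involved in a cubic codeword can be sketched as below. The helper names are hypothetical, and the total is only an estimate: it assumes that the uniform range-coded face index costs log2(N) bits.

```python
import math


def cubic_levels(b: int, num_blocks: int) -> int:
    """Number of levels Q per scalar: 2^b if B == 1, else 2^b - 1."""
    return (1 << b) if num_blocks == 1 else (1 << b) - 1


def cubic_scalar_bits(n: int, b: int) -> int:
    """Bits spent on the N-1 face coordinates, b bits each."""
    return (n - 1) * b


def cubic_codeword_bits(n: int, b: int) -> float:
    """Estimated total: face index + sign bit + face coordinates."""
    return math.log2(n) + 1 + cubic_scalar_bits(n, b)
```

For a transient band with b=4, Q is 15, so the raw value 15 is never produced by an encoder but still fits in the 4 bits that are read.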

2.2. Extended frequency range

To extend the audio bandwidth, we define more frequency bands. Because psychoacoustics is no longer involved past 20 kHz, all new bands are defined to have a width of 2 kHz. Therefore, when encoding 48-kHz content we add 2 extra bands and when encoding 96-kHz content, we add 14 extra bands. A flag is encoded to specify whether 2 or 14 bands are added. The decoder uses that flag to know how many bands to decode, regardless of whether decoding at 48 or 96 kHz.

The exact band boundaries in terms of MDCT bin indices depend on the short MDCT block size (N_{short}):

  • For 2.5 ms short blocks (corresponding to N_{short} = 240 at 96 kHz, or N_{short} = 120 at 48 kHz), the boundaries of the 14 extra bands are: 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240.
  • For 1.875 ms short blocks (corresponding to N_{short} = 180 at 96 kHz, or N_{short} = 90 at 48 kHz), the boundaries of the 14 extra bands are: 74, 82, 90, 98, 106, 114, 122, 130, 138, 146, 154, 162, 168, 174, 180.

For a 48 kHz stream, only the first 2 bands (up to index 120 or 90) are decoded.
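The band layout above can be written down as a small lookup. The table values are copied from the lists above. The function name extra_bands and the flag_96k parameter are hypothetical names used only for this sketch.

```python
# Boundaries (in MDCT bins per short block) of the extra bands,
# keyed by the short block duration in milliseconds.
EXTRA_BAND_EDGES = {
    2.5: (100, 110, 120, 130, 140, 150, 160, 170,
          180, 190, 200, 210, 220, 230, 240),
    1.875: (74, 82, 90, 98, 106, 114, 122, 130,
            138, 146, 154, 162, 168, 174, 180),
}


def extra_bands(short_block_ms: float, flag_96k: bool) -> list:
    """Return (start, end) bin pairs for the extra bands to decode."""
    edges = EXTRA_BAND_EDGES[short_block_ms]
    count = 14 if flag_96k else 2  # signaled by the 96 kHz flag
    return [(edges[i], edges[i + 1]) for i in range(count)]
```

For example, extra_bands(2.5, False) yields the two bands (100, 110) and (110, 120), which end at the last bin of a 48 kHz short block.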

For these extra bands, coarse energy, fine energy, and spectral coefficients are coded in a similar sequence to standard CELT bands in [RFC6716]:

  1. An intra-frame energy flag is decoded first, using a probability of 1/8 of being 1. If set, coarse energy is intra-frame coded; otherwise, it is predicted from the previous frame. Coarse energy decoding follows standard CELT coarse energy procedures (Section 4.3.2 of [RFC6716]).
  2. Fine energy is decoded for each band using the allocated extra_quant bits (derived from the bit allocation depth described in Section 2.3).
  3. The normalized spectral coefficients are decoded from scratch using either plain PVQ or the cubic quantizer. The cubic quantizer is used instead of plain PVQ if the allocation allows for a cubic quantizer with a depth of at least 3 bits per dimension. This corresponds to a threshold of 3*N + log2(N) + 1 bits, where N is the size of the band. If fewer bits than the threshold are available, then regular PVQ is used for the band.
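The choice between plain PVQ and the cubic quantizer in step 3 reduces to a threshold test, sketched below. The name use_cubic is hypothetical, and the sketch assumes that the available allocation is expressed in whole bits rather than in 1/8 bit units.

```python
import math


def cubic_threshold_bits(n: int) -> float:
    """Threshold from step 3: 3*N + log2(N) + 1 bits for a band of size N."""
    return 3 * n + math.log2(n) + 1


def use_cubic(available_bits: float, n: int) -> bool:
    """True if the cubic quantizer is used, False for plain PVQ."""
    return available_bits >= cubic_threshold_bits(n)
```

With N=8 the threshold is 28 bits, so a band holding 27 bits would fall back to plain PVQ.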

When the cubic quantizer is selected for a band, if the block size is larger than 2.5 ms (where the block size index LM > 0) and the budget b (in bits) is greater than 2*N, the band is recursively split into two halves. An angle theta is coded to represent the energy distribution between the two halves. The resolution of theta is derived from the budget and the size of the band. The remaining budget (after coding theta) is split between the two halves: b_1 = min(b, max(0, floor((b - delta)/2))) and b_2 = b - b_1, where delta is a linear approximation of the standard CELT budget offset (described in Section 4.3.5 of [RFC6716]): delta = (N-1) * 2.875 * (theta / (pi/2) - 0.5). The two halves are then recursively quantized, with their gains scaled by cos(theta) and sin(theta) respectively.
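The budget split between the two halves can be sketched directly from the formulas above. The function name split_budget is hypothetical. The argument b is the budget remaining after theta has been coded, and theta is given in radians.

```python
import math


def split_budget(b: int, n: int, theta: float) -> tuple:
    """Split budget b (bits) of a band of size n between its two halves."""
    # Linear approximation of the standard CELT budget offset.
    delta = (n - 1) * 2.875 * (theta / (math.pi / 2) - 0.5)
    b1 = min(b, max(0, math.floor((b - delta) / 2)))
    b2 = b - b1
    return b1, b2
```

At theta = pi/4 the offset is zero and the budget is split evenly (up to rounding). As theta moves toward 0, more of the energy is in the first half (gain cos(theta)), and b_1 grows accordingly.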

All entropy decoding for these extra bands uses the extension decoder context.

2.3. Bit allocation

The allocation of the extra bit depth b is explicitly signaled one band at a time, using a resolution of 1/4 bit depth between 0 and a cap c=14. For band b_i, we use entropy coding to give a higher probability to three different cases: b_i=0, b_i=4*c, and b_i=b_{i-1}. In the case where b_{i-1} is either 0 or 4*c, we merge two of the probabilities. The ICDF for the general case is {120, 112, 70, 0}, where the first symbol means b_i=0, the second means b_i=4*c, the third means b_i=b_{i-1}, and the last symbol means that b_i is equal to 1 plus a uint value coded from 0 to 4*c-1. For b_{i-1} = 0, we use the ICDF {64, 50, 0} and for b_{i-1}=4*c, we use {110, 60, 0}, where the last symbol always means that a uint is coded. We start with b_{-1} = 0.
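The symbol-to-depth mapping can be sketched as below, with the entropy decoding itself left out. The names are hypothetical. For the two merged cases the text only fixes the meaning of the last symbol, so the order of the first two symbols (b_i=0, then b_i=4*c) is an assumption of this sketch.

```python
CAP = 14
MAX_DEPTH = 4 * CAP  # depths are expressed in 1/4 bit units

ICDF_GENERAL = (120, 112, 70, 0)
ICDF_PREV_ZERO = (64, 50, 0)
ICDF_PREV_MAX = (110, 60, 0)


def icdf_for(prev: int) -> tuple:
    """Select the ICDF table from the previous band's depth b_{i-1}."""
    if prev == 0:
        return ICDF_PREV_ZERO
    if prev == MAX_DEPTH:
        return ICDF_PREV_MAX
    return ICDF_GENERAL


def depth_from_symbol(symbol: int, prev: int, extra: int = 0) -> int:
    """Map a decoded symbol (and optional uint 'extra') to b_i."""
    if prev in (0, MAX_DEPTH):
        meanings = ("zero", "max", "explicit")  # assumed order
    else:
        meanings = ("zero", "max", "same", "explicit")
    kind = meanings[symbol]
    if kind == "zero":
        return 0
    if kind == "max":
        return MAX_DEPTH
    if kind == "same":
        return prev
    return 1 + extra  # extra is a uint in [0, 4*c - 1]
```

A decoder loop would start with prev = 0 and feed each decoded b_i back as prev for the next band.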

Given b_i, the number of extra energy bits is given by (b_i+3)/4. The number of 1/8 bits (BITRES) allocated for PVQ refinement and/or cubic codebook bits is given by ((W-1)*channels * b_i * 8 + 2)/4, where W is the number of bins in the band.
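The two formulas above can be evaluated as follows. The sketch assumes that the divisions are integer (truncating) divisions, which the text does not state explicitly, and the function names are hypothetical.

```python
def extra_energy_bits(b_i: int) -> int:
    """Extra fine-energy bits for a band with depth b_i (1/4 bit units)."""
    return (b_i + 3) // 4  # assumed integer division


def refinement_budget_eighth_bits(b_i: int, width: int, channels: int) -> int:
    """Budget, in 1/8 bit units, for PVQ refinement and/or cubic codebook."""
    return ((width - 1) * channels * b_i * 8 + 2) // 4  # assumed integer division
```

For a stereo band of W=10 bins at the maximum depth b_i=56, this gives 14 extra energy bits and 2016 eighth-bits, i.e. 252 bits, or 14 bits for each of the (W-1)*channels = 18 degrees of freedom.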

To guarantee decoder synchronization, both the encoder and decoder check the remaining bit budget in the extension payload before decoding the bit depth b_i of each band. If the remaining budget is less than 10 bits (80 in 1/8th bit units), the decoding of bit depths is terminated, and b_i is set to 0 for all remaining bands.

The dynamic bit sharing (re-balancing) mechanism for the extension payload follows the same algorithm as standard CELT (described in Section 4.3.3 of [RFC6716]), replacing the base stream variables with their extension payload equivalents.

2.4. Time-domain processing at 96 kHz

CELT includes two time-domain filter pairs that require updating for 96 kHz: the preemphasis/deemphasis filters, as well as the pitch prefilter/postfilter. The CELT deemphasis filter is currently defined as D(z)=1/(1 - a1*z^-1) for a 48 kHz signal, where a1=27853/32768. To obtain approximately the same response in the 0-20 kHz range using a sampling rate of 96 kHz, we instead use D(z)=g*(1 - b1*z^-1)/(1 - a1*z^-1), where g=5415/8192, b1=7209/32768, a1=30245/32768.
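The claim that the two deemphasis filters have approximately the same response up to 20 kHz can be checked numerically by evaluating both transfer functions on the unit circle. The sketch below does this with the coefficients given above. The function names are hypothetical.

```python
import cmath
import math


def deemphasis_48k(freq_hz: float) -> float:
    """Magnitude of D(z) = 1/(1 - a1*z^-1) at 48 kHz."""
    a1 = 27853 / 32768
    z_inv = cmath.exp(-2j * math.pi * freq_hz / 48000)
    return abs(1 / (1 - a1 * z_inv))


def deemphasis_96k(freq_hz: float) -> float:
    """Magnitude of D(z) = g*(1 - b1*z^-1)/(1 - a1*z^-1) at 96 kHz."""
    g = 5415 / 8192
    b1 = 7209 / 32768
    a1 = 30245 / 32768
    z_inv = cmath.exp(-2j * math.pi * freq_hz / 96000)
    return abs(g * (1 - b1 * z_inv) / (1 - a1 * z_inv))
```

Comparing deemphasis_96k(f) with deemphasis_48k(f) at a few frequencies between 0 and 20 kHz shows the two magnitudes agreeing to within a few percent.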

For the pitch prefilter/postfilter, we use zero-insertion upsampling of the 48 kHz filters, which results in the same frequency response below 24 kHz and a "folded" image above 24 kHz. The standard CELT postfilter in the z-domain is defined as:

  P(z) = 1/(1 - G*(g_0*z^{-T} + g_1*(z^{-T+1} + z^{-T-1}) + g_2*(z^{-T+2} + z^{-T-2})))

where T is the pitch period, G is the global postfilter gain, and g_0, g_1, g_2 are the tapset gains. For the same pitch at 96 kHz, the filter becomes:

  P(z) = 1/(1 - G*(g_0*z^{-2T} + g_1*(z^{-2T+2} + z^{-2T-2}) + g_2*(z^{-2T+4} + z^{-2T-4})))

representing symmetric taps at delays 2T-4, 2T-2, 2T, 2T+2, and 2T+4, so that the even and odd samples are each filtered as a separate sub-stream.
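Zero-insertion upsampling amounts to doubling every tap delay while keeping the coefficients unchanged. The sketch below shows this on the feedback taps of P(z). The function names are hypothetical, and the gain values used in the usage note are arbitrary placeholders, not actual CELT tapset values.

```python
def postfilter_taps_48k(T: int, G: float, g0: float, g1: float, g2: float) -> dict:
    """Map delay -> coefficient for the feedback taps of P(z) at 48 kHz."""
    return {
        T - 2: G * g2,
        T - 1: G * g1,
        T: G * g0,
        T + 1: G * g1,
        T + 2: G * g2,
    }


def upsample_taps(taps: dict) -> dict:
    """Zero-insertion upsampling by 2: every delay is doubled."""
    return {2 * delay: coeff for delay, coeff in taps.items()}
```

For example, upsample_taps(postfilter_taps_48k(100, 0.5, 0.4, 0.2, 0.1)) has taps at delays 196, 198, 200, 202, and 204, with no taps at odd delays.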

3. Format

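The extension payload is entropy-coded in the following order:

+=====================================+==================================+
| Symbol(s)                           | PDF/Description                  |
+=====================================+==================================+
| 96 kHz flag                         | {1, 1}/2                         |
| Intensity stereo                    | uint                             |
| Dual stereo                         | {1, 1}/2, only if Intensity != 0 |
| Intra coarse energy                 | {7, 1}/8                         |
| Coarse energy (high bands)          | Section 2.2                      |
| Bit allocation                      | Section 2.3                      |
| Fine energy (low bands)             | Section 2.1.1, Section 2.3       |
| PVQ refinement                      | Section 2.1.2, Section 2.1.3     |
| Fine energy (high bands)            | Section 2.1.1, Section 2.3       |
| PVQ and cubic codebook (high bands) | Section 2.1.2, Section 2.1.4     |
+-------------------------------------+----------------------------------+

Table 1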

4. Conformance

This section defines some tests for evaluating Opus HD conformance. The evaluations are based on test vectors, along with a custom-made comparison tool named qext_compare and derived from the original opus_compare tool from [RFC6716].

4.1. Decoder

For a decoder to conform to this specification, its output MUST be within the specified bounds for all testvectors when compared using qext_compare. Two sets of testvectors are provided. The first, qext_vector01.bit through qext_vector06.bit, are high-quality 1024 kb/s bitstreams for which the decoder target files are qext_vector01.f32 through qext_vector06.f32.

The testvectors can be downloaded at https://media.xiph.org/opus/ietf/opushd_testvectors.tar.gz.

Using the reference decoder, a testvector can be decoded as:

% opus_demo -d 96000 2 -f32 qext_vector01.bit qext_test01.f32

Then the output can be compared to the reference with specific thresholds:

% qext_compare -s -f32 -thresholds 0.05 0.1 0.1 \
                       qext_vector01dec.f32 qext_test01.f32

which will output "Comparison PASSED" if the tested decoder is close enough to the target output.

The second set of testvectors is meant to test corner cases. The bitstream files are qext_vector01fuzz.bit through qext_vector06fuzz.bit, and the corresponding target files are qext_vector01fuzz.f32 through qext_vector06fuzz.f32. Those are decoded in the same way as the first set of testvectors, but the comparison thresholds are looser:

% qext_compare -s -f32 -thresholds 0.1 0.5 1.0  \
                       qext_vector01decfuzz.f32 qext_test01fuzz.f32

For the tested decoder to be deemed compliant with this specification, all testvectors from both sets MUST pass.

4.2. Encoder

It is RECOMMENDED, but not mandatory, for an encoder to comply with the following criteria. Encoder testing involves encoding uncoded testvectors, decoding them with the reference decoder, and comparing the result to the original uncoded files. The test is meant to evaluate encoding at bitrates around 1 Mb/s. For example, encoding at 1024 kb/s can be done with the following command (which may be different for the encoder being tested):

% opus_demo -e audio 96000 2 1024000 -f32 -cbr -qext \
                           qext_vector01.f32 qext_test01.bit

The resulting bitstream then needs to be decoded with the reference implementation with:

% opus_demo -d 96000 2 -f32 qext_test01.bit qext_test01enc.f32

The decoded output (from the tested encoder) can then be compared against the reference original uncoded PCM:

% qext_compare -s -f32 -skip <N> \
                   -thresholds 0.1 0.5 <RMS threshold> \
                   qext_vector01.f32 qext_test01enc.f32

where the <RMS threshold> values for testvectors 1 through 6 are 5, 320, 20, 20, 40, and 5, respectively. The <N> value for the skip compensates for the encoder delay. For the "audio" encoding mode, N=624 samples. For "restricted-lowdelay", N=240 samples.

Because of the inherent limitations of objective quality evaluation metrics -- including the qext_compare tool -- it is also RECOMMENDED to perform a subjective evaluation of an encoder.

5. IANA Considerations

[Note: Until the IANA performs the actions described below, implementers should use 124 instead of 33 as the extension number.]

This document assigns ID 33 in the "Opus Extension IDs" registry created in [opus-extension] to the proposed scalable quality extension.

6. Security Considerations

This document does not add security considerations beyond those already documented in [RFC6716].

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.
[RFC6716]
Valin, JM., Vos, K., and T. Terriberry, "Definition of the Opus Audio Codec", RFC 6716, DOI 10.17487/RFC6716, <https://www.rfc-editor.org/info/rfc6716>.
[opus-extension]
Terriberry, T.B. and J.-M. Valin, "Extension Formatting for the Opus Codec (draft-ietf-mlcodec-opus-extension)".

Author's Address

Jean-Marc Valin
Google
Canada