Wednesday, September 12, 2012

Opus, the codec to end all codecs

Opus is a new all-purpose low-latency audio codec with better quality than MP3 and better than Vorbis at low bitrates. Plus it's an open standard, open source, and patent-free! Not only that, it operates over bitrates from 6 kbps (for low-quality speech) up to 510 kpbs. Thus it is suitable to replace virtually every audio codec in the world!* As a bonus, it's an official IETF standard and it has been chosen a mandatory-to-implement (MTI) codec for WebRTC, an upcoming standard for real-time communication on the web.

Well, I had a look at the RFC. The technical details are largely impenetrable… for example the introduction to Range Coding should really say something about what fl[k], fh[k], and ft represent (and what’s [k], an array subscript?), and they really should not have waited until section 4.1.3.3 to introduce the concept of a Probability Density Function, since it appears to be a fundamental idea...

...Anyway, Opus is basically two codecs in one: a linear prediction (LP) codec for speech (based on a standalone codec called ‘SILK’) and a MDCT-based codec for music (based on a standalone codec called ‘CELT’). An Opus encoder can instantly and seamlessly switch between the two codecs at any time according to the bitrate and the dominant character of the input, and there is a hybrid mode where SILK is used for low frequencies and CELT for high frequencies.

CELT is built on the same psychoacoustic principles as other modern codecs (e.g. Vorbis), but adds innovations to permit low latency without hurting overall audio quality. SILK is designed for low bandwidth and low-latency speech (though the principles behind it remain totally opaque to me), while CELT is better for high bandwidth audio (music or speech or anything else) but can still operate at very low bitrates when the sound is not speech or speech-like and therefore unsuitable for SILK.

The operating modes of Opus are designed so that the two codecs “fit together” nicely, and it can switch between the two on 10ms boundaries (sampling range can similarly change among 8/12/16/24/48 kHz, ditto for mono/stereo). Opus also offers other real-time and VOIP features, such as CBR and forward error correction and other tricks to compensate for packet loss.

Some final niceties of Opus: the encoder and decoder can operate at independent sampling rates (or one can be mono and the other stereo), and the (continuously variable) bitrate is independent of the operating mode (e.g. want stereo at 48 kHz at 8 kbps? It’s probably a bad idea, but it’s not outright prohibited). So basically Opus covers virtually all scenarios. The codec to end all codecs, indeed. (correct me if I’m wrong about any of this, dear experts).

Now can I just get a codec that achieves very low bitrates by observing long-term audio patterns and making backward references to similar audio many seconds in the past? kthxbye :) Codec2 (presentation) is really fascinating too. It was developed for radio applications, and is not really useful for VOIP because Codec2 packets are much smaller than UDP headers, i.e. the overhead of the Internet itself outweighs the space savings of Codec2. But what I find interesting about it is the potential for cheap archiving. For instance, with Codec2 you could record your entire life: at 2400 bits/second it requires just 25.9 MB (non-MiB) per day and 9.46 GB per year or under 1 TB for a lifetime (or much less, assuming you eliminate regions of silence and reduce the bitrate in nearly-silent situations.) Coupled with voice recognition and a search index, you'd finally be able to resolve any arguments with your wife about who said what last week. I bet some sociologists and linguists would find interesting uses for such a corpus...