Lossless Audio Compression
By Amir Majidimehr
This article is about the basics of lossless audio
compression. Before we get into this topic, let’s first look at what
uncompressed audio looks like. Here, we are talking about digital samples that are N
bits wide, in X channels, running at a certain
sampling frequency. For CD audio, each sample is 16 bits or two
bytes, stereo (two channels), at 44,100 samples/sec. Multiply all
of this and you get about 1.4 Megabits per second. Converted to
bytes by dividing by 8, we get about 176 Kilobytes/sec. So a typical
3-minute song takes up about 32 Megabytes of storage without any
form of compression. This may not seem like much, but multiply this
by three to get six channels and make it a 90-minute movie
soundtrack and you need close to 3 Gigabytes, well over half of a
single-layer DVD, just for a single audio track!
So some form of compression is handy to have.
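To make the arithmetic concrete, here is a small Python sketch that recomputes the figures from the CD parameters given above. The variable names are mine, and the movie case simply assumes six channels at the same CD sample format for 90 minutes.

```python
# Uncompressed PCM data rate for CD audio, using the figures from the text.
BITS_PER_SAMPLE = 16
CHANNELS = 2
SAMPLE_RATE = 44_100  # samples per second, per channel

bits_per_second = BITS_PER_SAMPLE * CHANNELS * SAMPLE_RATE
bytes_per_second = bits_per_second // 8

song_bytes = bytes_per_second * 3 * 60        # 3-minute stereo song
movie_bytes = bytes_per_second * 3 * 90 * 60  # six channels, 90 minutes

print(bits_per_second)   # 1411200
print(bytes_per_second)  # 176400
print(song_bytes)        # 31752000
print(movie_bytes)       # 2857680000
```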
Our first tool for dealing with such a large amount of data is lossy
compression. “Perceptual coding” is the technique used in lossy
compression whereby we model the human ear and based on its
characteristics are able to sharply reduce the file size with
comparatively little loss of fidelity. For example, 128 kbps
MP3/AAC/WMA represents an 11:1 compression (1.4/0.128). Inverted,
only 9% of the data is kept and the rest is discarded! No matter how
much you may be put off by the notion of lossy compression, you
have to admire how well it operates given so little data.
I am hoping that you are already familiar with the concept of lossless
compression of data. Just about any program you download from the
Internet uses lossless compression to reduce its size and hence,
speed delivery to your computer. The most common utility that performs
this function is “zip.” How does it work? Simply put, instead of
representing every piece of data with a fixed number of bits (e.g.
8 bits for each of the characters in this article), fewer bits
are allocated to the most common values (e.g. vowels in the English
language) than to less frequently used values (e.g. the letter X in
English). Since by definition there are more of the common values,
representing them with fewer bits yields a significant level of
compression. Actual algorithms are more complex than this but you
get the idea.
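One classic way to hand out short codes to common values is Huffman coding. Naming it is my addition (the text above does not name a specific method), and zip itself combines it with other tricks. The following is a minimal sketch that only computes how many bits each character would get, not a full compressor; the sample sentence is made up for illustration.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    counts = Counter(text)
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {sym: 1 for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two rarest groups; every symbol in them gets one bit deeper.
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

text = "lossless audio compression keeps every sample intact"
counts = Counter(text)
lengths = huffman_code_lengths(text)

fixed_bits = 8 * len(text)                                  # 8 bits per character
huffman_bits = sum(counts[s] * lengths[s] for s in counts)  # variable-length codes
print(fixed_bits, huffman_bits)
```

Frequent characters such as "s" end up with codes no longer than those of rare ones such as "y", which is where the saving comes from.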
Alas, zipping techniques do not work well with audio. Indeed, if you
attempt to zip an audio file, it barely shrinks and its size can
even grow! The reason is that audio, at first blush, is a highly
incompressible type of content. The digital samples from a single
instrument represent a complex waveform. Combine multiple
instruments and vocals and you have something that appears to be
a totally random set of numbers with little to no redundancy to
remove.
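You can see the effect with Python's zlib module, which implements the same DEFLATE method that zip relies on. In this sketch I use seeded random bytes as a stand-in for noise-like samples. That is an assumption for illustration only, since real recordings are not truly random. A repeating ramp is included for contrast.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(65536))
ramp = bytes(i % 256 for i in range(65536))  # highly repetitive, for contrast

noise_packed = zlib.compress(noise, 9)
ramp_packed = zlib.compress(ramp, 9)

print(len(noise), len(noise_packed))  # packed size is not smaller
print(len(ramp), len(ramp_packed))    # packed size is far smaller
```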
Fortunately by applying some simple mathematics we can extract
redundancy that is hidden in the samples. Take the following audio
samples for example: 3, 7, 11, 15. If you feed this to the zip program,
it sees that the numbers are all different and gives up compressing
them. But if we look carefully, we see that each sample is made up
of the previous sample plus 4. So instead of storing four numbers,
we could simply store the first one and tell the decoder to keep
adding 4 to each sample to get the next one in the sequence. In this
sense, we would need only two numbers: the initial number “3” and
the differential of “4.” The decoder can synthesize the rest of the
numbers, giving us a 2:1 compression ratio (four numbers becoming
two).
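The idea fits in a few lines of Python. This is only a sketch of the toy example above: the function names are mine, and it assumes the decoder is also told how many samples to generate.

```python
def encode_line(samples):
    """Return (first, step) if the samples lie exactly on a line, else None."""
    first, step = samples[0], samples[1] - samples[0]
    for i, s in enumerate(samples):
        if s != first + i * step:
            return None
    return first, step

def decode_line(first, step, count):
    """Rebuild the samples by repeatedly adding the step."""
    return [first + i * step for i in range(count)]

first, step = encode_line([3, 7, 11, 15])
print(first, step)                 # 3 4
print(decode_line(first, step, 4)) # [3, 7, 11, 15]
```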
The above is a trivial example of what we call “Linear Predictive
Coding” or LPC for short. A pretty fancy signal processing term, to be
sure but fortunately for us, it has its roots in simple algebra.
Linear means a set of samples that follow a line. Prediction means
that we use the past samples to predict the future ones. You can see
how I used both of these aspects in my above example. I assumed the
samples were on a line and that the only thing that separated them
was an offset. If you remember your college math, this is a form of
“curve fitting.” I am trying to find a curve (a line in this
scenario) that matches the sequence of numbers.
Of course I cheated in my example by assuming the decoder already
knew the shape of the line and that the samples kept going that way
forever. Real life is much more complex than that. Numbers may
follow a line more or less but not precisely. In the above example,
the samples could instead be 3, 9, 11, and 14. In this case, the lossless
compressor still pretends that they line up perfectly. But it also
keeps track of the “error” from the perfect line. It would still
generate the initial value “3” and increment of “4” but it also has
to transmit the error that would be generated by following the
straight line. The “residual error” in this example would be “2” for
the second sample, “0” for the third, and “-1” for the last.
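Here is a minimal Python sketch of the prediction-plus-residual idea. I use the samples 3, 9, 11, and 14, picked so that they miss the ideal line 3, 7, 11, 15 by 2, 0, and -1. The function names are my own, and the line parameters are simply handed in rather than estimated.

```python
def encode_with_residuals(samples, first, step):
    """Residual = actual sample minus the value predicted by the line."""
    predicted = [first + i * step for i in range(len(samples))]
    return [s - p for s, p in zip(samples, predicted)]

def decode_with_residuals(first, step, residuals):
    """Prediction plus residual gives back the exact original sample."""
    return [first + i * step + r for i, r in enumerate(residuals)]

samples = [3, 9, 11, 14]
residuals = encode_with_residuals(samples, first=3, step=4)
print(residuals)                               # [0, 2, 0, -1]
print(decode_with_residuals(3, 4, residuals))  # [3, 9, 11, 14]
```

The residuals are much smaller numbers than the samples themselves, and the round trip is exact, which is what makes the scheme lossless.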
The residual error values must be transmitted to the receiver
efficiently. Fortunately, a technique called “Rice coding” (honest,
that is what it is called!) can be used to compress those error
values. The reason is that, mathematically, we can show that
error values have a favorable distribution: most of them are small
and cluster around zero, so they can be coded with very few bits.
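For the curious, here is a rough sketch of Rice coding in Python. It assumes a fixed parameter k and a "zigzag" step to fold negative residuals into non-negative numbers. Both choices are mine for illustration; real codecs pick k adaptively and pack actual bits rather than building strings of "0" and "1".

```python
def zigzag(n):
    """Map signed values to non-negative ones: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(u):
    return u // 2 if u % 2 == 0 else -(u + 1) // 2

def rice_encode(values, k):
    bits = []
    for n in values:
        u = zigzag(n)
        q, r = u >> k, u & ((1 << k) - 1)
        bits.append("1" * q + "0")            # quotient in unary
        if k:
            bits.append(format(r, f"0{k}b"))  # remainder in k binary bits
    return "".join(bits)

def rice_decode(bits, k, count):
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                              # skip the terminating "0"
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append(unzigzag((q << k) | r))
    return values

coded = rice_encode([2, 0, -1], k=1)
print(coded)                        # 11000001
print(rice_decode(coded, 1, 3))     # [2, 0, -1]
```

Three residuals fit in 8 bits here, versus 48 bits if each were stored as a 16-bit sample. Small values get short codes, which is exactly what a distribution clustered around zero rewards.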
Predicting the shape of the line (i.e. the formula that tells us the
approximate value of the next sample) is more complex yet. Different
lossless schemes use varying techniques for arriving at what formula
best represents future samples. The encoder may try multiple
permutations, analyzing each iteration to see if it was more or less
efficient. This makes the encoder slower but fortunately, computers
have gotten so fast these days that the computational complexity is
not a major concern. And outside of live broadcast, we can encode
once and be done with it.
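As a rough illustration of "try several and keep the best," the sketch below runs three simple fixed predictors over a block and picks the one with the smallest total residual. The predictors, the cost measure, and the sample values are all assumptions of mine, not the method of any particular codec. The cost is deliberately crude: it ignores the warm-up samples that higher orders must store as-is.

```python
# Each predictor guesses the next sample from the history h seen so far.
PREDICTORS = {
    0: lambda h: 0,                 # no prediction: residual is the sample itself
    1: lambda h: h[-1],             # "same as the last sample"
    2: lambda h: 2 * h[-1] - h[-2], # "continue the straight line"
}

def residuals_for(samples, order):
    predict = PREDICTORS[order]
    return [samples[i] - predict(samples[:i]) for i in range(order, len(samples))]

def best_order(samples):
    """Try every predictor and return (winning order, cost per order)."""
    costs = {o: sum(abs(r) for r in residuals_for(samples, o)) for o in PREDICTORS}
    return min(costs, key=costs.get), costs

def reconstruct(warmup, residuals, order):
    """Decoder side: start from the warm-up samples and add back residuals."""
    out = list(warmup)
    for r in residuals:
        out.append(PREDICTORS[order](out) + r)
    return out

samples = [3, 9, 11, 14, 18, 23, 27, 30]
order, costs = best_order(samples)
print(order, costs)  # 2 {0: 135, 1: 27, 2: 9}
print(reconstruct(samples[:order], residuals_for(samples, order), order) == samples)  # True
```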
But wait, there is more! We have another powerful tool to apply in
cases where there is more than one channel of audio. Listen to a
typical audio track and you notice the same frequencies often coming
out of both speakers (1960s Beatles music excepted). A lossless
encoder can divide the spectrum into two or more bands and isolate
the mid and low frequency components that tend to be more shared
between the two channels. After this division, it can apply
different techniques to reduce the data rate such as subtracting the
common signal from both channels.
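A simpler cousin of this idea can be sketched in Python: instead of splitting into frequency bands, it stores the average ("mid") and the difference ("side") of the two channels sample by sample. This is my simplification, not the band-splitting approach described above, and the sample values are invented. When the channels are similar, the side values are small and cheap to code.

```python
def to_mid_side(left, right):
    mid = [(l + r) >> 1 for l, r in zip(left, right)]   # average, rounded down
    side = [l - r for l, r in zip(left, right)]         # difference
    return mid, side

def from_mid_side(mid, side):
    left, right = [], []
    for m, s in zip(mid, side):
        # The sum and the difference share the same low bit, so the bit
        # dropped by ">> 1" can be restored from the side value.
        total = (m << 1) | (s & 1)
        left.append((total + s) >> 1)
        right.append((total - s) >> 1)
    return left, right

left = [100, 102, 105, 103]
right = [98, 101, 105, 104]
mid, side = to_mid_side(left, right)
print(mid)   # [99, 101, 105, 103]
print(side)  # [2, 1, 0, -1]
print(from_mid_side(mid, side) == (left, right))  # True
```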
Movie tracks provide even more opportunity for elimination of
redundancy as the rear channels are often quiet without much sound
in them. For this reason, lossless 5.1 channel codecs can achieve
compression efficiencies that are much higher than stereo. For
example, Dolby TrueHD, used in the Blu-ray Disc format, when run in 6-channel,
16-bit/48 kHz mode can achieve better than 3:1 compression
whereas the best 2-channel lossless codec can rarely exceed 2:1
compression.
Of note, while there are a number of different lossless codecs, what
separates them is only a single-digit percentage difference in
compression efficiency. The best may shrink an audio file to 55% of
its original size and the worst to 60%. Unfortunately that extra 5%
may require more work on the part of the encoder, making it slower.
To that point, WMA Lossless is
at the high end of the efficiency curve and FLAC at the low end.
Computationally, the roles are reversed, with FLAC being faster.
Again, on a PC this is not material as computers are plenty fast at
either task. So your choice is determined more by which hardware or
software supports a specific codec than by the codec itself
(unlike lossy codecs, which do sound different from each other).
By the way, despite folklore on the Internet, lossless audio codecs
do not change the sound, as they are mathematically guaranteed to
reproduce the original data stream exactly. Nothing is gained or lost. That
said, playing a lossless track can sound different on a PC than the
original. Why? Well, that is the topic for another article.
Last but not least, note that the data rate for a lossless audio (or
video) codec is NEVER fixed. A lossy codec like MP3 or Dolby AC-3
can force the data rate to be fixed by varying quality. A lossless
codec by definition cannot vary quality. So as a result, it has no
choice but to let the data rate spike as it wants when it sees a
complex waveform it cannot shrink. As a rule, the spikes never
exceed the rate of the uncompressed stream, as the codec can simply
choose to pass the original data through, as opposed to trying to
“compress” it and making the data set bigger instead (the problem of
zip expanding the size is avoided). So if you are streaming audio
around your home and are using lossless compression, you need to
plan for the full data rate of the source even though you are
benefiting from lossless compression in the actual amount of data
transferred. For CD audio, for example, this peak will always be
about 1.4 Mbits/sec.
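The pass-through trick is easy to sketch in Python. Here zlib stands in for the real audio compressor, and the one-byte marker in front of each block is a made-up framing of mine, so treat this as an illustration of the idea rather than how any actual codec lays out its stream.

```python
import random
import zlib

RAW, PACKED = b"\x00", b"\x01"  # hypothetical one-byte block markers

def encode_block(block: bytes) -> bytes:
    """Keep the compressed form only if it is actually smaller."""
    packed = zlib.compress(block)
    if len(packed) < len(block):
        return PACKED + packed
    return RAW + block          # pass the original data through untouched

def decode_block(data: bytes) -> bytes:
    body = data[1:]
    return zlib.decompress(body) if data[:1] == PACKED else body

rng = random.Random(1)
quiet = bytes(4096)                                     # digital silence
busy = bytes(rng.getrandbits(8) for _ in range(4096))   # noise-like block

print(len(encode_block(quiet)))  # a tiny fraction of 4096
print(len(encode_block(busy)))   # 4097: the raw block plus the marker
```

The quiet block shrinks to almost nothing while the busy one costs essentially the full uncompressed size, which is why the network has to be sized for the worst case.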