Understanding Audio/Video Formats
By Amir Majidimehr
Note: this article originally was published in Widescreen Review. This is a revised and updated version.
Get ready to learn the most fundamental concepts in audio/video, which
sadly most people either don’t understand or get badly wrong. I can’t
blame them, as understanding these concepts properly requires a fairly
deep knowledge of how content streams are captured, encoded, and
transmitted. This is useful knowledge to have if you want to
know how many songs or movies you can store on your 16 Gigabyte
tablet or 4 Terabyte video server. It turns out that with some simple
math and basic concepts we can learn everything we need here.
Let’s start at the top. My assumption in this article is that most
of you know that there are eight bits in one byte. Okay, if you
didn’t, don’t feel bad, as even the people who are supposed to know
such as magazine writers and technical people in the field often
confuse bits and bytes. As you can imagine, with almost an order of
magnitude difference between these two terms, it is super important
to get them right.
To avoid the above confusion I rarely use the abbreviations “b” for bit
and “B” for byte, as is often done, and instead spell them out as bits
and Bytes. Since we usually deal with large numbers of these, the
prefixes Kilo, Mega, and Giga are added to represent thousands,
millions, and billions, respectively.
Since these are computer concepts, the units are often treated as
binary rather than decimal. In that usage “kilo” is actually 1024, not
1000. Throughout this article I will nonetheless be using the familiar
decimal versions of these numbers. The world won’t come to an end if we
are off by 2.4 percent at the kilo level (or about 7 percent at giga).
Digital Audio Formats
To record audio on CDs, we have to convert the analog audio signal to
digital. The “sampling rate” for this application is 44.1 KHz, which
means that we digitize the analog value 44,100 times per second
(frequency is measured in Hz, or cycles per second, which in this
situation means samples per second). CD is stereo, which means it has
two independent channels. Each audio sample in turn has 16 bits of
resolution, or two bytes. So at any instant we have four bytes of
information, and there are 44,100 such instants per second.
If we multiply 44.1 KHz (the sampling rate) by two (number of
channels), and then by 16 (bits of resolution), we get the data rate
of CD in bits per second. The result is 1,411 Kbits/sec (“kbps”) or
roughly 1.4 Mbit/sec (Mbps). Converting this to bytes, we get about
176 KBytes/sec.
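The multiplication above is easy to check in a few lines of Python. This is only a sketch: the function name is made up for illustration, and it uses the decimal prefixes adopted in this article.

```python
# Uncompressed (PCM) audio data rate, using decimal prefixes.
def pcm_bits_per_second(sample_rate_hz, channels, bits_per_sample):
    """Return the uncompressed audio data rate in bits per second."""
    return sample_rate_hz * channels * bits_per_sample

cd_bps = pcm_bits_per_second(44_100, 2, 16)
cd_kbits = cd_bps / 1_000        # Kbits/sec
cd_kbytes = cd_bps / 8 / 1_000   # KBytes/sec (eight bits per byte)

print(f"CD: {cd_kbits:.1f} Kbits/sec, {cd_kbytes:.1f} KBytes/sec")
```

The same function works for any other uncompressed format; only the three inputs change.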
When a CD drive is used for data, the industry uses 150 Kbytes/sec as
its reference baseline speed, accounting for some overhead used for
error correction and such. This has become a marketing spec for optical
media, where the speed of the drive is specified by a number followed by
“X,” with X representing this 150 Kbytes/sec. So if you see a “20X”
drive, it means that it can read at 20 * 150 Kbytes/sec, or 3 MBytes/sec. I
said this was a marketing spec because in reality the drive speed
varies depending on the track being read, so the average speed across
the entire media (e.g. when you rip a CD) is lower than this number.
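As a quick sketch, the “X” rating converts to a peak data rate like this. The constant and function names are my own, and the result is the advertised peak, not the average you would see when ripping a disc.

```python
BASE_KBYTES_PER_SEC = 150  # the "1X" reference speed for CD data

def drive_speed_mbytes(x_rating):
    """Advertised peak read speed in MBytes/sec for an 'NX' CD drive."""
    return x_rating * BASE_KBYTES_PER_SEC / 1_000

for rating in (1, 20, 52):
    print(f"{rating}X: {drive_speed_mbytes(rating):.2f} MBytes/sec")
```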
So far we are in the traditional A/V realm. Let’s jump into the new
era. Here, we are talking about things like “128 kbps MP3.” This
audio stream is exactly what it says it is. That is, the music file
is represented as 128 Kbit/sec, compressed in MP3 format. This
compares to our original CD source at 1.4 Mbit/sec.
To figure out how much the file is compressed, we simply divide the
MP3’s data rate (128 Kbits/sec or 0.128 Mbits/s) by the CD’s data
rate (1.4 Mbps) and arrive at 0.09. In other words, the MP3 file
represents only 9 percent of the data of the original source.
Perhaps more interesting is the inverse ratio, which tells us that a
whopping 91 percent of the original information has been thrown out
and you are hearing what is left! While as audiophiles we want our
music to be of higher fidelity than 128 kbps MP3, it is remarkable
how much quality is preserved relative to how few bits are contained
in that file.
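Here is the same division as a small Python sketch. The function name is hypothetical, and the CD rate is computed from the sampling parameters rather than the rounded 1.4 Mbit/sec figure, so the result differs only in the decimals.

```python
def fraction_kept(compressed_kbps, source_kbps):
    """Share of the source data rate that survives compression."""
    return compressed_kbps / source_kbps

CD_KBPS = 44.1 * 2 * 16   # 1,411.2 Kbits/sec for stereo 16-bit CD audio

kept = fraction_kept(128, CD_KBPS)
print(f"kept {kept:.0%}, discarded {1 - kept:.0%}")
```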
Keep in mind that
the sampling rate and bit rates of compressed files
are two entirely different things. A 128 Kbits/sec MP3 has the same
sampling rate as the uncompressed music. Same is true of 256
Kbit/sec MP3 and 384 Kbits/sec. They are compressed versions of the
same 44.1 KHz audio stream. So don’t make the common mistake of
talking about the bit rate of the file as sampling rate.
Back to our original uncompressed CD, if we multiply its 176
Kbytes/sec data rate by 3,600 (seconds in one hour), we get the
total space consumed for one hour of music, which is about 634 MBytes
(rounded to 650 MBytes to include overhead).
Now let’s apply the same math to the 128 Kbits/sec MP3. The 0.128 Mbit/sec must be
divided by eight to convert it to bytes and then multiplied by 3,600
to get the capacity requirements. This adds up to 57.6 MBytes/hour,
showing the remarkable saving in storage size when using “lossy”
audio compression. This is the reason that solid-state “flash”-based
music players and phones can hold so much compressed music. For a typical
three-minute song, a 128 kbps MP3 would take up 2.88 MBytes of
space. So if you have even a small 4-Gigabyte flash memory player, it can hold
4,000 / 2.88 or 1,388 songs. The same player would only hold 125
songs in the original uncompressed format of the CD.
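The storage math can be sketched as follows. The helper name and the 4,000 MByte player size are illustrative assumptions that mirror the numbers in the text, and song counts are rounded down to whole songs.

```python
def mbytes_for(kbits_per_sec, seconds):
    """Storage in MBytes for a stream at the given rate and duration."""
    return kbits_per_sec / 8 * seconds / 1_000

mp3_hour = mbytes_for(128, 3_600)      # one hour of 128 kbps MP3
mp3_song = mbytes_for(128, 180)        # three-minute song, MP3
cd_song = mbytes_for(1_411.2, 180)     # three-minute song, uncompressed CD

player_mbytes = 4_000                  # a 4-Gigabyte player, decimal units
mp3_songs = int(player_mbytes // mp3_song)
cd_songs = int(player_mbytes // cd_song)

print(f"MP3: {mp3_hour:.1f} MBytes/hour, {mp3_song:.2f} MBytes/song")
print(f"Songs on the player: {mp3_songs} as MP3, {cd_songs} uncompressed")
```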
Let’s put this in the context of audio for DVD and Blu-ray Disc. In
this application we typically have “5.1” channels of content to create a surround
experience. The notation means we have five (5) full frequency
channels and a sixth low bandwidth channel indicated by the “.1.”
Note that we don’t really have 5.1 channels as a mathematical figure
because the low-frequency channel does not equate to 10 percent of the
full bandwidth main channels. But for the sake of simplifying our life,
let’s pretend that it does use 10 percent as many bits to do its job
and use the number 5.1 just like we used 2.0 for stereo
computations.
For the sampling rate of surround music, the standard in the industry is
to deliver 48 KHz as opposed to the 44.1 KHz used in CDs. The sample
resolution can be 16, 20, or 24 bits. Assuming 20-bit samples,
the math becomes 5.1 (channels) * 48 KHz (sampling rate)
* 20 (bits of resolution) = 4,896 Kbits/sec, or about 4.9 Mbits/sec.
To figure this out for 16- and 24-bit audio samples, simply swap out
the 20 for those numbers.
If we take the uncompressed data rate of 4.9 Mbit/sec and divide it by
8 to get bytes/sec, then multiply by 3,600, we get a capacity
requirement of 2.2 GBytes/hour. A two-hour movie would then need 4.4
Gigabytes just for the audio, or about half the capacity of a
dual-layer DVD!
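To see all three sample resolutions at once, here is a short sketch. The function name is made up, and the 5.1 channel count is the simplifying assumption adopted above, not a real channel measurement.

```python
def surround_audio(bits_per_sample, channels=5.1, sample_rate_khz=48):
    """Return (Mbits/sec, GBytes/hour) for uncompressed surround audio."""
    mbits_per_sec = channels * sample_rate_khz * bits_per_sample / 1_000
    gbytes_per_hour = mbits_per_sec / 8 * 3_600 / 1_000
    return mbits_per_sec, gbytes_per_hour

for bits in (16, 20, 24):
    mbits, gbytes = surround_audio(bits)
    print(f"{bits}-bit: {mbits:.1f} Mbits/sec, {gbytes:.1f} GBytes/hour")
```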
Movies on DVD therefore are compressed using Dolby® Digital (AC-3)
compression at typical data rate of 448 kbps (or optionally using
DTS at higher data rates). Let’s compute the
compression ratio as we did with MP3. We simply repeat the same math
by dividing 0.448 (data rate of Dolby Digital) by 4.9 (data rate of
the uncompressed audio) and get 0.09. So as in the case of MP3, quite a
bit—91 percent—is thrown out in the process of compressing the
multichannel audio. DTS® Digital Surround™, in contrast, at 1.5
Mbit/sec, would represent about 30 percent of the original, or a very mild
3:1 compression ratio (although “half-rate” DTS at 750 Kbit/sec is
still applying a reasonable amount of compression at about 6.5:1).
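The three ratios can be compared side by side with a sketch like the one below. It assumes the 20-bit, 48 KHz, 5.1-channel source used in this article; the names are illustrative only.

```python
SOURCE_KBPS = 5.1 * 48 * 20   # about 4,896 Kbits/sec uncompressed

def compression_ratio(compressed_kbps):
    """How many times smaller the compressed stream is than the source."""
    return SOURCE_KBPS / compressed_kbps

for name, kbps in [("Dolby Digital", 448),
                   ("DTS full rate", 1_500),
                   ("DTS half rate", 750)]:
    ratio = compression_ratio(kbps)
    print(f"{name}: {ratio:.1f}:1 ({1 / ratio:.0%} of the source)")
```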
Let’s put things in perspective in a different way. If we divide 448
Kbits/sec by 5.1 channels, we get 88 Kbits/sec allocated to each
channel on the average. If we had a stereo track at the same rate,
it would be two (2) channels * 88 (data rate), or 176 Kbits/sec. So,
this Dolby Digital encoding has a roughly 37 percent higher per-channel
data rate than the 128 Kbits/sec MP3 (176 / 128 = 1.375). While the compression
techniques are different between the two formats, one can still see
that the 448 kbps Dolby Digital is able to “breathe more” as far as
data rate is concerned, compared to 128 kbps MP3. So in theory, this
encoding is more transparent to the source.
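A small sketch of this per-channel comparison follows. The variable names are my own, and it treats the “.1” channel as a tenth of a full channel, as assumed earlier.

```python
dolby_kbps = 448
mp3_kbps = 128

per_channel = dolby_kbps / 5.1        # Kbits/sec per channel on average
stereo_equivalent = 2 * per_channel   # what two such channels would use
headroom = stereo_equivalent / mp3_kbps - 1

print(f"Per channel: {per_channel:.0f} Kbits/sec")
print(f"Stereo equivalent: {stereo_equivalent:.0f} Kbits/sec")
print(f"Higher than 128 kbps MP3 by {headroom:.0%}")
```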
When it comes to Blu-ray disc, we have a third option: lossless
encoding. This is a process by which the audio data rate is reduced
but the full fidelity maintained. Think of it as compressing your
files on your computer and how you can get them back intact after
decompression. Dolby TrueHD and DTS-HD™ Master Audio are both
lossless surround audio formats supported optionally in Blu-ray Disc
for this purpose.
The price we pay here is that lossless compression is far less
efficient than lossy techniques such as MP3 and Dolby Digital.
Achieved compression ratios are about 2:1 for music, reaching up to
3:1 for multichannel movie sound. The efficiency becomes higher with
more channels and, non-intuitively, lower at higher sample resolutions
(e.g. 24 bits compared to 16 bits). Using a rough figure of 2.5:1, we
save a whopping 2.6 GBytes of space for our two-hour movie
(4.4 GBytes shrinks to about 1.8 GBytes).
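The saving depends on which ratio you assume, so here is a sketch that tries the range quoted above. The function name is hypothetical, and 4.4 GBytes is the two-hour, 20-bit surround track computed earlier.

```python
def lossless_savings(uncompressed_gbytes, ratio):
    """Return (compressed size, space saved) in GBytes for a given ratio."""
    compressed = uncompressed_gbytes / ratio
    return compressed, uncompressed_gbytes - compressed

for ratio in (2.0, 2.5, 3.0):
    size, saved = lossless_savings(4.4, ratio)
    print(f"{ratio}:1 -> {size:.2f} GBytes on disc, {saved:.2f} GBytes saved")
```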
Digital Video Formats
As with our audio example, we need to first understand how the analog
video signal is captured and encoded. In this case let's review how
standard definition (SD) video is encoded for broadcast applications
and DVD. Here we are talking about 720 horizontal pixels and 486
vertical pixels, which is often rounded to 720 x 480. Multiply the
two numbers and we arrive at 345,600 pixels in each frame of video.
Converting this to millions and rounding, we get 0.3 million pixels
per frame. Yes, what you are thinking is right: the resolution
of the 6 megapixel camera in your phone is a whopping 20 times higher
than SD video! But wait, are you watching that same low-resolution DVD
image on a 50-inch display or an 8-foot projection screen? Hmmm.
Movies are recorded at 24 frames per second. So in every second we
have 24 * 345,600 = 8,294,400 pixels. To compute the data rate we need
to know the number of bits allocated to each of these pixels. Computer
users are comfortable with the concept of RGB color pixels, which use
24 bits for each pixel: eight (8) bits are allocated to each of the
Red, Green, and Blue "sub-pixels."
When it comes to storage and transmission of home video, we do not use RGB
but a different scheme that separates color from the black and white portion of the video signal.
We do this because our eyes are less sensitive to
color resolution than to black and white resolution. By keeping these two
separate we can then decide how much data to allocate to each one.
The black and white sample is called “Luminance” and the color, “Chrominance.”
We shorten the former to “Luma” and the latter to “Chroma.” You
should do the same if you would like people to think you are a video
engineer :).
The notation that tells us how much data is allocated to luma and
chroma is a triplet such as 4:4:4, which in this example means that
the color and black and white samples have the same
resolution. For the reason just explained, this is not a
very common format, even in the broadcast world. The most often
used format there is 4:2:2, which means that we use half as much bandwidth
for color. As a result, a sharp transition
between two colors will look softer than the same transition
between two shades of gray.
The format used for distribution of content to
consumers, whether it is over digital broadcast, optical disc (SD or
HD), or the Internet/IPTV, is 4:2:0. This means that we
have a quarter of the resolution for color, as compared to black and
white. This is a fair bit of compromise in color fidelity, but
given that you probably did not know this and still
enjoyed high definition movies at home, the eye sensitivity
factor is doing its job.
With me so far? Good, because we are not done yet. We still don’t
know how many bits each color or black and white sample occupies. In
the computer RGB world, each color component typically has eight bits. In
broadcast/professional video, we use either 10 or 8 bits with the
former being preferred. For delivery to consumers the 8-bit format
is the only one used.
In the case of 4:2:0 video encoding used in
DVD, Blu-ray and Internet delivery then, each video sample takes 12
bits on the average, 8 bits for Luma and average of 4 bits for
Chroma. The color samples in reality are eight (8) bits each but since they are spaced out, they
average to 4 bits.
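The averaging works the same way for all three schemes, which a short sketch makes visible. The table and function names are my own; the fractions are the share of full resolution kept by each of the two chroma components.

```python
# Fraction of full resolution kept by EACH of the two chroma components.
CHROMA_FRACTION = {"4:4:4": 1.0, "4:2:2": 0.5, "4:2:0": 0.25}

def average_bits_per_pixel(scheme, bits_per_sample=8):
    """Average bits per pixel: one luma sample plus two subsampled chroma."""
    luma = bits_per_sample
    chroma = 2 * bits_per_sample * CHROMA_FRACTION[scheme]
    return luma + chroma

for scheme in CHROMA_FRACTION:
    print(f"{scheme}: {average_bits_per_pixel(scheme):.0f} bits per pixel")
```

Passing `bits_per_sample=10` gives the figures for the 10-bit professional formats mentioned above.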
Now we are ready to compute the data rate of our movie stream.
Multiplying 8,294,400 pixels in an SD video by 12 bits of
resolution and rounding, we get 100 Mbits/sec. To put this in
context, the United States ATSC digital broadcast standard for
high-definition video provides for 19 Mbits/sec, yet we just learned that
standard-definition video in the uncompressed domain takes up more
than five times as much to transmit! At the risk of stating the
obvious, we are talking about much bigger numbers than audio.
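The SD calculation, sketched in Python with a made-up function name. The 19 Mbits/sec figure is the ATSC rate quoted above.

```python
def video_bits_per_second(width, height, fps, bits_per_pixel):
    """Uncompressed video data rate in bits per second."""
    return width * height * fps * bits_per_pixel

sd_bps = video_bits_per_second(720, 480, 24, 12)   # 4:2:0, 8-bit samples
atsc_bps = 19_000_000                               # ATSC broadcast channel

print(f"SD uncompressed: {sd_bps / 1e6:.1f} Mbits/sec")
print(f"That is {sd_bps / atsc_bps:.1f} times the ATSC rate")
```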
Speaking of high-definition (HD) video, let’s compute its data rate. The
highest approved resolution for today’s HD home delivery of video
is 1920 x 1080. This is called either 1080i or 1080p
depending on whether each frame of video is sent as half a frame in
each instance or full frame (interlaced or progressive).
Movies are stored progressively so let's assume that for simplicity.
Using the newly learned
math, we get 2,073,600 pixels per frame or about 2 million.
Comparing that to SD video, we see that we have six (6) times more
resolution. This is a pretty big step up but at two megapixels it
still pales in comparison to even the cheapest digital still camera.
Continuing on with our math homework we arrive at 597 Mbits/sec for
the total data rate of this 1080, 4:2:0 source at 24 frames/second.
Converting this to bytes gets us 75 Megabytes/sec. Therefore, a
two-hour movie (7,200 seconds) takes about 540 GBytes of storage if uncompressed! Now
contrast this with just 9 GBytes available in DVD, and 25/50 Gigabytes in Blu-ray Disc
(single or double layer), and you see that compression has to be
your friend or we would never be able to deliver such high-resolution
video to you at home. And unfortunately lossless video compression need not
apply, as we are way past the 2:1 or 3:1 that can be provided there.
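Here is a sketch of the HD numbers without intermediate rounding. The function name is illustrative, and the disc capacities are the ones quoted above. Computed this way, a two-hour movie comes to roughly 537 GBytes uncompressed.

```python
def video_bits_per_second(width, height, fps, bits_per_pixel):
    """Uncompressed video data rate in bits per second."""
    return width * height * fps * bits_per_pixel

hd_bps = video_bits_per_second(1920, 1080, 24, 12)   # 1080p/24, 4:2:0
hd_mbytes_per_sec = hd_bps / 8 / 1e6
two_hour_gbytes = hd_bps / 8 * 7_200 / 1e9           # 7,200 seconds

print(f"{hd_bps / 1e6:.0f} Mbits/sec, {hd_mbytes_per_sec:.1f} MBytes/sec")
print(f"Two-hour movie: {two_hour_gbytes:.0f} GBytes uncompressed")
for disc, capacity in [("DVD", 9), ("Blu-ray single layer", 25),
                       ("Blu-ray double layer", 50)]:
    print(f"{disc}: needs {two_hour_gbytes / capacity:.0f}:1 to fit")
```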
Once compressed, a typical movie encoded for DVD uses an average of
about 5 Mbits/sec using MPEG-2 compression. Dividing this by the
source data rate of 100 Mbits/sec we get 20:1 compression or 5
percent of the source data rate is represented in the final output!
We should be thankful that video compression is as effective as it
is.
Internet delivery of content uses more advanced video compression
technology, such as VC-1/WMV-9 or MPEG-4 AVC. Unfortunately, the added
efficiency is used to lower the data rate, often severely, rather than
to improve fidelity. Typical bit rates used for SD video delivery
may be around 2 Mbits/sec, representing 50:1 compression. In this
case, an incredible 98 percent of the source is thrown away!
Achieved video fidelity often falls short of DVD.
For high definition video on Blu-ray Disc, there is wide variation
in data rates and movie sizes due to much larger disc capacity and
the amount of auxiliary content that may exist.
As a decent guess let’s use an average data rate of 20 Mbits/sec.
This translates to about 3 percent of the uncompressed 1080p/24 video source
(20/600), or roughly 30:1 compression. As with DVD though, pretty good quality can be
achieved despite such high levels of compression due to the tremendous
amount of redundancy in video. Imagine a person walking in
front of a house. The house never moves, so we can transmit
it once and keep repeating it.
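The video compression ratios for DVD, Internet SD, and Blu-ray Disc can be lined up with a sketch like this one. The function name is hypothetical, and the delivered rates are the typical averages assumed in the text, not measurements of any particular title.

```python
def ratio_and_percent(source_mbps, delivered_mbps):
    """Return (compression ratio, percent of the source that remains)."""
    ratio = source_mbps / delivered_mbps
    return ratio, 100 / ratio

cases = [("DVD, MPEG-2", 100, 5),
         ("Internet SD", 100, 2),
         ("Blu-ray 1080p/24", 597, 20)]

for name, source, delivered in cases:
    ratio, percent = ratio_and_percent(source, delivered)
    print(f"{name}: {ratio:.0f}:1, {percent:.1f} percent of the source kept")
```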
Now let’s look at some communication speeds in order to figure out
what quality we can fit in those channels. A 1.5 Mbit/sec DSL broadband Internet connection has a
maximum data rate of what it states: 1.5 Mbits/sec. Recall that uncompressed SD
video has a data rate of 100 Mbits/sec, or about 66 times higher. This means
that a two-hour movie would take about 132 hours to download without
compression. At the 5 Mbits/sec compressed data rate used for DVD
video, the time drops immensely to a little under seven hours. Using advanced video
compression at 1.5 Mbits/sec, the movie could be downloaded in the
time it takes to play (two hours in our example), which also allows us to
"stream" it in real time and watch the video without storing it.
1080 high definition video as encoded on Blu-ray Disc at 20 Mbit/sec
average would overwhelm most broadband connections today,
especially since it has unpredictable peaks as high as 48 Mbits/sec. The
reduced rates used for HD video on the Internet are a sobering consequence of this.
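Download time is just the ratio of the stream rate to the link rate, multiplied by the running time. The sketch below uses a function name of my own and assumes the link delivers its full rated speed with no protocol overhead, which real connections rarely do. Without rounding the ratio to 66, the uncompressed case comes to about 133 hours.

```python
def download_hours(stream_mbps, link_mbps, movie_hours=2):
    """Hours needed to download a movie encoded at stream_mbps."""
    return movie_hours * stream_mbps / link_mbps

LINK_MBPS = 1.5   # the DSL connection used as the example

for name, rate in [("Uncompressed SD", 100), ("DVD MPEG-2", 5),
                   ("Advanced codec SD", 1.5), ("Blu-ray average", 20)]:
    hours = download_hours(rate, LINK_MBPS)
    print(f"{name} at {rate} Mbits/sec: {hours:.1f} hours")
```

Any result above the two-hour running time means the stream cannot be watched in real time over that link.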
So, there you have it. Amazing what you can do with “elementary
math,” no?
Further reading: Digital Audio/Video and Communication Rates