Digital Audio File
Formats Explained
By NICHOLAS VINEN
In the digital world, there are a lot more ways to store and transmit
audio data than there are analog forms. The differences, advantages
and disadvantages of these formats are not obvious. Here are some
details on the various formats and their differences.
PCM, MP3, WAV, FLAC, AAC, OGG, WMA – these are all pretty cryptic names for common audio file formats. So if you want to store audio on your computer, phone etc, which format should you use and why? It depends on a number of factors as each format comprises a different set of compromises.
Digital audio basics

All digital audio ultimately starts as an analog signal (eg, from a microphone) and also ends up as an analog signal; typically, going to an amplifier to drive a set of speakers or headphones. An analog-to-digital converter (ADC) converts the original analog signal to a digital format while at the other end of the chain, a digital-to-analog converter (DAC) turns it back into analog.

20 Silicon Chip

Fig.1: how a 1kHz analog sinewave (red) is converted to PCM values (sequence of green numbers). This shows 32 voltage steps when a real PCM waveform normally uses at least 65,536. The voltage is sampled at a constant rate (here, 44.1kHz) and the nearest value for the voltage at that point (blue dots) is stored as the next number in the sequence. (Axes: sample value 0-32 and input signal voltage -3V to +3V, against sample number 0-44 and time 0us-1ms at 44.1kHz.)

The digital output of the ADC is usually some form of Pulse-Code
Modulation (PCM) and this can be
regarded as the most basic form of
digital audio. Two common examples
of digital audio formats based on PCM
are the Microsoft WAV file format and
CD audio.
Fig.1 shows how a continuously
varying analog signal (red) is converted to a set of data points (blue).
The horizontal axis is the timebase and
the vertical lines represent sampling
periods which occur at fixed intervals.
The number of such sampling intervals
per second is known as the sampling
rate and is typically at least 44.1kHz
for good-quality sound reproduction.
The vertical axis represents voltage and this too is divided up into a number of equal intervals. For CD audio, there are 65,536 such intervals representing a total voltage of about 6V, to handle signals up to about 2.1V RMS. Each interval therefore covers a range of about 0.1mV. The number of voltage steps is the resolution and since 65,536 = 2^16, this is known as 16-bit audio (note: Fig.1 has far fewer steps for the purposes of illustration).
At each sampling period, the analog-to-digital converter finds the horizontal line closest to the input signal voltage at that time and since each line is numbered, by storing this series of numbers, we form a digital approximation of the waveform shape (blue dots).

Fig.2: compressing digital audio adds an extra step into both recording and playback. Once analog data has been digitised (and possibly mixed), it passes through an encoder which produces an output bitstream at a lower rate than its input. This can then be more easily transmitted or stored. Later, before being played back, the data passes through a decoder which reconstructs the PCM audio data (or a close facsimile) to be sent to the DAC. (Signal chain: left/right inputs → ADCs → I2S [PCM or DSD, 1.5-9.2Mbps] → 'lossy' or 'lossless' audio encoder [software/hardware] → MP3, FLAC, AAC etc [0.1-5Mbps] → storage medium [hard disk/flash] and/or broadcast → decoder → I2S → DACs → left/right outputs. If a 'lossy' CODEC is used, some information is lost in the encoding stage and the output waveform is not identical to the input waveform.)
The storage required for this data is
the product of the number of bits of
vertical resolution, the sampling rate
and the number of channels. So for CD
quality audio this is 16 x 44,100 x 2 =
1.4Mbit/s or 172.266KB/s.
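For readers who like to check the arithmetic, the calculation can be expressed in a few lines of Python (illustrative only):

```python
bit_depth = 16                         # bits per sample
sample_rate = 44_100                   # samples per second
channels = 2                           # stereo

# PCM data rate = resolution x sampling rate x number of channels
bits_per_second = bit_depth * sample_rate * channels
print(bits_per_second)                         # 1411200, ie, ~1.4Mbit/s
print(round(bits_per_second / 8 / 1024, 3))    # 172.266KB/s (binary kilobytes)
```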
Why 44.1kHz? This frequency was
chosen for historical reasons as it
divides evenly into PAL and NTSC
frequencies, allowing easy synchronisation with video tape. The important
thing is that it’s more than twice the
highest frequency humans can hear
(~20kHz) so the Nyquist limit is high
enough (more on that later) and the
anti-aliasing filter has a sufficiently
large 4.1kHz transition band.
At this point, you may have noticed
that some of the blue dots in Fig.1 are
not precisely on the red curve due to
the limited voltage and time resolution. As a result, if you drew a line
through these points, it would not be
a pure sinewave like the input.
But the DAC has a ‘reconstruction
filter’ which smooths the output to give
a result very close to the input signal
despite this, as long as the input only
contains frequencies up to the Nyquist
limit, which is half the sampling rate.
In this case, that means frequencies
up to about 22kHz are reconstructed
almost perfectly.
Experience tells us that with a good-quality ADC and DAC, most people can't hear any imperfections in CD-quality audio recordings. There are arguments to be made that higher sampling rates and bit depths (eg, 96kHz and 24-bit) have certain advantages, hence formats such as DVD-Audio and SACD, as described below.
However, there is a further issue
to consider and that is that 172KB/s
of data results in quite large storage
requirements. A typical CD holds up
to 72 minutes of audio with a 720MB
capacity. If you have a 32GB SD card,
that would hold about 40 full CDs
worth of audio; more if the CDs weren’t
entirely full (but some CDs can contain
up to 80 minutes/800MB).
In order to fit more recordings into
the same amount of space, newer formats which require less storage have
been developed in the last 20 years.
With those formats which gain the most
dramatic reductions though, small
errors are introduced in the reconstruction process. These are known as
‘lossy’ compression formats, because
there is some loss in audio quality.
However, ‘lossless’ audio compression
can also be used, to reduce storage
requirements modestly without this
drawback.
Fig.2 shows a typical audio chain,
from an analog recording (eg, an old
tape master) through to a digital stream
which is then compressed, stored and/
or transmitted, then decompressed
and converted back into analog audio
for amplification.
Fig.3: at top is a 90ms snippet of audio from a music CD. Below this we have copied and pasted each cycle over the subsequent cycle as a crude method of 'predicting' the shape of the audio based on previous samples. The difference, at bottom, shows that the error ('residual') with even this crude prediction method has significantly less amplitude than the audio signal itself and thus requires less storage space.
August 2014 21

Fig.4: an example of quantisation, which can be applied to residuals or other less important signals to reduce the amount of storage space they require. The number of vertical divisions is reduced; in this case from 32 to 8 and the nearest points are selected instead. The resulting quantisation error is shown at bottom. Normally, though, this is applied to spectral (frequency domain) data rather than time domain data as the deleterious effects are less severe.
So just how much does lossy compression affect the sound and are you
willing to put up with that in exchange
for fitting more audio? And if you are
very fussy, just how much audio can
you fit into a limited amount of space?
Lossless compression
There are a few common ‘lossless’
audio CODECs (encoders/decoders)
available. Perhaps the most common
is called FLAC, or Free Lossless Audio
Codec. Like most lossless CODECs,
FLAC typically achieves a 40-45%
reduction in storage requirement over
PCM without affecting audio quality
at all. Other similar CODECs include
Apple Lossless, WavPack, TAK and
APE but the differences are minor.
If you aren’t interested in the details
of how this is achieved, you can skip
right ahead to the next section.
Many readers will be familiar with
compression programs such as “ZIP”
and “RAR” which can reduce the size
of many types of file, sometimes quite
substantially. Unfortunately, if you try
to ZIP up a PCM audio file (eg, WAV format), you will save little if any space.
That’s because despite PCM audio
containing quite a bit of redundancy,
the general-purpose pattern matching
algorithms used in archiving software
such as ZIP are not effective at identifying and eliminating it.
To realise lossless compression for
audio, we have to consider the nature
of the signal itself. For a start, while
the vertical scale of the PCM data has
to be large enough to encompass the
peak signal amplitude, much of the
time the signal level is much lower
than this; in other words, most audio
has significant dynamic range. During
the quieter passages, PCM ‘wastes bits’
(or fractions of bits) which are always
zero but are nevertheless stored.
So one simple (but limited) approach
to lossless audio compression is to create a PCM-like file format where the
bit depth (resolution) can vary over
time, to save bits by taking advantage
of amplitude variations in the signal.
FLAC effectively does this, using a
technique called “Rice coding”. But
that only gives modest gains in space
efficiency; perhaps 10% at best.
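The idea behind Rice coding can be sketched in a few lines of Python. Note that this is a simplified illustration of the principle (a unary quotient followed by a k-bit remainder, so small values take few bits), not FLAC's actual implementation:

```python
def rice_encode(value, k):
    """Rice-encode a non-negative integer: unary quotient + k-bit remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")   # '0' terminates the unary part

def rice_decode(bits, k):
    q = bits.index("0")                # count the leading 1s
    r = int(bits[q + 1:q + 1 + k], 2)
    return (q << k) | r

# Small values encode into few bits; the parameter k is tuned per block
encoded = rice_encode(9, k=2)          # 9 = 2*4 + 1 -> '110' + '01'
print(encoded)                         # '11001'
print(rice_decode(encoded, k=2))       # 9
```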
A more advanced approach is to
realise that audio signals tend to be
quite repetitive and while subsequent
cycles of a waveform are almost never
identical to the previous cycle, they
are often quite similar. So if you can
use one cycle of audio to ‘predict’ the
next, you only have to store the error
between the prediction and reality. By
then applying that difference during
decoding, you can reconstruct the next
part of the waveform exactly.
This error signal is known as the ‘residual’ and it is normally much lower
in amplitude than the signal itself. The
compressor can store the ‘prediction’
parameters, which take little space,
and since the residual’s amplitude is
low, the aforementioned dynamic bit
depth coding will give a much greater
reduction in storage space.
This process is illustrated with a
typical audio fragment in Fig.3. All
we have done is taken a 90ms snippet
from a music CD (top), then created a
predicted version (middle) by simply
copying each previous cycle to form
the next; something that a decoder can
easily do, since it has access to past
audio samples.
The bottom panel shows the difference between these two signals.
As you can see, it’s much lower in
amplitude than the original and can
thus be stored much more efficiently,
despite the crudeness of our approach.
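The effect can also be demonstrated numerically. The Python sketch below uses a synthetic 1kHz tone rather than real CD audio, 'predicting' each sample by copying the one a full cycle earlier and measuring the size of the residual:

```python
import math

RATE, FREQ = 44100, 1000               # 1kHz tone sampled at 44.1kHz
period = round(RATE / FREQ)            # ~44 samples per cycle

# Three cycles of a near-full-scale 16-bit sinewave standing in for real audio
signal = [round(30000 * math.sin(2 * math.pi * FREQ * n / RATE))
          for n in range(3 * period)]

# Predict each sample from the one a cycle earlier; keep only the difference
residual = [signal[n] - signal[n - period] for n in range(period, len(signal))]

print(max(abs(s) for s in signal))     # close to full scale (30000)
print(max(abs(r) for r in residual))   # far smaller, so it needs fewer bits
```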
Of course, what FLAC and other
lossless compressors do is much more
advanced than this. For example, they can try multiple different prediction methods for each section of audio and store the result of whichever takes up the fewest bits. But the general principle is the same.
Other lossless methods
Meridian Lossless Packing (MLP) is
a commercial lossless CODEC that is
used for DVD-Audio. This is slightly
more space-efficient than FLAC but
it is a proprietary, patented system,
compared to FLAC which is free to
download and use (including the
source code). FLAC is also somewhat
faster, especially to decode. There’s
a lot of detailed information on how
MLP operates at www.meridian-audio.
com/w_paper/mlp_jap_new.PDF
Ultimately, there isn’t much practical difference between FLAC and
the other lossless CODECs, including
Apple Lossless (ALAC), except for
popularity. Apple computer users will
find their software has better support
for ALAC while Windows/Linux users
will be better off with FLAC. Apple
audio player hardware such as iPods
and iPhones also support ALAC while
other devices, including Android
phones, typically use FLAC.
Lossy compression
With lossless compression, you can
fit around 60 CDs worth of audio onto
a 32GB SD card rather than 40 CDs
worth. That’s an improvement but it
is possible to do better, using a lossy
compression method.
One way to do this would be to use the same method as FLAC but 'quantise' the residual. That means effectively rounding some of the residual values up or down, in order to create fewer possible sample values. Fewer values take fewer bits to store, saving space (see Fig.4). This is not a particularly good approach but it would work; since the residual values are usually small, the error signal introduced (shown in red on Fig.4) would also be small.
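The quantisation scheme of Fig.4 can be expressed in a few lines of Python (an illustrative sketch; the step size and clamping here are our own choices):

```python
def quantise(sample, step=4):
    """Map a 5-bit sample (0-31) to one of 8 coded values, as in Fig.4."""
    code = min(7, round(sample / step))    # the 3-bit code actually stored
    return code, code * step               # code plus its reconstructed value

for s in (3, 5, 14, 22, 29):
    code, approx = quantise(s)
    print(s, code, approx, s - approx)     # the quantisation error stays small
```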
To get really good compression
ratios (ie, increase the ratio between
the original PCM data size and the
compressed data size), more advanced
techniques are required. The first lossy compression method with very high compression and good audio quality was MPEG (Moving Picture Experts Group) Audio Layer III, better known as "MP3". This was developed by the Fraunhofer-Gesellschaft research institute in Germany in the early 90s.
MP3 followed on the heels of MPEG-1 Audio Layer 2 (MP2), formalised in
the late 1980s. MP2 is a less advanced
coding method which has a worse
compression ratio. However, it also
results in less audible degradation and
is still in use today for broadcasting, as
it is simpler (and thus faster) to encode
and decode than MP3.
The added complexity of MP3 is in
its heavy reliance on a psychoacoustic
model. This takes advantage of psychoacoustic masking, a property of human hearing whereby tones at certain
frequencies can make simultaneous
tones at other frequencies but of lower
amplitudes inaudible (‘masked’). In
other words, if both tones are present
in a signal, depending on the relative
levels, the human brain will perceive
the louder one but not the other.
Thus, it is possible to do a spectral
analysis of the audio data and ‘chop
out’ certain signal components (frequencies) without (in theory, at least)
affecting how it sounds. This is illustrated in Fig.5. Note that the tones
don’t necessarily need to be simultaneous; if one comes immediately after the
other it may still be masked and this
is referred to as ‘temporal masking’.
The resulting simplified spectrum
can then be quantised (as explained
earlier), re-ordered and compressed
using a standard ‘entropy coding’
method (such as Huffman) to give a
much smaller amount of data than the
original PCM. This is usually done by
chopping the PCM audio data up into
variable-sized overlapping blocks and
then compressing them separately.
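Entropy coding assigns short bit patterns to common values and longer patterns to rare ones. The Python sketch below builds a bare-bones Huffman code table; real CODECs are far more sophisticated but the principle is the same:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman table: frequent symbols get the shortest bit codes."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)               # keeps tuple comparisons off the dicts
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)    # merge the two rarest subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Quantised spectral values cluster around zero, so entropy coding pays off
data = [0, 0, 0, 0, 0, 1, 1, 2]
codes = huffman_codes(data)
print(codes)                               # the five 0s share the shortest code
print(sum(len(codes[v]) for v in data))    # 11 bits vs 16 for fixed 2-bit codes
```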
Fig.5: an illustration of psychoacoustic masking. Tones with amplitudes below the threshold of audibility (mauve) are always inaudible but when loud tones are present (eg, 300Hz @ +65dB as shown in green), even tones above the audibility threshold can be masked and generally not perceived as audible. In this case, the 150Hz +39dB tone (shown in red) is within the other signal's "shadow" and thus could theoretically be removed without changing the overall sound.

Another option to achieve good compression with reasonable sound quality is to separate the spectrum out into the important parts, which the listener is expected to hear, and the unimportant parts which will be partially or completely masked, and then decimate the latter more heavily using a more severe quantisation scheme.
During playback, the compressed
frequency-domain audio blocks are
unpacked and converted back into
time domain data. The snippets of
reconstructed sound are then joined
back together using a ‘windowing’
method to get rid of any discontinuities caused by the imprecise storage
method, ie, where the signal at the
end of one block wouldn’t necessarily
end up at the same voltage as the start
of the next block. Windowing takes
advantage of the overlap to smooth
these transitions.
MP3 also uses the similarities between the two channels in a stereo
recording to reduce the size, storing
them as a sum and (lossy) difference
with a technique known as “joint stereo” (using a method such as ‘intensity
coding’ or ‘mid/side coding’), thereby
saving further space.
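The sum-and-difference idea can be sketched in a few lines of Python. This toy example uses made-up time-domain sample values for clarity, whereas real encoders apply the technique to spectral data:

```python
left  = [100, 120, 90, 80]             # invented sample values for illustration
right = [98, 121, 88, 83]              # very similar to the left channel

# Store the sum and the (small) difference instead of two full channels
mid  = [l + r for l, r in zip(left, right)]
side = [l - r for l, r in zip(left, right)]
print(side)                            # [2, -1, 2, -3]: tiny values, few bits

# Decoding is exact: left = (mid + side)/2, right = (mid - side)/2
dec_left  = [(m + s) // 2 for m, s in zip(mid, side)]
dec_right = [(m - s) // 2 for m, s in zip(mid, side)]
print(dec_left == left and dec_right == right)   # True
```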
The overall amount of compression
varies, depending on how aggressively
the psychoacoustic model removes
‘redundant’ signals and also by controlling the amount of quantisation
of the resulting data. In practice, for
MP3, the reduction in size ranges from
about 77% (320kbps) to 93% (96kbps).
At 96kbps, there will be a rather noticeable impact on audio quality; at 320kbps, not so much.
Note that "kbps" refers to kilobits per second and may also be written "kbit/s". To convert from kilobits per second to kilobytes per second, divide by eight, ie, 128kbps = 16kB/s.
MP3 compression thus allows for
something like 4-10 times more audio
to be stored in the same amount of
space as raw PCM. Or to put it another
way, 250-500 full CDs can fit onto a
32GB SD card with reasonable sound
quality. That’s quite an improvement!
What's the catch?

So what's the catch? Well, if you're listening to relatively high bit-rate MP3 files on a noisy bus or in a car, you probably won't be able to tell the difference between them and the original recordings – even with a decent car audio system. But with a proper hifi set-up, the difference between an MP3 and a CD can be stark for critical listeners. Some more recent lossy CODECs claim to do a better job of reproducing CD quality; more on this below. But if you're a discerning listener with reasonable hearing acuity, lossless compression is still your best choice.

Fig.6: this illustrates the difference between constant bit rate (CBR) and variable bit rate (VBR) encoding. The encoder can either vary the quality factor to maintain a constant bit rate or use a fixed quality factor which results in the bit rate varying with signal complexity. As shown here, both methods can produce a file of the same size but the CBR file will have the more complex passages encoded with a low quality factor which could result in poor sound quality.
Variable vs fixed bit rate
When a lossy compression algorithm is applied to normal audio data,
even if each block of raw audio data
processed is the same size, it will
generally produce compressed data
blocks of varying size. That’s because
the complexity of the audio signal
varies over time.
For example, a cymbal clash
contains a wide range of frequency
components and so will not compress
anywhere near as well as, say, a bass
guitar by itself. Many audio files also
contain short gaps of (near) silence,
which may not be obvious during
listening.
So while the PCM data is recorded or played at a fixed rate, the compressed data stream naturally has a varying rate.
Sometimes, this is undesirable – for
example, in a broadcast, there will be
a fixed amount of bandwidth allocated
to audio. The maximum compressed
data rate must not exceed this and
while smaller blocks could be padded to fit, that would simply waste
bandwidth.
In this case, the best solution is to
adjust the ‘lossiness’ of the compression algorithm block-by-block, in order
to produce compressed data with a
more-or-less fixed bit rate and then
use padding to make up the difference;
see Fig.6.
This also has the advantage that the
ratio between the uncompressed and
compressed data is fixed. For example,
if you are compressing CD-quality
WAV files to 192kbps MP3 files, you
know that the MP3 files will be exactly
192kHz ÷ (44kHz x 16 bits x 2 channels) = 13.6% the size of the originals.
However, there’s little reason to do
this if you are simply creating files to
store on, say, a phone or PC. In this
case, it would make more sense to use
a fixed quality level and let the bit rate
vary. This is known as ‘variable bit
rate’ encoding or VBR.
With VBR, some files will have a
higher compression ratio and some
lower, depending on the content. However, for a given quality setting (which
determines psychoacoustic masking
aggressiveness, quantisation factors,
etc), the variation is generally only of
the order of 25%.
MP3s ain’t identical
You might think that if you used
two different pieces of software to
produce similarly sized MP3 files from
the same CD or WAV file source, they
would sound essentially the same. But
this isn’t necessarily the case. During
MP3 encoding (or indeed, any lossy
encoding), the encoder has thousands
of decisions to make for each block of
audio processed in order to produce
the smallest output which loses the
least information.
For example, during the psychoacoustic modelling process, there are
many signals which could be removed
from the sound in different combinations and different encoders may
choose to remove different frequencies
to achieve the desired reduction in
signal complexity.
There is also a speed trade-off as
encoders which take longer may have
more time to ‘explore’ all the possible
combinations of masking, quantisation etc and determine the best
combination to achieve the required
compressed data size. There are also
many different metrics which the encoder can use to determine which is
the ‘best’ outcome. Therefore, a more
carefully designed MP3 encoder can
produce significantly better sounding
MP3 files at the same size (or even
smaller!) as a poorly written encoder.
So if you’re going to compress
hundreds of CDs to MP3 format, it
pays to do your research first and pick
encoding software which gives the best
quality output. This may even allow
you to use a lower ultimate bit rate for
the same sound quality, thus fitting
more data on to your storage medium.
While this is a subjective evaluation
(and readers are invited to do their own
research via Google), some encoders
are generally considered superior. One
of the better-regarded MP3 encoders is
the free, open source, multi-platform
“LAME”. Even this, though, has
many different settings which give
different results. Suggested quality
levels for a good size/sound quality
trade-off are the "-V0" (~245kbps), "-V1" (~225kbps), "-V2" (~190kbps) or "-V3" (~175kbps) options; see http://wiki.hydrogenaud.io/index.php?title=Lame
Advanced compression
Since MP3 was formalised in 1995,
a number of improved CODECs have
been developed. In some cases, the
aim was to produce an algorithm with a similar compression ratio to MP3 but with better audio quality. In other cases, the aim was to produce better compression without such objectionable artefacts as are present in low bit rate MP3 files (<128kbps).

One of the more successful codecs has been AAC and its variations, AAC+ and HE-AAC. These were developed as a successor to MP3 for the MPEG-4 standard and have also been adopted by Apple for use with iTunes. AAC is generally regarded as having better audio quality than MP3 at the same bit rate while AAC+ is optimised for lower bit rates and gives little or no benefit at settings of 128kbps and above.

In fact, AAC+ is generally inferior to both MP3 and AAC at higher bit rates (192kbps+). While many consider 128kbps AAC to give good sound quality, we feel that as with MP3, you really need 192kbps to even get close to CD quality. Note that DAB digital radio uses AAC encoding and DAB+ uses AAC+, so the same comments apply. Unfortunately, few DAB+ radio stations in Australia are encoded at rates above 64kbps! The result is that they sound inferior to a simultaneous FM broadcast of the same program (assuming good reception).

Ogg Vorbis

Another post-MP3 format is Ogg Vorbis. This was developed specifically because MP3 is a patented algorithm and Fraunhofer charge a fee for using the technology. In contrast, Ogg Vorbis is a free, open source alternative which can give superior audio quality to MP3 at some (usually higher) bit rates.

One source of subjective comparisons of lossy audio CODECs and encoders is http://soundexpert.org/encoders We have graphed the information from this website and smoothed it considerably, to give Fig.7. This suggests that AAC is the best choice above 224kbps, Vorbis the best between 112kbps and 224kbps, and AAC+ the best choice below 112kbps.

Note the large increase in perceived quality above 224kbps, suggesting that if you want to play lossily compressed files through a hifi system, the best compromise between quality and size is probably somewhere around 256-288kbps and thus AAC is the CODEC to use. This gives a compression ratio relative to CD-quality PCM audio data of around 5.5:1 – still very worthwhile.

Fig.7: a comparison of the subjective scores awarded to audio samples compressed with different audio CODECs (AAC, AAC+, MP3, Vorbis, MPC and WMA) at varying stereo bit rates from 32kbps to 320kbps. While this can only be considered a guide, it shows the perceived audio quality of most lossy CODECs goes up significantly above 256kbps and also that certain CODECs seem to sound better than others for particular ranges of bit rates.
To get an idea of what these bit
rates really mean, refer to Table 1.
This shows how much audio you can
fit, in terms of hours or average CDs
per GB. This applies both to storage
(ie, how much you can fit on an xGB
flash drive) and transmission (ie, how
much bandwidth you will use streaming digital radio at a specific bit rate).
Encapsulation
When you have a digital audio
stream, it needs to be “encapsulated”
somehow to be stored. For example,
PCM audio can be encapsulated in
the WAV format which in addition to
the PCM audio data itself, includes
information at the start of the file indicating the sampling rate, bit resolution,
number of channels and length.
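Python's standard library can demonstrate this encapsulation. The sketch below writes one second of silent 16-bit stereo PCM into an in-memory WAV file, then reads the format information back out of the header:

```python
import io
import wave

# Write one second of silent 16-bit stereo PCM into an in-memory WAV file
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)                  # stereo
    w.setsampwidth(2)                  # 16 bits = 2 bytes per sample
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 2 * 44100)

# The header at the start of the file records the format, as described above
buf.seek(0)
with wave.open(buf, "rb") as r:
    params = (r.getframerate(), r.getsampwidth() * 8,
              r.getnchannels(), r.getnframes())
print(params)                          # (44100, 16, 2, 44100)
```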
CD audio (‘redbook’) also uses the
PCM format but it involves a more
complex encapsulation for two reasons: (1) it adds error checking and
correction (ECC) information so that
small scratches or divots on the surface
of the CD do not render it unplayable
(and hopefully won’t affect the sound
at all); and (2) it provides feedback
to the user as to which track they are
listening to, how far they are into the
track and allows seeking and skipping
to specific tracks.
As a result, each ‘sector’ of a CD,
containing 2352 bytes of PCM audio
data, is actually 3234 bytes in size. The
extra 882 bytes per sector includes two
392-byte ECC blocks and 98 bytes of
side-channel/control data. There are
75 sectors of data for each second of
audio. A redbook audio CD also has a
table of contents, listing the location of
up to 99 tracks along with their length,
the duration of any pauses between
tracks etc.
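The sector arithmetic can be verified in a few lines of Python:

```python
AUDIO_BYTES = 2352                     # PCM audio payload per sector
ECC_BYTES = 2 * 392                    # two error-correction blocks
CONTROL_BYTES = 98                     # side-channel/control data
SECTOR_BYTES = AUDIO_BYTES + ECC_BYTES + CONTROL_BYTES
SECTORS_PER_SECOND = 75

print(SECTOR_BYTES)                        # 3234 bytes on disc per sector
print(AUDIO_BYTES * SECTORS_PER_SECOND)    # 176400 bytes of PCM per second...
print(16 * 44100 * 2 // 8)                 # ...matching 16 bits x 44.1kHz x 2
```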
Encapsulation can also include the
ability to store track names, authors,
composers, genre etc. For example,
CD audio includes the ability to store
track names using the CD-Text extension although few discs contain such
information.
Other types of digital audio encapsulation for storage include:
• FLAC: this can be encapsulated in
its own simple container format (.flac),
or it can be stored within an “Ogg”
file, which is the same encapsulation
as used for the Vorbis CODEC which
also supports metadata (track name,
author, etc).
• MP3: can either be stored in an
“elementary stream” (.mp3 file), with
optional ID3 metadata tag at the beginning or end, or in an MPEG stream,
possibly along with video data.
• AAC: can be encapsulated in an
MPEG-2 or MPEG-4 stream. Also used
for DAB+, DVB-H or can be contained
in an “ISO base media file” (.aac file).
• Vorbis: generally either appears in
an Ogg file (with or without Theora
Video) or in a Matroska file, which is
intended to be a flexible multimedia
container format (akin to Microsoft's AVI).

Table 1: Storage Required For Typical Audio Bit Rates

Bit rate        Hours/CDs per GB    Hours/CDs per 32GB    Data per hour/CD
64kbps          37                  1165                  28MB
96kbps          25                  775                   41MB
128kbps         19                  580                   55MB
160kbps         15                  466                   69MB
192kbps         12                  390                   82MB
224kbps         10                  333                   96MB
256kbps         9                   291                   110MB
288kbps         8                   259                   124MB
320kbps         7                   233                   137MB
1.4Mbps (CD)    1.7                 53                    606MB
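The rows of Table 1 can be reproduced with simple arithmetic. The Python sketch below assumes the table uses binary megabytes and gigabytes, which lands close to (though not always exactly on) the published figures:

```python
def table_row(bit_rate_bps):
    """Hours of audio per (binary) GB and MB per hour for a given bit rate."""
    bytes_per_hour = bit_rate_bps / 8 * 3600
    mib_per_hour = bytes_per_hour / 2**20
    hours_per_gb = 2**30 / bytes_per_hour
    return round(hours_per_gb, 1), round(mib_per_hour)

print(table_row(128_000))       # (18.6, 55): ~19 hours/GB, 55MB/hour
print(table_row(1_411_200))     # (1.7, 606): CD-quality PCM
```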
Having been read from the source
file or media, the same data may then
be transmitted to a different piece of
equipment or a different IC within the
same device. This is generally done by
re-encapsulating the extracted digital
audio data in one of several transmission formats:
• S/PDIF: a two-wire format using biphase encoding, intended for transmitting audio data between media players, amplifiers, receivers and so on.
S/PDIF can carry linear PCM, Dolby
Digital, DTS and other formats, along
with metadata describing the contents of the data and its source. The
optical version of S/PDIF is known
as TOSLINK.
• I2S or one of its variants: a simple
method for transmitting PCM audio
data between ICs within a device,
similar to SPI. Typically involves a
bit clock line (typically 32 or 64 times
the sampling rate), word clock (at the
sampling rate), data bit transmit and/
or receive lines, plus a master clock
which is typically between 128 and
1024 times the sampling rate.
• MPEG transport stream: while
this is used as a file format (with an
extension such as .mpg or .mp4) it is
also intended to be used as a transmission format and is used for digital TV,
among other purposes. MPEG streams
can contain video, audio or both and
can also include subtitles and other
metadata.
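For a 44.1kHz stereo stream, the I2S clock relationships described above work out as follows (a sketch; the exact multiples vary from one device to another):

```python
fs = 44_100                            # sampling rate (Hz)

word_clock = fs                        # one cycle per stereo sample frame
bit_clock = 64 * fs                    # eg, 32 bits per channel x 2 channels
master_clock = 256 * fs                # a common multiple, between 128x and 1024x

print(bit_clock, master_clock)         # 2822400 11289600
```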
Multi-channel formats
Multi-channel formats compress
three or more channels of audio for
“surround sound”. They usually have a
relatively high bit rate (eg, 384kbps+) as
they are intended for use with movies
where significant degradation in sound
quality is not acceptable. However,
multi-channel formats are also sometimes used for music recordings, to
give a more ‘immersive’ or ‘live’ sound.
With some exceptions, these formats
generally have inferior sound quality
to CD-quality PCM. Of the two most
common 5.1 channel formats, DTS is
usually considered to have superior
quality to Dolby Digital (AC3) at the
same bit rate.
As with stereo CODECs, multi-channel formats take advantage of the
similarity in content between channels
to achieve good compression. They
also use the fact that some channels
only operate over a limited range of
frequencies, especially the subwoofer
or “low frequency effects” channel (the
“.1” in 5.1 or 7.1).
The sound quality of the left and
right channels is generally the most
critical as these carry most of the
music; centre is used mainly for voice
while surround channels mostly carry
effects so degradation on those channels is less objectionable. Thus, the bit
rate of a 5.1-channel audio stream is
usually no more than about twice that
of a stereo recording.
Dolby Digital 5.1 and DTS 5.1 were
the most common multi-channel formats in the early days of DVDs. More
recently, with the introduction of
HD-DVD (now obsolete) and Blu-ray,
both Dolby Labs and Digital Theatre
Systems have come up with higher
quality formats that support even more
channels, eg, 7.1 surround sound with
a total of eight channels.
More recent multi-channel formats
such as Dolby Digital Plus, Dolby TrueHD, DTS Neo, DTS 96/24 and DTS-HD
increase audio quality through higher
bit rates and in some cases, use lossless
compression. However, the general
principle remains the same.
DVDs use an MPEG-2 stream and
allow linear PCM, MP2, AC3 or DTS
compressed audio data to be interleaved with the video. Multiple audio
streams can be interleaved, to support different numbers of channels or
languages.
DVD-audio adds the ability to carry
Meridian Lossless Packing (MLP) audio
data at higher sampling rates and bit
depths such as 24-bit 96kHz or 24-bit
192kHz. DVD-audio players thus generally have higher-quality DACs plus
the ability to decode these streams. In
addition, DVD-audio discs can contain
Dolby Digital and DTS tracks.
Non-PCM audio data
While virtually every digital audio
format is either based around PCM
or derived from PCM, there are other
formats. Super Audio CD or SACD is
one of these and it is based on Pulse-Density Modulation Encoding (PDME)
which Sony and Philips refer to as
Direct-Stream Digital (DSD).
Rather than using a sampling rate of
44.1kHz, they use 2.8224MHz (ie, 64
times higher) but each sample is just
a single bit. Noise shaping is used to
allow the one-bit data stream to accurately encode an analog signal at a
much lower frequency.
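The principle can be illustrated with a first-order delta-sigma modulator in a few lines of Python. This is a toy sketch; real DSD converters use higher-order noise shaping at the full 2.8224MHz rate:

```python
def pdm_modulate(samples):
    """First-order delta-sigma modulator: one output bit per input sample.
    The accumulated error is fed back so the density of 1s tracks the input."""
    bits, integrator = [], 0.0
    for x in samples:                  # inputs in the range -1.0 to +1.0
        bit = 1 if integrator >= 0 else 0
        integrator += x - (1.0 if bit else -1.0)
        bits.append(bit)
    return bits

# A steady +0.5 input produces 1s about 75% of the time: the one-bit
# stream encodes the analog level as a pulse density
bits = pdm_modulate([0.5] * 1000)
print(sum(bits) / len(bits))           # ~0.75
```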
The reason for using PDME rather
than PCM is that most modern DACs
are the Delta-Sigma type, which typically comprise a 4-bit DAC operating at
a similar frequency, ie, some multiple
of the incoming PCM data sampling
rate. The advantage of this approach
is that it’s much cheaper to fabricate
a 4-bit DAC with good linearity than
a 16-bit DAC. In addition, the much
higher noise frequency means that the
output analog filter doesn’t need to be
anywhere near as steep and so it can
be much simpler.
The logic therefore is this: if the DAC
is going to have to convert the PCM to
some form of PDME internally, why
not simply store and transmit the data
in this format? It certainly is a valid
approach but one criticism levelled at
DSD is that it’s much more difficult to
process audio in this format than PCM
data, and converting between PDME
and PCM is not simple.
Perhaps it is for this reason that
DVD-audio uses traditional PCM encoding, although with higher sampling
rates and bit depths.
SC