Tutorial
Markus Erne
Scopein Research
Sonnmattweg 6
5000 Aarau
Abstract
Digital audio devices, such as CD players or DAT-recorders, have become common during the past few years.
A growing demand for high-quality audio delivery and the increased storage requirements of digital audio
data have motivated considerable research into compression schemes that satisfy the conflicting demands of
high compression ratios and transparent reproduction quality at the same time. This tutorial presents the
principles of lossless and lossy audio compression schemes, with a strong focus on perceptual compression
algorithms. Starting from the definition of entropy, lossless audio coding is discussed before perceptual
coding is addressed. Psychoacoustic principles, different quantizing and bit-allocation schemes, subband
coding and transform coding are presented in more detail in order to introduce different perceptual coding
algorithms.
A short overview of the development of perceptual coding algorithms is given, including the different
MPEG-Audio standards and the concept of layers used in MPEG-1/2. The following algorithms are then presented
in more detail: MPEG-1, MPEG-AAC, Dolby AC-3, ATRAC (Minidisc) and emerging standards such as MPEG-4 and
wavelet-based coding schemes. MPEG-4 Audio integrates three different coder types into the standard: coders
based on time-frequency mapping (General Audio coders), speech coders and parametric coders, each of which
is briefly presented.
Fig 1: Partitioning into relevant vs. irrelevant parts and into redundant vs. non-redundant parts.
The notion of redundancy is connected to the entropy of a signal. The entropy of a signal denotes the least
amount of information needed to represent the signal in binary (or any other) number format. The redundancy
is the difference between the number of symbols currently used to represent the signal and this lower bound,
the entropy. Therefore, the most efficient coding scheme represents the signal with the least number of
symbols, equal to the number of bits given by the entropy, and in that case the redundancy is zero.
Irrelevancy denotes those parts of the signal which cannot be perceived by a human being.
According to these categories, a digital compression scheme aims at removing either the irrelevant or the
redundant part of the information in a signal, or both. A lossless coding scheme is designed to remove
3 Principles of lossy coding
In contrast to a lossless coding system, a lossy compression scheme exploits not only the statistical
redundancies but also the perceptual irrelevancies of the signal, as they result from the properties of the
human auditory system.
3.1 The human auditory system
Fletcher [3] reported test results for a large range of listeners, showing their absolute threshold of
hearing as a function of the stimulus frequency. This threshold in quiet is well approximated by the
non-linear function:
$$T_q(f) = 3.64\,(f/1000)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^{2}} + 10^{-3}\,(f/1000)^{4} \quad [\mathrm{dB\ SPL}]$$
where f is the frequency in Hz.
The absolute threshold of hearing is used in some coding schemes to scale the tolerable injected
quantization noise, but care has to be applied when using this threshold in perceptual audio coding since
the final level of sound reproduction is usually not known at the time of the encoding of the signal.
In the cochlea, a frequency-to-place transformation takes place which leads to the notion of critical bands
[4].
Band   Lower edge [Hz]   Upper edge [Hz]   Bandwidth [Hz]   Center [Hz]
17     3700              4400              700              4000
18     4400              5300              900              4800
19     5300              6400              1100             5800
20     6400              7700              1300             7000
21     7700              9500              1800             8500
22     9500              12000             2500             10500
23     12000             15500             3500             13500
24     15500
Figure 2: Absolute Threshold of hearing
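The threshold-in-quiet approximation given in Section 3.1 is easy to evaluate numerically. The following is a minimal sketch under stated assumptions: the function name is chosen here for illustration, and the term (f/1000 - 3.3) inside the exponential is taken as squared, which is the usual form of this approximation.

```python
import math


def threshold_in_quiet_db(f_hz: float) -> float:
    """Approximate absolute threshold of hearing in dB SPL for a frequency in Hz.

    Hypothetical helper name; the formula is the approximation quoted in the
    text, with the (f/1000 - 3.3) term assumed to be squared.
    """
    if f_hz <= 0:
        raise ValueError("frequency must be positive")
    k = f_hz / 1000.0  # frequency in kHz
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)


if __name__ == "__main__":
    # Tabulate the curve at a few frequencies across the audible range.
    for f in (100, 1000, 3300, 10000, 16000):
        print(f"{f:6d} Hz  {threshold_in_quiet_db(f):7.2f} dB SPL")
```

The curve is lowest in the region of a few kHz, where the ear is most sensitive, and rises steeply towards both ends of the audible range. As noted in the text, the result is only meaningful relative to a playback level that the encoder usually does not know.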
Figure 3: Critical band derived Bark scale
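For computations it is often convenient to map a frequency directly to a position on the Bark scale instead of looking it up in a table of critical bands. The sketch below uses a widely quoted closed-form approximation attributed to Zwicker and Terhardt; this formula does not appear in this tutorial and is an assumption introduced here, as is the function name. It deviates from tabulated band edges by a fraction of a Bark.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Approximate critical-band rate (Bark) for a frequency in Hz.

    Assumed closed form (not taken from this tutorial):
    z = 13 * atan(0.76 * f_kHz) + 3.5 * atan((f_kHz / 7.5) ** 2)
    """
    if f_hz < 0:
        raise ValueError("frequency must be non-negative")
    f_khz = f_hz / 1000.0
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)


if __name__ == "__main__":
    # Compare against band edges and centers listed in the critical band table.
    for f in (3700, 4000, 4400, 8500, 13500):
        print(f"{f:6d} Hz  ->  {hz_to_bark(f):5.2f} Bark")
```

For the 4000 Hz center frequency listed for band 17 (3700 Hz to 4400 Hz), the approximation returns a value between 17 and 18, which is consistent with the table.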
Perceptual coding algorithms profit from the masking
properties of the human auditory system. Masking is a
process where one sound (maskee) becomes inaudible
in the presence of another sound (masker). Masking
can occur in the time domain (temporal masking) or
in the frequency domain (frequency domain masking).
For the implementation of a psychoacoustic model in
a perceptual coder we have to address the tone-
masking-noise and the noise-masking-tone situation
separately because their influence on the human
auditory system is different [5].
[Figure: masking threshold, signal-to-mask ratio SMR(m) and signal-to-noise ratio SNR(m)]
where an attack of a percussive sound (castanets) starts near the end of an analysis block. The inverse
Figure 6: Temporal masking
Forward or post-stimulatory masking denotes the masking effect that starts at the end of the masking sound
and lasts about two tenths of a second. This effect can be explained physically by the recovery time
required by the hair cells after stimulation.
Backward or pre-stimulatory masking refers to the phenomenon whereby the threshold of audibility is already
raised before the onset of the masking sound.
4 Subband Decomposition
The filterbank is certainly one of the most important parts of a perceptual coding scheme. Although some
audio coding schemes use signal transforms (MDCT, Lapped Transform, Wavelet Transform) and others use
filterbanks (polyphase filters, time-varying filterbanks, Quadrature Mirror Filters), the two approaches
are mathematically almost equivalent, and transforms can be interpreted as filterbanks. Transform coders
usually process the signals in blocks of samples whereas filterbanks based on convolutions may operate on a
sample-by-sample basis (but can also perform subsampling). There are several options for the type of
filterbank, which directly affect the computational complexity:
Uniform vs. non-uniform filterbanks: Uniform filterbanks are rather easy to implement and they are widely
used in the ISO/MPEG Audio standard for the
Layer-1, Layer-2 and Layer-3 coders. Non-uniform filterbanks often mimic the response of the human auditory
system (e.g. critical band filterbanks) and therefore better approximate the critical bands. Nevertheless,
they are not necessarily superior to a uniform filterbank in terms of coding gain.
Figure 8: Audio coder using a uniform polyphase filterbank
Static or time-varying filterbank: Ideally the filterbank should adapt to the signal statistics in order to
optimize the time-frequency tiling. Not all coding algorithms use time-varying filterbanks; window
switching and block switching can be considered a first, but limited, approach to a time-varying
filterbank. Block switching is used in order to avoid pre-echoes. In case of a transient, large block sizes
create disturbing artifacts because the quantization noise is smeared over the whole block at decoding.
Therefore a transient detector monitors the signal and, in case of a transient, the block size is switched
to typically 128 samples or even fewer. From this principle it becomes obvious that these coding schemes
have to use a “look-ahead” principle, which might further increase the overall coding delay. In MLT-based
[10] algorithms, audio samples are windowed with an overlap of 50% or more. In case of a transient and a
switching of the block size, perfect reconstruction of the filterbank can only be guaranteed when the size
and the shape of the window are adapted as well. This process is called “window switching” and is shown in
Figure 8.
5 Block diagram
Figure 9: Block diagram of a basic perceptual audio coder
Combining the individual building blocks, we can now derive a block diagram of a basic audio coder. The
time-frequency mapping (filterbank) decomposes the audio signal into subbands and therefore decorrelates
successive samples. In the perceptual model, the masking threshold is estimated from the frequency-domain
and temporal masking present in the current block of audio data. The reduction in irrelevance is performed
in the quantizer. The quantizer can be a uniform quantizer, a non-uniform quantizer or an adaptive
quantizer. The quantizer is controlled according to the estimates of the masking threshold calculated by
the perceptual model in order to guarantee that the quantization noise in each subband is below the masking
threshold. After quantization, the quantized subband samples can be grouped in an efficient manner in the
redundancy reduction block using entropy coding techniques. Figure 8 shows the reduction in bitrate
resulting from the quantization and the redundancy reduction.
6 Implementations
There exist numerous perceptual audio coding algorithms and most of them implement the principles explained
earlier [6].
6.1 Application specific algorithms
Some audio compression algorithms have been developed for a specific application. ATRAC (Adaptive Transform
Acoustic Coding) [ref], based on a combination of QMF filters and MDCT, is used in Minidisc recorders. AC-2
[ref] and AC-3 [ref], developed by Dolby Laboratories, are perceptual coding algorithms which are widely
used in satellite links and surround sound applications for cinemas and the home theatre.
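The requirement stated in Section 5, that the quantizer keeps the quantization noise in each subband below the masking threshold, can be illustrated with a greedy bit allocation. This is a minimal sketch and not the allocation loop of any particular coder discussed here: the function name, the bit budget, the per-band limit and the rule of thumb of about 6.02 dB of SNR per bit for a uniform quantizer are all assumptions introduced for illustration. With SMR(m) the signal-to-mask ratio of subband m, the noise-to-mask ratio is NMR(m) = SMR(m) - SNR(m), and noise is taken to be inaudible once NMR(m) is at or below 0 dB.

```python
def allocate_bits(smr_db, bit_budget, max_bits=16, db_per_bit=6.02):
    """Greedy bit allocation across subbands (illustrative sketch).

    smr_db      signal-to-mask ratio per subband in dB (from a perceptual model)
    bit_budget  total number of bits that may be handed out (assumed parameter)
    max_bits    upper limit of bits per subband (assumed parameter)
    db_per_bit  assumed SNR gain per bit of a uniform quantizer
    """
    bits = [0] * len(smr_db)
    for _ in range(bit_budget):
        # Noise-to-mask ratio per subband: NMR = SMR - SNR, with SNR ~ db_per_bit * bits.
        nmr = [s - db_per_bit * b for s, b in zip(smr_db, bits)]
        candidates = [m for m in range(len(bits)) if bits[m] < max_bits]
        if not candidates:
            break
        worst = max(candidates, key=lambda m: nmr[m])
        if nmr[worst] <= 0.0:
            break  # noise is at or below the masking threshold in every subband
        bits[worst] += 1  # spend the next bit where the noise is most audible
    return bits


if __name__ == "__main__":
    smr = [20.0, 5.0, -3.0, 12.0]  # made-up SMR values in dB, for illustration only
    print(allocate_bits(smr, bit_budget=100))
    print(allocate_bits(smr, bit_budget=3))
```

A subband whose SMR is already negative receives no bits at all, which is where the reduction of irrelevance comes from. When the budget is too small, the remaining noise ends up in the subbands with the smallest excess over the threshold.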
Many thanks to Dr. Juergen Herre for reviewing this document.
REFERENCES
[1] Brandenburg K., Stoll G. et al., “The ISO/MPEG-Audio Codec: A Standard for Coding of High Quality
Digital Audio”, AES Convention Preprint 3336, March 1992
[2] Jayant N., Noll P., “Digital Coding of Waveforms”, Englewood Cliffs: Prentice Hall, 1984
[3] Fletcher H., “Auditory Patterns”, Rev. Mod. Phys., January 1940, pp. 47-65
[4] Scharf B., “Critical Bands”, Foundations of Modern Auditory Theory, Academic Press, 1970
[5] Zwicker E., Fastl H., “Psychoacoustics, Facts and Models”, Springer Verlag, 1990
[14] Herre J., Johnston J., “Enhancing the performance of perceptual coders by using Noise substitution”,
AES 104th Convention Preprint 4720, 1998
APPENDIX
[Figure: block diagram with audio signal input, pre-processing, separation, CELP core, T/F core, small step
and large step enhancement modules, multiplex and bit stream output]
MPEG-2/4 AAC, Temporal Noise Shaping