Tutorial on Perceptual Audio Coding Algorithms

Markus Erne

Scopein Research
Sonnmattweg 6
5000 Aarau

Abstract

Digital audio devices, such as CD players or DAT recorders, have become common during the past few years. A growing demand for high-quality audio delivery and the increased requirements for the storage of digital audio data have motivated considerable research into compression schemes that can satisfy the conflicting demands of high compression ratios and transparent reproduction quality at the same time. This tutorial presents the principles of lossless and lossy audio compression schemes, with a strong focus on perceptual compression algorithms. Starting from the definition of entropy, lossless audio coding is discussed before perceptual coding is addressed. Psychoacoustic principles, different quantizing and bit-allocation schemes, subband coding and transform coding are presented in more detail in order to introduce different perceptual coding algorithms.
A short overview of the development of perceptual coding algorithms, including the different MPEG-Audio standards and the concept of layers used in MPEG-1/2, is given before the following algorithms are presented in more detail: MPEG-1, MPEG-AAC, Dolby AC-3, ATRAC (MiniDisc) and emerging standards such as MPEG-4 and wavelet-based coding schemes. In MPEG-4 audio, three different coder types are integrated into the standard: coders based on time-frequency mapping (General Audio coders), speech coders and parametric coders, each of which is briefly presented.

1 Introduction

The central objective in audio coding or audio compression is to represent a digital audio signal with a minimum of bits per sample while keeping it transparent (not distinguishable from the original under given listening conditions). Conventional digital audio devices (CD, DAT) typically sample audio signals at rates of 44.1 or 48 kHz, using linear PCM (Pulse Code Modulation) and a quantization of 16 bits per sample. Therefore, data rates in the range of 1.5 Mbit/s result for the storage or the transmission of a stereo signal. In contrast to lossless audio coding, which is based on removing redundancy, lossy coding techniques remove both redundancy and irrelevancy based on perceptual criteria [1]. In archiving applications and for the high-quality transmission or storage of audio signals, lossless coding is preferred in order to allow bit-identical reconstruction of the original signal at the decoder. For consumer and multimedia applications and for the transmission of audio signals over low-bandwidth channels, a lossy compression scheme has to be used in order to guarantee a constant target bitrate. Perceptual coders are based on a psychoacoustic model and take advantage of the masking properties of the human auditory system.
Most audio compression algorithms typically segment the input signal into blocks of 2 ms up to 50 ms duration. A time-frequency analysis then decomposes each analysis block in the encoder. This transformation or subband filtering scheme compacts the energy into a few transform coefficients and therefore de-correlates successive samples. These coefficients, subband samples or parameters are quantized and encoded according to perceptual criteria. Depending on the system objectives, the time-frequency analysis section might contain:
• Unitary transform (MDCT, FFT)
• Polyphase filterbank with uniform bandpass filters
• Time-varying, critically sampled bank of non-uniform bandpass filters
• Hybrid transform/filterbank scheme
• Harmonic/sinusoidal signal analyzer
• Source-system analysis (LPC)

The time-frequency analysis approach always involves a fundamental tradeoff between time and frequency resolution requirements. The choice of the time-frequency analysis method additionally determines the amount of coding delay introduced, a
parameter which may become important in duplex broadcast and live-event applications. The variety of existing musical instruments, such as castanets, harpsichord or pitch-pipe, creates various coding requirements due to their completely different temporal and spectral fine-structure and suggests the use of a filterbank with a flexible time-frequency tiling.

2 Principles of lossless coding

The source entropy [2] of a discrete source is defined as the average information (bit/symbol) generated per symbol if all M symbols are statistically independent and identically distributed:

H(A) = − Σ_{j=1}^{M} P(a_j) log P(a_j)

with H(A): entropy of the signal, and P(a_j): probability of symbol a_j.

Any information can be subdivided into four categories:

Fig 1: Partitioning into relevant vs. irrelevant parts and into redundant vs. non-redundant parts.

The notion of redundancy is connected to the entropy of a signal: the entropy of a signal denotes the least amount of information needed to represent the signal in binary (or any other) number format. The redundancy is the difference between the number of symbols currently used to represent the signal and the lower bound, the entropy. Therefore, the most efficient coding scheme will represent the signal with the least number of symbols, equal to the number of bits given by the entropy, and in that case the redundancy will be zero. Irrelevancy denotes those parts of the signal which cannot be perceived by a human being.

According to these categories, a digital compression scheme targets the removal of either the irrelevant or the redundant information part in a signal, or both. A lossless coding scheme is designed to remove the redundant parts in a signal and therefore will come as close as possible to the Shannon bound of entropy. In practice, however, average compression ratios between 2 and 3.5 can be achieved for material in CD format (16 bits, 44.1 kHz sampling rate), while the compression ratio will vary with the content of the audio signal, resulting in signal-dependent requirements in terms of transmission bandwidth and storage size of the media.

Lossless audio compression is mostly realized by a combination of linear prediction or a transformation followed by entropy coding. The linear predictor attempts to minimize the variance of the difference signal between the predicted sample value and its actual value. The entropy coder (Huffman, LZW) allocates short codewords to samples with high probability of occurrence and longer codewords to samples with lower probability, and in this way reduces the average bit consumption. Lossless coding schemes allow the coded (original) signal to be reconstructed on a bit-by-bit basis and therefore, by definition, cannot degrade the signal quality. Lossless audio coding is used in applications such as archiving, multichannel sound storage, transmission of audio signals and forensic applications. Most lossless audio coders tend to have similar complexity for the encoder as well as for the decoder in terms of DSP instructions needed for realtime computation. Additionally, for the transmission and storage of losslessly compressed audio bitstreams, some additional error protection is required because error concealment on decorrelated samples becomes a difficult issue.

3 Principles of lossy coding

In contrast to a lossless coding system, a lossy compression scheme exploits not only the statistical redundancies but also the perceptual irrelevancies of the signal, as they result from the properties of the human auditory system.

3.1 The human auditory system

Fletcher [3] reported on test results for a large range of listeners, showing their absolute threshold of hearing depending on the stimulus frequency. This threshold in quiet is well approximated by the non-linear function:

Tq(f) = 3.64 (f/1000)^(−0.8) − 6.5 e^(−0.6 (f/1000 − 3.3)^2) + 10^(−3) (f/1000)^4    [dB SPL]

The absolute threshold of hearing is used in some coding schemes in order to scale the tolerable injected quantization noise, but care has to be applied when using this threshold in perceptual audio coding, since the final level of sound reproduction is usually not known at the time of the encoding of the signal. In the cochlea, a frequency-to-place transformation takes place which leads to the notion of critical bands [4].

Figure 2: Absolute threshold of hearing

Perceptual coding algorithms profit from the masking properties of the human auditory system. Masking is a process where one sound (the maskee) becomes inaudible in the presence of another sound (the masker). Masking can occur in the time domain (temporal masking) or in the frequency domain (frequency domain masking). For the implementation of a psychoacoustic model in a perceptual coder, we have to address the tone-masking-noise and the noise-masking-tone situations separately because their influence on the human auditory system is different [5].

Critical bandwidth can be considered as the bandwidth at which subjective responses change abruptly. For example, the perceived loudness of a narrow-band noise at constant sound pressure level remains constant within a critical band and then begins to increase once the bandwidth of the stimulus is increased beyond one critical band [5]. There is an almost infinite number of detectors on the cochlea with vastly overlapping frequency responses, each one having its own critical bandwidth. The so-called Bark scale results if one puts together critical bandwidths such that their borders line up; in the literature, about 25 Bark bands can be found. It turns out that the critical bands at low frequencies show a constant bandwidth of around 100 Hz up to center frequencies of 500 Hz and at higher frequencies tend to turn into a more "constant-Q-type filter".

Bark   Lower     Upper     Band-    Center
scale  edge fl   edge fu   width    frequency
  1        0       100      100        50
  2      100       200      100       150
  3      200       300      100       250
  4      300       400      100       350
  5      400       510      110       450
  6      510       630      120       570
  7      630       770      140       700
  8      770       920      150       840
  9      920      1080      160      1000
 10     1080      1270      190      1170
 11     1270      1480      210      1370
 12     1480      1720      240      1600
 13     1720      2000      280      1850
 14     2000      2320      320      2150
 15     2320      2700      380      2500
 16     2700      3150      450      2900
 17     3150      3700      550      3400
 18     3700      4400      700      4000
 19     4400      5300      900      4800
 20     5300      6400     1100      5800
 21     6400      7700     1300      7000
 22     7700      9500     1800      8500
 23     9500     12000     2500     10500
 24    12000     15500     3500     13500
 25    15500

Figure 3: Critical band derived Bark scale

3.1.1 Frequency domain masking

Frequency domain masking can be observed within critical bands (intra-band masking) or across critical bands (inter-band masking). In the presence of a masker, the absolute threshold of hearing is modified to become the masking threshold. All signals which are below the masking threshold are not perceived by the human auditory system, and therefore the quantization noise in every subband can be as high as the masking threshold permits while maintaining subjectively perfect sound quality. Some algorithms use a subband decomposition with frequency bands approximating the critical bands in order to take advantage of frequency domain masking.

Figure 4: Masking effect, highlighted by the modified masking threshold in the presence of a masker and a maskee
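The threshold-in-quiet approximation from section 3.1 is easy to evaluate numerically. The short sketch below (plain Python; the function name is our own) tabulates Tq(f) at a few of the Bark-band center frequencies listed above; the negative minimum near 3-4 kHz is where hearing is most sensitive.

```python
import math

def threshold_in_quiet_db_spl(f_hz):
    """Absolute threshold of hearing in dB SPL, using the non-linear
    approximation quoted in section 3.1 (f in Hz)."""
    f_khz = f_hz / 1000.0
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)

# A few Bark-band center frequencies (Hz) from the table above.
for fc in (50, 450, 1000, 3400, 13500):
    print(f"{fc:6d} Hz -> {threshold_in_quiet_db_spl(fc):7.2f} dB SPL")
```

At 1 kHz the formula gives roughly 3.4 dB SPL, consistent with the conventional reference near 0 dB.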
The presence of a masker and a maskee generates an excitation in the inner ear which can be characterized by the masking threshold and the spreading function of the masking curve. Assuming that the masker is uniformly quantized with m bits, the Signal-to-Mask Ratio (SMR) denotes the log distance of the masker to the masking threshold in this particular critical band, and the Noise-to-Mask Ratio (NMR) denotes the log distance of the quantization noise level to the masking threshold.

Figure 5: Frequency domain masking within one critical band, showing the signal-to-mask ratio SMR(m) and the noise-to-mask ratio NMR(m) for the noise level of an m-bit quantizer.

3.1.2 Temporal masking

Masking also occurs in the time domain. In the presence of abrupt signal transients, a listener will not perceive signals below the audibility threshold in the pre- and post-masking regions.

Figure 6: Temporal masking

Forward or post-stimulatory masking depicts the masking effect that starts at the end of the masking sound and lasts about two tenths of a second. This effect can physically be explained by the recovery time required by the hair cells after stimulation. Backward or pre-stimulatory masking refers to the phenomenon whereby the threshold of audibility is already raised before the onset of the masking sound. It is assumed that backward masking is caused by the interference of the masker with the not yet completed auditory processing of the sound. Despite the fact that premasking only lasts about 5 ms, it can be used to compensate for pre-echo distortion in transform-based coders. Postmasking could be integrated into the design of the filterbank, but it is rarely implemented in current audio compression algorithms.

Although pre-echoes might be masked by the pre-masking effect, this is only true for small block sizes. Pre-echoes are a known problem of transform coders where an attack of a percussive sound (castanets) starts near the end of an analysis block. The inverse transform then spreads the quantization noise over the whole synthesis block, resulting in audible distortion preceding the signal attack.

Figure 7: The effect of pre-echo distortion, demonstrated with a castanet signal, the coded version of it showing pre-echo distortion, and the difference signal between the original and the coded version.

4 Subband Decomposition

The filterbank is certainly one of the most important parts of a perceptual coding scheme. Despite the fact that there exist audio coding schemes using signal transforms (MDCT, lapped transform, wavelet transform) and others using filterbanks (polyphase filters, time-varying filterbanks, quadrature mirror filters), they are mathematically almost equal and can all be interpreted as filterbanks. Transform coders usually process the signals in blocks of samples, whereas filterbanks based on convolutions may operate on a sample-by-sample basis (but can also perform subsampling). There are several options for the type of filterbank, which directly affect the computational complexity:

Uniform vs. non-uniform filterbanks: uniform filterbanks are rather easy to implement and they are widely used in the ISO/MPEG Audio standard for the Layer-1, Layer-2 and Layer-3 coders. Non-uniform filterbanks often mimic the response of the human auditory system (e.g. critical-band filterbanks) and therefore better approximate the critical bands. Nevertheless, they may not necessarily be superior in terms of coding gain compared to a uniform filterbank.

Figure 8: Audio coder using a uniform polyphase filterbank

Static or time-varying filterbank: ideally, the filterbank should adapt to the signal statistics in order to optimize the time-frequency tiling. Not all coding algorithms use time-varying filterbanks, and window switching and block switching can be considered a first, but also limited, approach to a time-varying filterbank. Block switching is used in order to avoid pre-echoes. In case of a transient, large block sizes will create disturbing artifacts due to smearing of the quantization noise at the decoder throughout the whole block. Therefore, a transient detector will monitor the signal and, in case of a transient, the block size will be switched to typically 128 or even fewer samples. From this principle it becomes obvious that these coding schemes have to use a "look-ahead" principle, which might further increase the overall coding delay. In MLT-based [10] algorithms, audio samples are windowed with an overlap of 50% or more. In case of a transient and a switching of the block size, perfect reconstruction of the filterbank can only be guaranteed when the size and the shape of the window are adapted as well. This process is called "window-switching" and is shown in Figure 8.

5 Block diagram

Figure 9: Block diagram of a basic perceptual audio coder

Combining the individual building blocks, we can now derive a block diagram of a basic audio coder. The time-frequency mapping (filterbank) decomposes the audio signal into subbands and therefore decorrelates successive samples. In the perceptual model, the masking threshold is estimated based on the frequency domain and temporal masking present in the current block of audio data. The reduction in irrelevance is performed in the quantizer. The quantizer can be a uniform quantizer, a non-uniform quantizer or an adaptive quantizer. The quantizer is controlled according to the estimates of the masking threshold calculated by the perceptual model in order to guarantee that the quantization noise in each subband is below the masking threshold. After quantization, the quantized subband samples can be grouped in an efficient manner in the redundancy reduction block using entropy coding techniques. Figure 8 shows the reduction in bitrate resulting from the quantization and the redundancy reduction.

6 Implementations

There exist numerous perceptual audio coding algorithms, and most of them implement the principles explained earlier [6].

6.1 Application-specific algorithms

Some audio compression algorithms have been developed for a specific application. ATRAC (Adaptive Transform Acoustic Coding) [ref], based on a combination of QMF filters and MDCT, is used in MiniDisc recorders. AC-2 [ref] and AC-3 [ref], developed by Dolby Laboratories, are perceptual coding algorithms which are widely used in satellite links and in surround sound applications for cinemas and the home theatre.
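To illustrate how the quantizer control of section 5 uses the SMR of section 3.1.1, the toy bit allocator below (our own simplification, not taken from any standard) gives each subband just enough bits of a uniform quantizer, at roughly 6.02 dB of SNR per bit, to push the quantization noise below an externally supplied masking threshold:

```python
import math

def allocate_bits(band_energies_db, masking_thresholds_db):
    """Toy per-band bit allocation: a uniform m-bit quantizer yields an
    SNR of about 6.02*m dB, so each band needs SNR >= SMR, where
    SMR = band energy - masking threshold (all values in dB). Bands at
    or below the masking threshold are irrelevant and get 0 bits."""
    bits = []
    for energy, mask in zip(band_energies_db, masking_thresholds_db):
        smr = energy - mask              # Signal-to-Mask Ratio (dB)
        if smr <= 0:
            bits.append(0)               # fully masked: transmit nothing
        else:
            bits.append(math.ceil(smr / 6.02))
    return bits

# Example: four subbands; the third is fully masked.
print(allocate_bits([70, 55, 30, 62], [40, 43, 35, 50]))  # -> [5, 2, 0, 2]
```

A real coder would derive the masking thresholds from a psychoacoustic model and iterate the allocation against a bit budget; this sketch only shows the noise-below-threshold criterion itself.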

6.2 MPEG-1 and MPEG-2


MPEG-1 [7] was standardized in 1992 and was developed for an overall target bitrate of 1.5 Mbit/s (audio & video).
The audio part of the MPEG-1 standard defines three different layers, each having different target bitrates, coding delays and complexity.
Layer-1, the implementation with the lowest complexity, was initially developed for DCC applications and was optimized for a bitrate of 192 kbit/s per channel.
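For a rough sense of scale, the per-channel target bitrates quoted in this section can be compared with 16-bit linear PCM at 44.1 kHz (a back-of-the-envelope check; the bitrates are the figures from the text):

```python
# Per-channel compression ratios of the three MPEG-1 layers relative to
# 16-bit linear PCM at 44.1 kHz (target bitrates in kbit/s per channel).
pcm_kbps = 16 * 44100 / 1000          # 705.6 kbit/s per channel
for layer, kbps in (("Layer-1", 192), ("Layer-2", 128), ("Layer-3", 64)):
    print(f"{layer}: {pcm_kbps / kbps:.1f}:1")
```

This works out to roughly 4:1, 5.5:1 and 11:1 for the three layers.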
Layer-2, a derivative of the MUSICAM algorithm [1], has a target bitrate of 128 kbit/s per channel and is widely used in broadcast applications and DAB (Digital Audio Broadcasting [ref]).
Layer-3, the most complex algorithm within this family, targets high audio quality at bitrates down to 64 kbit/s or even less per channel. Layer-3 therefore can be used for low-bitrate applications (satellite transmission, ISDN applications, solid-state memory recorders etc.). More recently, this coder has become extremely popular in the area of Internet audio under its nickname "MP3".

Finalized in 1994, MPEG-2 Audio provides two types of extensions to the MPEG-1 standard. The first extension allows the use of lower sampling frequencies (16 kHz, 22.05 kHz and 24 kHz) for low-bitrate applications. Additionally, multi-channel coding for surround sound applications (including the popular 5.1 configuration) has been added to the MPEG-2 standard in a way which is backward compatible ("BC") with MPEG-1 coding.
In order to enable better multi-channel compression performance than is possible with backward compatibility to MPEG-1 Audio, MPEG-2 Advanced Audio Coding (AAC) was added in 1997, initially named MPEG-2 NBC (Non Backward Compatible) coding. This addendum defines a new coding scheme, offering some advanced features which cannot be integrated into the MPEG-1 standard. Among these features, AAC includes tools such as TNS (Temporal Noise Shaping), intensity coupling, M/S stereo coding and gain control. MPEG-2 NBC can represent up to 48 channels of compressed audio.

6.3 MPEG-4

The upcoming MPEG-4 standard became an International Standard in 1999. In contrast to MPEG-1 and MPEG-2, MPEG-4 is a universal framework for the integration of tools, profiles and levels. MPEG-4 therefore not only includes a bitstream syntax and a set of compression algorithms but also offers a complete framework for synthesis, rendering, transport, compression and the integration of audio (speech, music etc.) and visual data. MPEG-4 audio is subdivided into natural and synthetic coding, whereas natural audio integrates the following coding tools:
• HVXC low-rate clean-speech coder
• CELP wideband speech coder
• GA General Audio coding (AAC-based)
• Twin VQ additional coding tools to increase the coding efficiency at very low bitrates

The target bitrates of MPEG-4 range from 2 kbit/s up to 64 kbit/s per channel and, depending on the application, generic audio and speech coding or a combination of coding and synthesis may be used.
MPEG-4 AAC [8] includes new tools, such as PNS (Perceptual Noise Substitution [14]), which allows transmission bandwidth to be saved for noise-like signals. Instead of coding the noise-like signals as spectral coefficients, a "noise flag" and the total noise power are transmitted, and noise is re-synthesized at the decoder during the period of interest.
An additional feature of MPEG-4 is the concept of scalability. Certain subsets of the entire bitstream are sufficient for the decoding of the audio signal at a lower quality. This feature allows bitrate to be traded against quality and is extremely useful for applications where a certain bandwidth cannot be guaranteed (e.g. audio on the Internet). In MPEG-4, scalability can be achieved in large steps of several kbit/s or by the use of a fine-granularity mode. In the context of scalability, MPEG-4 can make use of a speech core coder and additional enhancement layers.

7 The next generation of audio coding algorithms

In the last few years, many high-quality audio compression algorithms have been developed, but most of them use a perceptual distortion measure and operate at different but fixed bitrates. A variety of these algorithms are based on uniform polyphase filterbanks, modified discrete cosine transforms [1] using window switching or, alternatively, lapped orthogonal transforms [10]. Many proposals for wavelet-based audio coding schemes [11] have been published recently. Uniform polyphase filterbanks can be implemented efficiently, but they do not approximate the human auditory system well and they do not offer large coding gains in a rate-distortion metric. Among filterbanks or transforms which mathematically can be considered equal, wavelet-based filterbanks offer an interesting alternative to lapped orthogonal transforms. Wavelet filterbanks are known for a flexible time-frequency tiling [12], but most wavelet-based audio coding algorithms are focused on mimicking the response of the human auditory system. Some new proposals have been published which target the optimization of the rate-distortion function based on perceptual criteria [13]. Nevertheless, it is fair to say that audio coders using wavelet-packet transforms may exhibit other problems. The short support of the underlying basis functions (wavelets) creates filters which have limited stopband attenuation. In an iterated filterbank, sidelobes apart from the center frequency may appear and may be filled with quantization noise which, at the sidelobe positions, may become unmasked.
An interesting alternative to subband coding systems are parametric coding schemes, which are based on an analysis-synthesis approach. The signal is decomposed into sinusoids and noise-like parts in the encoder, and at the decoder these sinusoids and the noise part are re-synthesized based on the transmitted control parameters. Parametric coders not only offer transmission at extremely low bitrates down to a few kbit/s but, due to the synthesis in the decoder, also offer interesting rendering of the audio signal at the decoder (volume control, pitch change, tempo change etc.).

8 Conclusions

A lot of effort has been put into the development of high-quality low-bitrate audio compression algorithms during the past few years. Audio coding still is a young field of research, and the progress which could be achieved, especially thanks to the MPEG standardization, is remarkable. Current research topics include the development of signal-adaptive filterbanks, a more detailed understanding of the psychoacoustic principles for advanced perceptual models, and the combination of coding and synthesis methods.
Future audio coding algorithms might be based on advanced rendering techniques which allow the decoder to render the audio based on a provided set of parameters. With increasing transmission bandwidth over the Internet, it is possible that the customer at home will have the possibility to perform his own mix, using a multi-channel bitstream, or to choose between different mixes.

Many thanks to Dr. Juergen Herre for reviewing this document.

REFERENCES

[1] Brandenburg K., Stoll G. et al., "The ISO/MPEG-Audio Codec: A Standard for Coding of High Quality Digital Audio", AES Convention Preprint 3336, March 1992
[2] Jayant N., Noll P., "Digital Coding of Waveforms", Englewood Cliffs: Prentice Hall, 1984
[3] Fletcher H., "Auditory Patterns", Rev. Mod. Phys., January 1940, pp. 47-65
[4] Scharf B., "Critical Bands", Foundations of Modern Auditory Theory, Academic Press, 1970
[5] Zwicker E., Fastl H., "Psychoacoustics, Facts and Models", Springer Verlag, 1990
[6] Gilchrist N., Grewin C., "Collected Papers on Digital Audio Bit-rate Reduction", Audio Engineering Society, September 1996
[7] Noll P., "MPEG Digital Audio Coding", IEEE SP Magazine, September 1997, pp. 59-81
[8] Kahrs M., Brandenburg K.H., "Applications of Digital Signal Processing to Audio and Acoustics", Kluwer Academic Press, 1999
[9] Erne M., Moschytz G., "Best Wavelet-Packet Bases for Audio Coding using Perceptual and Rate-distortion Criteria", Proceedings of ICASSP 99, May 1999
[10] Malvar H.S., "Signal Processing with Lapped Transforms", Artech House, Norwood, 1992
[11] Sinha D., Tewfik A., "Low Bit Rate Transparent Audio Compression using Adapted Wavelets", IEEE Trans. on ASSP, Vol. 41, No. 12, December 1993, pp. 3463-3479
[12] Erne M., Moschytz G., "Perceptual and Near-Lossless Audio Coding based on a Signal-adaptive Wavelet Filterbank", 106th AES Convention, May 1999, Munich, Preprint No. 4934
[13] Erne M., Moschytz G., "Audio Coding Based on Rate-Distortion and Perceptual Optimization Techniques", Proc. of the AES 17th International Conference on High Quality Audio Coding, Florence, September 1999, pp. 220-225
[14] Herre J., Johnston J., "Enhancing the Performance of Perceptual Coders by Using Noise Substitution", AES 104th Convention Preprint #4720, 1998
APPENDIX

MPEG-1, Layer-2 Coder

MPEG-1, Layer-3 Encoder


MPEG-2 AAC Encoder

MPEG-4 Scalable Encoder

(Block diagram: after signal analysis and control and pre-processing, a separation stage feeds a parametric (PARA) core, a CELP core and a time/frequency (T/F) core coder, each followed by small-step enhancement modules plus a large-step enhancement module, into a common bitstream multiplex.)
MPEG-2/4 AAC,Temporal Noise Shaping

MPEG-4, Perceptual Noise Substitution

