Goldberg, R. G.
"Perceptual Speech Coding"
A Practical Handbook of Speech Coders
Ed. Randy Goldberg
Boca Raton: CRC Press LLC, 2000
© 2000 CRC Press LLC
Chapter 12
Perceptual Speech Coding
The goal of perceptual coding is to reduce the size of the signal represen-
tation while maintaining the perceived sound quality by exploiting the
limits of human auditory perception. The exploitable limits of auditory
perception stem from the frequency and temporal masking discussed in
Chapter 6.
This chapter presents an overview of perceptual speech coding. Fre-
quency and temporal masking are considered together to determine
which signal components are not perceptible. The impact of the sound
quality (tone or noise) of the maskee and masker is discussed. Because
of the limited time and frequency resolution of standard frequency-
domain transforms (discrete Fourier transform), the Multi-Band Exci-
tation (MBE) speech model is shown to have advantages for perceptual
coding. The last section lists a sampling of the current research in per-
ceptual speech coding.
While this chapter discusses perceptual speech coding schemes, these
approaches are not as dominant as perceptual approaches for general
wideband audio coding. To date, the highest quality, lowest rate speech
coders are of the type described in Chapter 11. However, the progress of
future research holds the promise of perceptual coding gains for speech.
12.1 Auditory Processing of Speech
Section 6.4 discussed monaural masking. One sound can mask an-
other simultaneous, lower amplitude sound when the two are close in
frequency. This is referred to as "simultaneous masking in frequency."
When two sounds occur at nearly the same time, the lower level signal
can be masked by the stronger signal, a phenomenon known as "temporal
masking." The challenge of perceptual coding is to determine which
sounds mask which other sounds in a complex, rapidly varying speech
signal.
12.1.1 General Perceptual Speech Coder
Most of the algorithm processing steps of a perceptual speech coder are
similar to those of conventional speech coders. The primary difference
is the determination and deletion of signal components that are not
perceptible.
FIGURE 12.1
General perceptual speech coder.
Figure 12.1 displays a block diagram of a general perceptual speech
coder. The input speech is analyzed, yielding a short-term spectral rep-
resentation of the vocal tract and excitation information. These
parameterizations are transformed by an auditory analysis into a perceptual
representation. In the perceptual representation, the frequency scale is
warped to a nonlinear scale based on critical bands (see Chapter 6).
Within the perceptual domain, masking and masked signal compo-
nents are determined. The masked components, which are not percep-
tible, are deleted from the representation or marked to be coded more
coarsely, that is, with fewer bits. The reduced perceptual representation
results in a lower bit rate due to the reduced number of parameters, but
is used to synthesize output speech of the same perceived quality as the
complete representation. Determining the particulars of the masking is
discussed in more detail in the following sections.
As with any speech encoding/decoding system, the decoder merely
reverses the operations to synthesize the output speech.
12.1.2 Frequency and Temporal Masking
It is well known that simultaneous masking in frequency is more
prominent when the masker is lower in frequency than the maskee. Re-
ferring back to Figure 6.4, the plotted threshold of detectability is much
lower for frequencies below the masker than for frequencies above it.
This observation suggests an efficient method to determine which com-
ponents are masked:
1. Transform each short time segment of speech into the frequency
domain.
2. Segment the frequency-domain representation into logarithmically
spaced frequency bands (constant number of barks per frequency
segment).
3. Calculate the total energy in the lowest band.
4. Determine the threshold of detectability within this critical band
and in the higher frequency critical bands.
5. Code only frequency information above the threshold level.
6. Continue threshold calculation/coding process for the next higher
critical band.
7. Repeat steps 3 through 6 until all critical bands are coded.
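The seven steps above can be sketched as follows. This is a simplified illustration, not the book's implementation: the logarithmic band edges, the use of each band's peak level as a proxy for its energy, and the flat 20 dB masking margin are all assumptions.

```python
import numpy as np

def band_edges_hz(fs, n_bands):
    """Logarithmically spaced band edges -- a rough stand-in for
    constant-Bark critical-band edges (step 2)."""
    return np.logspace(np.log10(100.0), np.log10(fs / 2.0), n_bands + 1)

def code_masked_bands(frame, fs, n_bands=16, margin_db=-20.0):
    """Steps 1-7: keep only spectral samples that rise above a per-band
    masking threshold derived from the band's peak level.
    margin_db is an assumed threshold offset, not a value from the text."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))     # step 1
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    mag_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    kept = np.zeros(mag_db.shape, dtype=bool)
    edges = band_edges_hz(fs, n_bands)
    for lo, hi in zip(edges[:-1], edges[1:]):                  # steps 3-7, low to high
        in_band = (freqs >= lo) & (freqs < hi)
        if not in_band.any():
            continue
        band_level_db = mag_db[in_band].max()                  # step 3 (peak as energy proxy)
        threshold_db = band_level_db + margin_db               # step 4 (flat threshold)
        kept[in_band] = mag_db[in_band] > threshold_db         # step 5: code only these
    return kept
```

A full implementation would also extend each band's threshold into the higher-frequency critical bands, as step 4 specifies; this sketch applies the threshold only within the band itself.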
Although more complex, this method could be extended to include
masking regions where the maskee is in a lower frequency critical band
than the masker.
FIGURE 12.2
Island of perceptually significant signal and resulting area of
masking.
The previously described method is highly efficient;
however, it does not take full advantage of the properties of simultaneous
masking. The method calculates a saturation threshold level for each
critical band, but it does not take into account spectral changes within
each critical band, nor does it consider the effect of temporal masking
across different frames.
Simultaneous frequency and temporal masking suggest that
substantial economies in coding can be gained by spectral analysis
to determine "islands" of perceptually significant signals in the
time/frequency/intensity dimensional representation. Figure 12.2 shows
an intense complex signal surrounded in the time/frequency domain by
a box which represents the signals that would be masked in the presence
of this complex signal. These complex signals appear as high intensity
"islands" in a typical spectrogram. The majority of the available coding
capacity can then be assigned to accurately represent these islands and
a minimum assigned to regions masked by these islands.
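The island idea can be sketched as a toy computation (not the book's algorithm): mark cells of a spectrogram near its peak as islands, then flag a rectangular time/frequency box around each island as masked. The box size and both dB margins are assumed values.

```python
import numpy as np

def masked_region_map(spec_db, island_db=-20.0, t_spread=2, f_spread=3,
                      mask_drop_db=25.0):
    """Sketch of Figure 12.2: cells within island_db of the frame peak are
    'islands'; a box of (2*t_spread+1) frames by (2*f_spread+1) bins around
    each island is marked masked wherever it lies more than mask_drop_db
    below the island level.  All parameters are illustrative assumptions."""
    n_t, n_f = spec_db.shape
    masked = np.zeros(spec_db.shape, dtype=bool)
    peak = spec_db.max()
    islands = np.argwhere(spec_db >= peak + island_db)
    for ti, fi in islands:
        level = spec_db[ti, fi]
        t0, t1 = max(0, ti - t_spread), min(n_t, ti + t_spread + 1)
        f0, f1 = max(0, fi - f_spread), min(n_f, fi + f_spread + 1)
        box = spec_db[t0:t1, f0:f1]
        masked[t0:t1, f0:f1] |= box < level - mask_drop_db
    # the islands themselves always remain coded
    masked[spec_db >= peak + island_db] = False
    return masked
```

Coding capacity would then be concentrated on the unmasked cells, with a minimum spent on the masked ones.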
12.1.3 Determining Masking Levels
In Figure 12.1, an auditory analysis of the speech parameters is per-
formed on each frame of speech. The auditory analysis transforms the
signal representation into the perceptual representation. The high in-
tensity regions in the time-frequency plane mask (either partially or
completely) some of the less intense regions as in Figure 12.2. This
masking causes the threshold of detectability of the maskee to be in-
creased. If the threshold of detectability of a region is greater than the
intensity of that region, then the portion of the signal denoted by that
region is not audible to human hearing. These values are calculated
by comparing all the regions to each other and determining how much
the threshold of detectability is raised for each time-frequency region.
Psycho-acoustic data such as those represented in Figures 6.4 and 6.5
are used in the calculations of these values.
Figure 12.3 is a 3-dimensional representation of the union of a partic-
ular set of simultaneous and temporal masking data. The time scale is
the time difference between the masker and maskee, and the frequency
scale is the frequency difference between the two. The peak of the sur-
face is the origin, where the relative time, relative frequency, and relative
amplitude are all zero.
This graphical representation can lend insight into the workings of
perceptual speech coding. A time/frequency/magnitude representation
of a speech utterance can be displayed as a 3-dimensional surface. This
is a 3-D representation of the spectrograms of Chapter 2, where the
amplitude, displayed as shades of gray in the spectrograms, is now the
vertical height of the surface.
Figure 12.4 displays this data representation for a half-second segment
of speech for the frequency range of 0 to 1000 Hz. This representation
can be visualized as a mountainous landscape. High elevation areas cor-
respond to high-amplitude signal regions located at particular time and
frequency coordinates. The ridges running across time, of nearly con-
stant frequency, are the pitch harmonics. (The same pitch harmonics
appear as dark bands in the spectrograms of Chapter 2.) If the moun-
tainous speech landscape is divided up into segments, the time divisions
correspond to different analysis frames, and the frequency divisions cor-
respond to dividing the spectrum into critical bands.
Visualize a copy of Figure 12.3 (appropriate for the frequency of the
masker) placed at the time/frequency coordinate of the masker under
consideration in the speech landscape of Figure 12.4. The surface of
Figure 12.3 will be below the surface of the speech landscape of Figure
12.4 at some places, and above at others.
FIGURE 12.3
Psycho-acoustic masking data, both temporal and frequency.
Figure 12.3 represents the
threshold of detectability. When the surface of the speech landscape is
below, those sounds cannot be heard. When the surface of the speech
landscape is above, those sounds can be heard, relative to the masker
under consideration.
This process is repeated for all time/frequency coordinates of the
speech landscape, with the appropriate masking surfaces, to determine
which sounds are masked by which others.
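This repeated placement of the masking surface can be sketched as follows, assuming the speech landscape and the relative masking surface are given as 2-D arrays in dB. The array names and the brute-force search are illustrative only.

```python
import numpy as np

def audible_map(landscape_db, masking_surface_db):
    """For each masker coordinate in the speech 'landscape', place the
    relative masking surface at that coordinate and raise the threshold of
    detectability of the surrounding cells.  masking_surface_db is a small
    2-D array of threshold offsets in dB relative to the masker, centred on
    its middle element -- a simplified stand-in for the data of Figure 12.3."""
    n_t, n_f = landscape_db.shape
    s_t, s_f = masking_surface_db.shape
    ct, cf = s_t // 2, s_f // 2
    threshold = np.full_like(landscape_db, -np.inf)
    for ti in range(n_t):
        for fi in range(n_f):
            level = landscape_db[ti, fi]          # treat this cell as a masker
            for dt in range(s_t):
                for df in range(s_f):
                    tt, ff = ti + dt - ct, fi + df - cf
                    if (tt, ff) == (ti, fi):
                        continue                  # a sound does not mask itself
                    if 0 <= tt < n_t and 0 <= ff < n_f:
                        thr = level + masking_surface_db[dt, df]
                        threshold[tt, ff] = max(threshold[tt, ff], thr)
    return landscape_db > threshold               # True where audible
```

Cells whose landscape height never rises above any raised threshold are the masked sounds and need not be coded.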
FIGURE 12.4
Time/frequency/magnitude representation of a segment of
speech.
12.2 Perceptual Coding Considerations
The discussion of the previous section describes in a conceptual man-
ner the application of simultaneous and temporal masking data to de-
termine which signal components are not perceptible. Two other spe-
cific considerations impact practical application of masking in percep-
tual coding. Standard frequency domain transformations include limits
on their time/frequency resolution (size of the x−y grid spacing on the
speech landscape). Additionally, masking data sets (the surface of Fig-
ure 12.3) differ, depending on whether the masker is tone-like or noise-
like, and whether the maskee is tone-like or noise-like.
12.2.1 Limits on Time/Frequency Resolution
Wideband perceptual coding (bandwidth of approximately 20 kHz) is
used in audio coding in standards such as MPEG-2 and MPEG-4. The
duration of the analysis window used in these wideband coding tech-
niques is around 10 ms [82, 12, 83]. The reciprocal relation between
time and frequency dictates that a frequency resolution of only 100 Hz
can be obtained using standard frequency-domain transforms (discrete
Fourier transform) for this time resolution (10 ms). Here, the frequency
resolution refers to the frequency spacing between samples of the DFT.
Given this frequency resolution, it is possible to locate regions of si-
multaneous masking in frequency at high frequencies (i.e., above 5 kHz)
using the type of data in Figure 6.4. Significant economies in coding
can be achieved with frequency domain analysis for wideband audio sig-
nals. However, for the lower frequency regions of signals, a much higher
frequency resolution is required to exploit the masking properties of the
human auditory system. This results from the much narrower frequency
spacing of the critical bands at low frequencies.
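The reciprocal relation can be checked numerically. The 16 kHz sample rate below is an assumed value; the resulting bin spacing depends only on the window length.

```python
import numpy as np

fs = 16000                         # sample rate in Hz (assumed for illustration)
window_s = 0.010                   # 10 ms analysis window, as in the text
n = int(fs * window_s)             # 160 samples per window
freqs = np.fft.rfftfreq(n, 1.0 / fs)
resolution = freqs[1] - freqs[0]   # DFT bin spacing = 1 / window length
print(resolution)                  # 100.0 Hz: too coarse for low critical bands
```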
For temporal masking analysis, a time length as short as 10 ms is
useful to take advantage of the qualities of both forward and backward
masking. This time resolution is crucial because of the rapid drop off
of the amount of masking with time (see Figure 6.5). Longer analysis
windows would blend together separate sounds. However, as described,
this frequency resolution (100 Hz) is not sufficient to separate the low
frequency critical bands for simultaneous masking. To fully exploit the
properties of both simultaneous masking and temporal masking, it is
necessary to bypass the constraints imposed by the reciprocal relation
between time and frequency. Section 12.2.3 suggests a method to circum-
vent these limitations by utilizing information about the human vocal
system.
12.2.2 Sound Quality of Signal Components
Psycho-acoustic experimentation on the auditory system [45, 141, 139,
176, 31] has revealed that a tone masked by a broad band of noise is
different from a broad band of noise masked by a broad band of noise, a
tone masked by a tone, or a broad band of noise masked by a tone. (It has
been shown that a narrow band of noise has masking properties similar to
those of a pure tone [59, 60, 61].) This is because the notches in the plot of
Figure 6.4 occur only during tone-on-tone and broadband-noise-on-tone
masking. These notches are not present for tone-on-broadband-noise or
broadband-noise-on-broadband-noise masking [81, 60]. "The notch has
been shown to be caused by the detection on the lower frequency side of
the masker, of the combination tones that are produced by the addition
of the masker and the signal" [60].
For perceptual coding, it is important to know the characteristics of
the signal. With simple signal processing techniques, the speech spec-
trum of the short-time analysis window is segmented into discrete fre-
quency bands. Each subband is classified as either noise-like or tone-like
so it can be determined which type of masking occurs on each subband
of the signal.
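One simple classifier of this kind uses the spectral flatness measure (the ratio of the geometric to the arithmetic mean of the band's power spectrum), which is near 1 for white noise and near 0 for a tone. SFM is a common choice, not necessarily the book's method, and the 0.3 decision threshold is an assumed value.

```python
import numpy as np

def classify_subbands(frame, fs, edges_hz, flatness_threshold=0.3):
    """Label each subband tone-like or noise-like via the spectral
    flatness measure.  edges_hz lists the subband boundaries in Hz;
    the 0.3 threshold is an assumed value."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    labels = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        band = power[(freqs >= lo) & (freqs < hi)] + 1e-20  # floor avoids log(0)
        sfm = np.exp(np.mean(np.log(band))) / np.mean(band)
        labels.append("tone" if sfm < flatness_threshold else "noise")
    return labels
```

The resulting label per subband then selects the appropriate masking data set (tone-on-tone, noise-on-tone, and so on).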
12.2.3 MBE Model for Perceptual Coding
The Multi-Band Excitation (MBE) speech model, discussed in Sec-
tion 11.1, provides an approach to handle the considerations of the two
previous sections: tone-like or noise-like signal component classification
and limits on time/frequency resolution.
The MBE model divides the frequency spectrum into frequency bins
centered at harmonics of the pitch frequency of the speech signal. The
analysis classifies each frequency bin as voiced or unvoiced. Voiced bins
are characterized by a pitch harmonic (tone) located at that frequency.
Unvoiced bins are characterized by a band of white noise across the
frequency bin. This provides an inherent tone-like versus noise-like clas-
sification of signal components in the MBE analysis. This classification
can be used to select the appropriate masking data for perceptual cod-
ing, based on the sound qualities of the masker and maskee.
By assuming that speech follows the basic properties of the MBE
speech model, the complex magnitudes of the speech spectra at harmon-
ics of the pitch frequency, and the associated voiced/unvoiced decisions,
determine the speech spectra completely. Based on the MBE speech
analysis, the temporal resolution of the signal corresponds to the frame
rate. Considering psycho-acoustic frequency and temporal masking data
and critical bands, a 10 ms temporal resolution and 25 Hz frequency res-
olution are required to sufficiently determine the masked regions of the
signal. The 10 ms temporal resolution is required in order to utilize the
strongest aspects of temporal masking, when the maskee and masker are
close in time (see Figure 6.5). Because the critical bands in the
frequency region below 800 Hz are narrower than 75 Hz, a 25 Hz frequency resolution
is needed to ensure at least two frequency bins in most critical bands.
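A sketch of how MBE parameters define the spectrum on an arbitrarily fine grid, independent of any DFT window length, follows. The function name, the bin handling, and the noise normalization are illustrative assumptions, not the book's formulation.

```python
import numpy as np

def mbe_spectrum(f0_hz, harmonic_mags, voiced, grid_hz=25.0, f_max=4000.0):
    """Reconstruct a magnitude spectrum on a fine frequency grid from
    MBE-style parameters: pitch, per-harmonic magnitudes, and
    voiced/unvoiced flags.  Because the model, not a DFT of the frame,
    defines the spectrum, the 25 Hz grid is not limited by the 10 ms
    frame length."""
    freqs = np.arange(0.0, f_max, grid_hz)
    spectrum = np.zeros_like(freqs)
    for k, (mag, is_voiced) in enumerate(zip(harmonic_mags, voiced), start=1):
        fk = k * f0_hz                      # centre of frequency bin k
        if is_voiced:
            # tone: place the harmonic at the nearest grid point
            spectrum[np.argmin(np.abs(freqs - fk))] = mag
        else:
            # noise: spread the magnitude evenly across the bin
            in_bin = (freqs >= fk - f0_hz / 2.0) & (freqs < fk + f0_hz / 2.0)
            if in_bin.any():
                spectrum[in_bin] = mag / np.sqrt(in_bin.sum())
    return freqs, spectrum
```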
MBE Analysis/Synthesis with Masking
The masking approach described in Section 12.1.3 was applied to
speech data analyzed and synthesized using the MBE model without
quantization [56, 57]. The voiced/unvoiced classification of frequency
bins was used to select appropriate masking data.
Degradation mean opinion score (DMOS) (see Section 8.2.2) listen-
ing tests were performed to rate the relative quality of two processing
schemes. The first scheme analyzed and synthesized the speech data us-
ing the MBE model without quantization or other altering of the model
parameters. This data was used as the reference. The second processing
algorithm eliminated specific spectral magnitude information as guided
by the masking thresholds.
The perceptual processing was designed to yield perceptual quality
measurements equal to those of the unaltered MBE parameters. This
indicates that the additional auditory processing is functioning trans-
parently, by not adding perceptible degradations.
Figure 12.5 displays the results of the listening tests. A DMOS score
of 4, indicating no perceived degradation, was obtained when the threshold
was held at 10 dB. Although the speech is no longer coded transparently
when the threshold is raised above this level, the information removed
first is of low perceptual significance, making this a useful technique
for lowering the bit rate of the coder.
12.3 Research in Perceptual Speech Coding
Researchers are actively investigating the field of perceptual speech
coding. Current efforts are being directed at two primary concepts:
transforming the speech signal into a perceptual representation and dis-
tributing quantization noise below masking thresholds. The two are
related, and both are necessary to improve coder quality through per-
ceptual considerations.
Johnston [82, 12] was the first to use masking criteria to distribute
quantization bits in wideband audio coding. He calculated the percep-
tual significance of each frequency band in an audio signal using simul-
taneous masking calculations and distributed quantization bits accord-
ingly. Huang [76] extended these techniques to include forward masking
criteria. Johnston's work is now being incorporated into speech coders.
FIGURE 12.5
MBE synthesized data compared to auditory processed data.
Quality degrades when masking threshold is raised above 10
dB. Below 10 dB there is no audible degradation.
Bourget et al. [11], Sen and Holmes [145], and Soheili et al. [149] use
perceptual measures in creating the excitation codebook in CELP-based
coders. Carnero and Drygajlo [32, 18] decompose the signal into crit-
ical bands and use masking thresholding to determine bit allocation.
Najafzadeh-Azghandi and Kabal [118, 119] as well as George and Smith
[49] use perceptual masking thresholds to train the vector quantization
in a sinusoidal based coder.
Both Virag [161] and Drygajlo and Carnero [32] are utilizing acoustic
masking techniques for speech enhancement. Kubin and Kleijn are work-
ing on computationally efficient perceptual speech coding algorithms
[99]. Tang et al. are using perceptual techniques within a subband
speech coder [157].
Much work has been directed at using perceptual criteria to distribute
coding error to minimize perceived degradation [82, 76, 102, 11, 18, 99,
149, 145, 118, 119]. In [145], Sen and Holmes attempt to shape the error
spectrum of a CELP coder such that it is below the calculated masking
threshold. This method quantizes the areas of the speech spectrum
which lie above the masking threshold with minimal distortion, while
quantizing those regions below the masking threshold with as much dis-
tortion as the masking threshold will allow. Reducing the coding bit rate
and, correspondingly, raising the quantization noise above the masking
threshold introduced only minor perceptual distortion. The reported
perceived effect of this coder is much smoother, more natural sounding
decoded speech than typical CELP encoded/decoded speech at the same
bit rate.
Drygajlo and Carnero approach coding and speech quality enhance-
ment in the same algorithm [32]. The method uses a wavelet decompo-
sition (a transformation similar to an FFT, but with unevenly spaced
frequency basis functions placed to resemble critical bands) to obtain
frequency responses of critical bands to help efficiently calculate mask-
ing thresholds. Coding bits are allocated such that the quantization
noise remains below the masked threshold of detectability.
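Such a bit-allocation rule can be sketched with the standard ~6.02 dB-per-bit SNR approximation. This is a simplification of the wavelet-based scheme described above; the per-band signal and threshold levels in dB are assumed to be precomputed.

```python
import numpy as np

def allocate_bits(signal_db, threshold_db, db_per_bit=6.02):
    """Per-band bit allocation keeping quantization noise just below the
    masked threshold: each bit buys roughly 6.02 dB of SNR, so a band
    needs about (signal - threshold) / 6.02 bits; bands entirely below
    threshold get zero bits."""
    headroom = np.asarray(signal_db) - np.asarray(threshold_db)
    return np.maximum(0, np.ceil(headroom / db_per_bit)).astype(int)
```

For example, a band 40 dB above its masked threshold receives 7 bits, one 10 dB above receives 2, and one below threshold receives none.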