19 Speech Signal Processing
In this chapter we treat one of the most intricate and fascinating signals
ever to be studied, human speech. The reader has already been exposed
to the basic models of speech generation and perception in Chapter 11. In
this chapter we apply our knowledge of these mechanisms to the practical
problem of speech modeling.
Speech synthesis is the artificial generation of understandable, and (hope-
fully) natural-sounding speech. If coupled with a set of rules for reading text,
rules that in some languages are simple but in others quite complex, we get
text-to-speech conversion. We introduce the reader to speech modeling by
means of a naive, but functional, speech synthesis system.
Speech recognition, also called speech-to-text conversion, seems at first to
be a pattern recognition problem, but closer examination proves understand-
ing speech to be much more complex due to time warping effects. Although
a difficult task, the allure of a machine that converses with humans via natu-
ral speech is so great that much research has been and is still being devoted
to this subject. There are also many other applications: speaker verifica-
tion, emotional content extraction (voice polygraph), blind voice separation
(cocktail party effect), speech enhancement, and language identification, to
name just a few. While the list of applications is endless, many of the basic
principles tend to be the same. We will focus on deriving ‘features’,
i.e., sets of parameters that are believed to contain the information needed
for the various tasks.
Simplistic sampling and digitizing of speech requires a high information
rate (in bits per second), meaning wide bandwidth and large storage re-
quirements. More sophisticated methods have been developed that require
a significantly lower information rate but introduce a tolerable amount of
distortion to the original signal. These methods are called speech coding
or speech compression techniques, and the main focus of this chapter is
to follow the historical development of telephone-grade speech compression
techniques that successively halved bit rates from 64 to below 8 Kb/s.
Figure 19.1: LPC speech model. The U/V switch selects one of two possible excitation
signals, a pulse train created by the pitch generator, or white noise created by the noise
generator. This excitation is input to an all-pole filter.
This extremely primitive model can already be used for speech synthesis
systems, and indeed was the heart of a popular chip set as early as the 1970s.
Let’s assume that speech is approximately stationary
for at least T seconds (T is usually taken to be in the range from 10 to
100 milliseconds). Then in order to synthesize speech, we need to supply our
model with the following information every T seconds. First, a single bit
indicating whether the speech segment is voiced or unvoiced. If the speech
is voiced we need to supply the pitch frequency as well (for convenience
we sometimes combine the U/V bit with the pitch parameter, a zero pitch
indicating unvoiced speech). Next, we need to specify the overall gain of
the filter. Finally, we need to supply any set of parameters that completely
specify the all-pole filter (e.g., pole locations, LPC coefficients, reflection
coefficients, LSP frequencies). Since there are four to five formants, we expect
the filter to have 8 to 10 complex poles.
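To make the frame-by-frame bookkeeping concrete, the following sketch (in Python, assuming numpy) generates one frame of synthetic speech from exactly the parameters listed above: a gain, a pitch value (with zero marking an unvoiced frame), and the all-pole filter coefficients. The function and parameter names are ours, chosen for illustration, and pulse phase is not carried across frames as a real synthesizer would do.

    import numpy as np

    def synthesize_frame(lpc, gain, pitch, n_samples, fs=8000, state=None):
        """One frame of the two-source LPC synthesis model.

        lpc   -- coefficients b_1..b_M of the all-pole filter G / (1 - sum b_m z^-m)
        gain  -- overall gain G for this frame
        pitch -- pitch frequency in Hz; 0 marks an unvoiced (noise-excited) frame
        state -- the last M output samples, carried over from the previous frame
        """
        M = len(lpc)
        hist = list(state) if state is not None else [0.0] * M

        # excitation: impulse train for voiced frames, white noise for unvoiced
        excitation = np.zeros(n_samples)
        if pitch > 0:
            excitation[::int(round(fs / pitch))] = 1.0
        else:
            excitation = np.random.randn(n_samples)

        # all-pole filtering: s[n] = G*e[n] + sum_m b_m s[n-m]
        out = np.zeros(n_samples)
        for n in range(n_samples):
            out[n] = gain * excitation[n] + sum(b * h for b, h in zip(lpc, hist))
            hist = [out[n]] + hist[:-1]
        return out, hist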
How do we know what filter coefficients to use to make a desired sound?
What we need to do is to prepare a list of the coefficients for the various
phonemes needed. Happily this type of data is readily available. For example,
in Figure 19.2 we show a scatter plot of the first two formants for vowels,
based on the famous Peterson-Barney data.
Figure 19.2: First two formants from Peterson-Barney vowel data. The horizontal axis
represents the frequency of the first formant between 200 and 1250 Hz, while the vertical
axis is the frequency of the second formant, between 500 and 3500 Hz. The data consists of
each of ten vowel sounds pronounced twice by each of 76 speakers. The two letter notations
are the so-called ARPABET symbols. IY stands for the vowel in heat, IH for that in hid,
and likewise EH head, AE had, AH hut, AA hot, AO fought, UH hood, UW hoot, ER
heard.
EXERCISES
19.1.1 The Peterson-Barney data is easily obtainable in computer-readable form.
Generate vowels according to the formant parameters and listen to the result.
Can you recognize the vowel?
19.1.2 Source code for the Klatt formant synthesizer is in the public domain. Learn
its parameters and experiment with putting phonemes together to make
words. Get the synthesizer to say ‘digital signal processing’. How natural-
sounding is it?
19.1.3 Is the LPC model valid for a flute? What model is sensible for a guitar? What
is the difference between the excitation of a guitar and that of a violin?
there is no reason to try to model speech that doesn’t exist. Simple devices
that trigger on speech go under the name of VOX, for Voice Operated
X (X being a graphic abbreviation for the word ‘switch’), while the more
sophisticated techniques are now called Voice Activity Detection. Simple
VOXes may trigger just based on the appearance of energy, or may employ
NRT mechanisms, or use gross spectral features to discriminate between
speech and noise. The use of zero crossings is also popular as these can
be computed with low complexity. Most VADs utilize parameters based on
autocorrelation, and essentially perform the initial stages of a speech coder.
When the decision has been made that no voice is present, older systems
would simply not store or transfer any information, resulting in dead silence
upon decoding. The modern approach is to extract some basic statistics of
the noise (e.g., energy and bandwidth) in order to enable Comfort Noise
Generation (CNG).
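As an illustration of the simpler end of this range, the sketch below makes a crude VOX-style decision from frame energy and a zero-crossing count; the thresholds are arbitrary placeholders and would have to be tuned (or made adaptive by tracking the noise floor) in any real system.

    import numpy as np

    def simple_vad(frame, energy_thresh=1e-4, zc_max=70):
        """Crude energy plus zero-crossing activity decision for one frame.

        Returns True if the frame is declared to contain speech. Both thresholds
        are illustrative only; broadband noise typically shows many more zero
        crossings per frame than voiced speech does.
        """
        frame = np.asarray(frame, dtype=float)
        energy = np.mean(frame ** 2)
        zero_crossings = np.count_nonzero(np.diff(np.sign(frame)))
        return energy > energy_thresh and zero_crossings < zc_max

Note that unvoiced fricatives also show high zero-crossing counts, which is one reason real VADs rely on the autocorrelation-based parameters mentioned above rather than on such simple rules.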
Once the VAD has decided that speech is present, determination of the
voicing (U/V) must be made, and assuming the speech is voiced the next
step will be pitch determination. Pitch tracking and voicing determination
will be treated in Section 19.5.
The finding of the filter coefficients is based on the principles of Sec-
tion 9.9, but there are a few details we need to fill in. We know how to find
LPC coefficients when there is no excitation, but here there is excitation.
For voiced speech this excitation is nonzero only during the glottal pulse,
and one strategy is to ignore it and live with the spikes of error. These spikes
reinforce the pitch information and may be of no consequence in speech com-
pression systems. In pitch synchronous systems we first identify the pitch
pulse locations, and correctly evaluate the LPC coefficients for blocks start-
ing with a pulse and ending before the next pulse. A more modern approach
is to perform two separate LPC analyses. The one we have been discussing
up to now, which models the vocal tract, is now called the short-term predic-
tor. The new one, called the long-term predictor, estimates the pitch period
and structure. It typically only has a few coefficients, but is updated at a
higher rate.
There is one final parameter we have neglected until now, the gain G.
Of course if we assume the excitation to be zero our formalism cannot be
expected to supply G. However, since G simply controls the overall volume, it
carries little information and its adjustment is not critical. In speech coding
it is typically set by requiring the energy of the predicted signal to equal the
energy in the original signal.
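For reference, here is one standard way of carrying out this analysis, the autocorrelation method with the Levinson-Durbin recursion, sketched in Python with numpy. Whether this matches the exact formulation of Section 9.9 we leave to the reader; the gain convention used here (G set so that a unit-energy excitation reproduces the residual energy of the frame) is one common realization of the energy-matching rule just described.

    import numpy as np

    def lpc_analysis(frame, order):
        """LPC coefficients b_1..b_M and gain G for one frame (autocorrelation method)."""
        x = frame * np.hamming(len(frame))
        r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])

        b = np.zeros(order)                    # predictor coefficients
        err = r[0]                             # prediction error energy
        for i in range(order):                 # Levinson-Durbin recursion
            if err <= 0.0:                     # silent frame: nothing to predict
                break
            k = (r[i + 1] - np.dot(b[:i], r[1:i + 1][::-1])) / err
            b_new = b.copy()
            b_new[i] = k
            b_new[:i] = b[:i] - k * b[:i][::-1]
            b = b_new
            err *= (1.0 - k * k)

        gain = np.sqrt(max(err, 0.0))          # one common convention for G
        return b, gain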
EXERCISES
19.2.1 Multipulse LPC uses an excitation with several pulses per pitch period. Ex-
plain how this can improve LPC quality.
19.2.2 Mixed Excitation Linear Prediction (MELP) does not switch between periodic
and noise excitation, but rather uses an additive combination of the two. Why
can this produce better quality speech than LPC?
19.2.3 Record some speech and display its sonogram. Compute the LPC spectrum
and find its major peaks. Overlay the peaks onto the sonogram. Can you
recognize the formants? What about the pitch?
19.2.4 Synthesize some LPC data using a certain number of LPC coefficients and
try to analyze it using a different number of coefficients. What happens? How
does the reconstruction SNR depend on the order mismatch?
19.3 Cepstrum
The LPC model is not the only framework for describing speech. Although
it is currently the basis for much of speech compression, cepstral coefficients
have proven to be superior for speech recognition and speaker identification.
The first time you hear the word cepstrum you are convinced that the
word was supposed to be spectrum and laugh at the speaker’s spoonerism.
However, there really is something pronounced ‘cepstrum’ instead of ‘spec-
trum’, as well as a ‘quefrency’ replacing ‘frequency’, and ‘liftering’ displacing
‘filtering’. Several other purposefully distorted words have been suggested
(e.g., ‘alanysis’ and ‘saphe’) but have not become as popular.
To motivate the use of cepstrum in speech analysis, recall that voiced
speech can be viewed as a periodic excitation signal passed through an all-
pole filter. The excitation signal in the frequency domain is rich in harmonics,
and can be modeled as a train of equally spaced discrete lines, separated by
the pitch frequency. The amplitudes of these lines decrease rapidly with in-
creasing frequency, with between 5 and 12 dB drop per octave being typical.
The effect of the vocal tract filtering is to multiply this line spectrum by a
window that has several pronounced peaks corresponding to the formants.
Now if the spectrum is the product of the pitch train and the vocal tract
window, then the logarithm of this spectrum is the sum of the logarithm of
the pitch train and the logarithm of the vocal tract window. This logarithmic
spectrum can be considered to be the spectrum of some new signal, and since
the FT is a linear operation, this new signal is the sum of two signals, one
deriving from the pitch train and one from the vocal tract filter. This new
signal, derived by logarithmically compressing the spectrum, is called the
cepstrum of the original signal. It is actually a signal in the time domain,
but since it is derived by distorting the frequency components its axis is
referred to as quefrency. Remember, however, that the units of quefrency
are seconds (or perhaps they should be called ‘cesonds’).
We see that the cepstrum decouples the excitation signal from the vocal
tract filter, changing a convolution into a sum. It can achieve this decou-
pling not only for speech but for any excitation signal and filter, and is thus
a general tool for deconvolution. It has therefore been applied to various
other fields in DSP, where it is sometimes referred to as homomorphic de-
convolution. This term originates in the idea that although the cepstrum is
not a linear transform of the signal (the cepstrum of a sum is not the sum
of the cepstra), it is a generalization of the idea of a linear transform (the
cepstrum of the convolution is the sum of the cepstra). Such parallels are
called ‘homomorphisms’ in algebra.
The logarithmic spectrum of the excitation signal is an equally spaced
train, but the logarithmic amplitudes are much less pronounced and decrease
slowly and linearly while the lines themselves are much broader. Indeed
the logarithmic spectrum of the excitation looks much more like a sinusoid
than a train of impulses. Thus the pitch contribution is basically a line
at a well defined quefrency corresponding to the basic pitch frequency. At
lower quefrencies we find structure corresponding to the higher frequency
formants, and in many cases high-pass liftering can thus furnish both a
voiced/unvoiced indication and a pitch frequency estimate.
Up to now our discussion has been purposefully vague, mainly because
the cepstrum comes in several different flavors. One type is based on the
z transform S(z), which, being complex valued, is composed of its absolute
value R(z) and its angle θ(z). Now let's take the complex logarithm of S(z)
(equation (A.14)) and call the resulting function Ŝ(z),

Ŝ(z) = log S(z) = log R(z) + iθ(z)

We assumed here the minimal phase value, although for some applications
it may be more useful to unwrap the phase. Now Ŝ(z) can be considered to
be the zT of some signal ŝ_n, this signal being the complex cepstrum of s_n.
To find the complex cepstrum in practice requires computation of the izT,
a computationally arduous task; however, given the complex cepstrum the
original signal may be recovered via the zT.
The power cepstrum, or real cepstrum, is defined as the signal whose PSD
is the logarithm of the PSD of s_n. The power cepstrum can be obtained as
an iFT, or for digital signals an inverse DFT,

s̃_n = (1/2π) ∫_{−π}^{π} log|S(ω)| e^{iωn} dω

and is related to the complex cepstrum by

s̃_n = ½ (ŝ_n + ŝ*_{−n})

Although easier to compute, the power cepstrum doesn't take the phase of
S(ω) into account, and hence does not enable unique recovery of the original
signal.
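In practice the power cepstrum of a frame is computed with the DFT rather than with the integral above. The sketch below does exactly that and then, following the observations of the previous paragraphs, looks for a pitch peak at high quefrency; the 50 to 400 Hz search band and the peak-strength test are our own arbitrary choices.

    import numpy as np

    def power_cepstrum(frame):
        """Power (real) cepstrum: inverse DFT of the log magnitude spectrum."""
        spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
        return np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))   # floor avoids log(0)

    def cepstral_pitch(frame, fs=8000, fmin=50.0, fmax=400.0):
        """Pitch estimate in Hz from the cepstral peak; 0.0 means 'looks unvoiced'."""
        c = power_cepstrum(frame)
        qmin, qmax = int(fs / fmax), int(fs / fmin)        # quefrency range (samples)
        peak = qmin + int(np.argmax(c[qmin:qmax]))
        if c[peak] > 4.0 * np.std(c[qmin:qmax]):           # ad hoc voicing test
            return fs / peak
        return 0.0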
There is another variant of importance, called the LPC cepstrum. The
LPC cepstrum, like the reflection coefficients, area ratios, and LSP coeffi-
cients, is a set of coefficients ck that contains exactly the same information
as the LPC coefficients. The LPC cepstral coefficients are defined as the
coefficients of the zT expansion of the logarithm of the all-pole system func-
tion. From the definition of the LPC coefficients in equation (9.21), we see
that this can be expressed as follows:
log[ G / (1 − Σ_{m=1}^{M} b_m z^{−m}) ] = Σ_k c_k z^{−k}                    (19.1)
Given the LPC coefficients, the LPC cepstral coefficients can be computed
by a recursion that can be derived by series expansion of the left-hand side
(using equations (A.47) and (A.15)) and equating like terms.
c_0 = log G
c_1 = b_1                                                                   (19.2)
c_k = b_k + (1/k) Σ_{m=1}^{k−1} m c_m b_{k−m}
This recursion can even be used for coefficients c_k for which k > M by taking
b_k = 0 for such k. Of course, the recursion only works when the original LPC
model was stable.
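The recursion of equation (19.2) is a few lines of code; this is a direct transcription, with the number of cepstral coefficients to produce left as a parameter (for k > M the b_k are taken as zero, as noted above).

    import numpy as np

    def lpc_to_cepstrum(b, gain, n_ceps):
        """LPC cepstral coefficients c_0..c_{n_ceps-1} from LPC coefficients b_1..b_M.

        Implements the recursion of equation (19.2); assumes gain > 0.
        """
        M = len(b)
        c = np.zeros(n_ceps)
        c[0] = np.log(gain)
        for k in range(1, n_ceps):
            bk = b[k - 1] if k <= M else 0.0
            c[k] = bk + sum(m * c[m] * b[k - m - 1]
                            for m in range(1, k) if k - m <= M) / k
        return c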
LPC cepstral coefficients derived from this recursion only represent the
true cepstrum when the signal is exactly described by an LPC model. For
real speech the LPC model is only an approximation, and hence the LPC
cepstrum deviates from the true cepstrum. In particular, for phonemes that
are not well represented by the LPC model (e.g., sounds like f, s, and sh that
are produced at the lips with the vocal tract trapping energy and creating
zeros), the LPC cepstrum bears little relationship to its namesakes. Nonethe-
less, numerous comparisons have shown the LPC cepstral coefficients to be
among the best features for both speech and speaker recognition.
If the LPC cepstral coefficients contain precisely the same information
as the LPC coefficients, how can it be that one set is superior to the other?
The difference has to do with the other mechanisms used in a recognition
system. It turns out that Euclidean distance in the space of LPC cepstral
coefficients correlates well with the Itakura-Saito distance, a measure of how
close sounds actually sound. This relationship means that the interpretation
of closeness in LPC cepstrum space is similar to that used by our own hearing
system, a fact that aids the pattern recognition machinery.
EXERCISES
19.3.1 The signal x(t) is corrupted by a single echo to become y(t) = x(t) + a x(t − τ).
Show that the log power spectrum of y is approximately that of x with an
additional ripple. Find the parameters of this ripple.
19.3.2 Complete the proof of equation (19.2).
19.3.3 The reconstruction of a signal from its power cepstrum is not unique. When
is it correct?
19.3.4 Record some speech and plot its power cepstrum. Are the pitch and formants
easily separable?
19.3.5 Write a program to compute the LPC cepstrum. Produce artificial speech
from an exact LPC model and compute its LPC cepstrum.
are also used. It is obvious that all of these are spectral descriptions. The
extensive use of these parameters is a strong indication of our belief that
the information in speech is stored in its spectrum, more specifically in the
position of the formants.
We can test this premise by filtering some speech in such a way as to con-
siderably whiten its spectrum for some sound or sounds. For example, we can
create an inverse filter to the spectrum of a common vowel, such as the e in
the word ‘feet’. The spectrum will be completely flat when this vowel sound
is spoken, and will be considerably distorted during other vowel sounds. Yet
this ‘inverse-E’ filtered speech turns out to be perfectly intelligible. Of course
a speech recognition device based on one of the aforementioned parameter
sets will utterly fail.
So where is the information if not in the spectrum? A well-known fact
regarding our senses is that they respond mainly to change and not to steady-
state phenomena. Strong odors become unnoticeable after a short while, our
eyes twitch in order to keep objects moving on our retina (animals without
the eye twitch only see moving objects) and even a relatively loud stationary
background noise seems to fade away. Although our speech generation sys-
tem is efficient at creating formants, our hearing system is mainly sensitive
to changes in these formants.
One way this effect can be taken into account in speech recognition
systems is to use derivative coefficients. For example, in addition to using
LPC cepstral coefficients as features, some systems use the so-called delta
cepstral coefficients, which capture the time variation of the cepstral coeffi-
cients. Some researchers have suggested using the delta-delta coefficients as
well, in order to capture second derivative effects.
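One common way of computing delta coefficients, assumed in the sketch below, is as the slope of a short least-squares regression over neighboring frames rather than a simple first difference; the two-frame window is a typical choice, not a requirement. Applying the same function to its own output yields the delta-delta coefficients.

    import numpy as np

    def delta_features(feats, width=2):
        """Delta (time-derivative) coefficients of a (n_frames, n_coeffs) feature matrix."""
        n_frames = feats.shape[0]
        denom = 2.0 * sum(k * k for k in range(1, width + 1))
        padded = np.pad(feats, ((width, width), (0, 0)), mode='edge')
        deltas = np.zeros_like(feats, dtype=float)
        for t in range(n_frames):
            deltas[t] = sum(k * (padded[t + width + k] - padded[t + width - k])
                            for k in range(1, width + 1)) / denom
        return deltas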
An alternative to this empirical addition of time-variant information is to
use a set of parameters specifically built to emphasize the signal’s time varia-
tion. One such set of parameters is called RASTA-PLP (Relative Spectra-
Perceptual Linear Prediction). The basic PLP technique modifies the short
time spectrum by several psychophysically motivated transformations, in-
cluding resampling the spectrum into Bark segments, taking the logarithm
of the spectral amplitude and weighting the spectrum by a simulation of the
psychophysical equal-loudness curve, before fitting to an all-pole model. The
RASTA technique suppresses steady state behavior by band-pass filtering
each frequency channel, in this way removing DC and slowly varying terms.
It has been found that RASTA parameters are less sensitive to artifacts;
for example, LPC-based speech recognition systems trained on microphone-
quality speech do not work well when presented with telephone speech. The
performance of a RASTA-based system degrades much less.
changing from place to place. Yet as long as the paper is not crumpled into
a three-dimensional ball, its local dimensionality remains two. Performing
such experiments on vowel sounds has led several researchers to conclude
that three to five local features are sufficient to describe speech.
Of course this demonstration is not constructive and leaves us totally
in the dark as to how to find such a small set of features. Attempts are
being made to search for these features using learning algorithms and neural
networks, but it is too early to hazard a guess as to success and possible
impact of this line of inquiry.
EXERCISES
19.4.1 Speech has an overall spectral tilt of 5 to 12 dB per octave. Remove this tilt
(a pre-emphasis filter of the form 1 − 0.99z^{-1} is often used) and listen to the
speech. Is the speech intelligible? Does it sound natural?
19.4.2 If speech information really lies in the changes, why don’t we differentiate
the signal and then perform the analysis?
times during which the signal is stationary would provide unacceptably large
uncertainties in the pitch determination. Hoarse and high-pitched voices are
particularly difficult in this regard.
All this said, there are many pitch tracking algorithms available. One
major class of algorithms is based on finding peaks in the empirical autocor-
relation. A typical algorithm from this class starts by low-pass filtering the
speech signal to eliminate frequency components above 800 or 900 Hz. The
pitch should correspond to a peak in the autocorrelation of this signal, but
there are still many peaks from which to choose. Choosing the largest peak
sometimes works, but may result in a multiple of the pitch or in a formant
frequency. Instead of immediately computing the autocorrelation we first
center clip (see equation (8.7)) the signal, a process that tends to flatten out
vocal tract autocorrelation peaks. The idea is that the formant periodicity
should be riding on that of the pitch, even if its consistency results in a larger
spectral peak. Accordingly, after center clipping we expect only pitch-related
phenomena to remain. Of course the exact threshold for the center clipping
must be properly set for this preprocessing to work, and various schemes
have been developed. Most schemes first determine the highest sample in
the segment and eliminate the middle third of the dynamic range. Now au-
tocorrelation lags that correspond to valid pitch periods are computed. Once
again we might naively expect the largest peak to correspond to the pitch
period, but if filtering of the original signal removed or attenuated the pitch
frequency this may not be the case. A better strategy is to look for con-
sistency in the observed autocorrelation peaks, choosing a period that has
the most energy in the peak and its multiples. This technique tends to work
even for noisy speech, but requires postprocessing to correct random errors
in isolated segments.
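A bare-bones version of this autocorrelation tracker is sketched below. The moving average stands in for a proper 800 to 900 Hz low-pass filter, the clipping threshold of one third of the peak follows the scheme described above, and a single largest-peak test replaces the consistency check across multiples (and the postprocessing), so this is a starting point rather than a robust tracker.

    import numpy as np

    def autocorr_pitch(frame, fs=8000, fmin=50.0, fmax=400.0):
        """Pitch estimate (Hz) by center-clipped autocorrelation; 0.0 means unvoiced."""
        # crude low-pass filter (placeholder for a proper 800-900 Hz design)
        lp = np.convolve(np.asarray(frame, dtype=float), np.ones(8) / 8.0, mode='same')

        # center clip: discard the middle third of the dynamic range
        thresh = np.max(np.abs(lp)) / 3.0
        clipped = np.where(np.abs(lp) > thresh, lp - np.sign(lp) * thresh, 0.0)

        # autocorrelation only over lags corresponding to valid pitch periods
        lags = np.arange(int(fs / fmax), int(fs / fmin))
        r = np.array([np.dot(clipped[:-m], clipped[m:]) for m in lags])
        r0 = np.dot(clipped, clipped)
        if r0 <= 0.0:
            return 0.0
        best = int(np.argmax(r))
        return fs / lags[best] if r[best] > 0.3 * r0 else 0.0    # 0.3 is ad hoc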
A variant of the autocorrelation class computes the Average Magnitude
Difference Function
AMDF(m) = Σ_n |x_n − x_{n+m}|

which has pronounced minima (rather than peaks) at lags corresponding to the pitch
period and its multiples.
EXERCISES
19.5.1 In order to minimize time spent in computation of autocorrelation lags, one
can replace the center clipping operation with a three-level slicing operation
that only outputs -1, 0 or +1. How does this decrease complexity? Does this
operation strongly affect the performance of the algorithm?
19.5.2 Create a signal that is the weighted sum of a few sinusoids interrupted every
now and then by short durations of white noise. You can probably easily
separate the two signal types by eye in either time or frequency domains.
Now do the same using any of the methods discussed above, or any algorithm
of your own devising.
19.5.3 Repeat the previous exercise with additive noise on the sinusoids and narrow
band noise instead of white noise. How much noise can your algorithm toler-
ate? How narrow-band can the ‘unvoiced’ sections be and still be identifiable?
Can you do better ‘by eye’ than your algorithm?
without error. Extending techniques that work on general bit streams to the
lossy regime is fruitless. It does not really make sense to view the speech
signal as a stream of bits and to minimize the number of bit errors in the
reconstructed stream. This is because some bits are more significant than
others-an error in the least significant bit is of much less effect than an
error in a sign bit!
It is less obvious that it is also not optimal to view the speech signal
as a stream of sample values and compress it in such a fashion as to mini-
mize the energy of error signal (reconstructed signal minus original signal).
This is because two completely different signals may sound the same since
hearing involves complex physiological and psychophysical processes (see
Section 11.4).
For example, by delaying the speech signal by two samples, we create a
new signal completely indistinguishable to the ear but with a large ‘error
signal’. The ear is insensitive to absolute time and thus would not be able
to differentiate between these two ‘different’ signals. Of course simple cross
correlation would home-in on the proper delay and once corrected the error
would be zero again. But consider delaying the digital signal by half a sample
(using an appropriate interpolation technique), producing a signal with com-
pletely distinct sample values. Once again a knowledgeable signal processor
would be able to discover this subterfuge and return a very small error. Sim-
ilarly, the ear is insensitive to small changes in loudness and absolute phase.
However, the ear is also insensitive to more exotic transformations such as
small changes in pitch, formant location, and nonlinear warping of the time
axis.
Reversing our point-of-view we can say that speech-specific compression
techniques work well for two related reasons. First, speech compression tech-
niques are lossy (i.e., they strive to reproduce a signal that is similar but not
necessarily identical to the original); significantly lower information rates can
be achieved by introducing tolerable amounts of distortion. Second, once we
have abandoned the ideal of precise reconstruction of the original signal, we
can go a step further. The reconstructed signal needn’t really be similar to
the original (e.g., have minimal mean square error); it should merely sound
similar. Since the ear is insensitive to small changes in phase, timing, and
pitch, much of the information in the original signal is unimportant and
needn’t be encoded at all.
It was once common to differentiate between two types of speech coders.
‘Waveform coders’ exploit characteristics of the speech signal (e.g., energy
concentration at low frequencies) to encode the speech samples in fewer bits
than would be required for a completely random signal. The encoding is a
toll quality) is ranked 4.0. To the uninitiated telephone speech may seem
almost the same as high-quality speech; however, this is in large part due
to the brain compensating for the degradation in quality. In fact different
phonemes may become acoustically indistinguishable after the band-pass
filtering to 4 KHz (e.g. s and f), but this fact often goes unnoticed, just as
the ‘blind spots’ in our eyes do. MOS ratings from 3.5 to 4 are sometimes
called ‘communications quality’, and although lower than toll quality are
acceptable for many applications.
Usually MOS tests are performed along with calibration runs of known
MOS, but there still are consistent discrepancies between the various labo-
ratories that perform these measurements. The effort and expense required
to obtain an MOS rating for a coder are so great that objective tests that
correlate well with empirical MOS ratings have been developed. Perceptual
Speech Quality Measure (PSQM) and Perceptual Evaluation of Speech
Quality (PESQ) are two such measures that have been standardized by the ITU.
EXERCISES
19.6.1 Why can’t general-purpose data compression techniques be lossy?
19.6.2 Assume a language with 64 different phonemes that can be spoken at the
rate of eight phonemes per second. What is the minimal bit rate required?
19.6.3 Try to compress a speech file with a general-purpose lossless data (file) com-
pression program. What compression ratio do you get?
19.6.4 Several lossy speech compression algorithms are readily available or in the
public domain (e.g., LPC-10e, CELP, GSM full-rate). Compress a file of
speech using one or more of these compressions. Now listen to the ‘before’ and
‘after’ files. Can you tell which is which? What artifacts are most noticeable
in the compressed file? What happens when you compress a file that had
been decompressed from a previous compression?
19.6.5 What happens when the input to a speech compression algorithm is not
speech? Try single tones or DTMF tones. Try music. What about ‘babble
noise’ (multiple background voices)?
19.6.6 Corrupt a file of linear 16-bit speech by randomly flipping a small percentage
of the bits. What percentage is not noticed? What percentage is acceptable?
Repeat the experiment by corrupting a file of compressed speech. What can
you conclude about media for transmitting compressed speech?
19.7 PCM
In order to record and/or process speech digitally one needs first to acquire
it by an A/D. The digital signal obtained in this fashion is usually called
‘linear PCM’ (recall the definition of PCM from Section 2.7). Speech con-
tains significant frequency components up to about 20 KHz, and Nyquist
would thus require a 40 KHz or higher sampling rate. From experimentation
at that rate with various numbers of sample levels one can easily become
convinced that using less than 12 to 14 bits per sample noticeably degrades
the signal. Eight bits definitely delivers inferior quality, and since conven-
tional hardware works in multiples of 8-bit bytes, we usually digitize speech
using 16 bits per sample. Hence the simplistic approach to capturing speech
digitally would be to sample at 40 KHz using 16 bits per sample for a total
information rate of 640 Kb/s. Assuming a properly designed microphone,
speaker, A/D, D/A, and filters, 640 Kb/s digital speech is indeed close to
being indistinguishable from the original.
Our first step in reducing this bit rate is to sacrifice bandwidth by low-
pass filtering the speech to 4 KHz, the bandwidth of a telephone channel.
Although 4 KHz is not high fidelity it is sufficient to carry highly intelligible
speech. At 4 KHz the Nyquist sampling rate is reduced to 8000 samples per
second, or 128 Kb/s.
From now on we will use more and more specific features of the speech
signal to further reduce the information rate. The first step exploits the
psychophysical laws of Weber and Fechner (see Section 11.2). We stated
above that 8 bits were not sufficient for proper digitizing of speech. What we
really meant is that 256 equally spaced quantization levels produce speech
of low perceived quality. Our perception of acoustic amplitude is, however,
logarithmic, with small changes at lower amplitudes more consequential than
equal changes at high amplitudes. It is thus sensible to try unevenly spaced
quantization levels, with high density of levels at low amplitudes and much
fewer levels used at high amplitudes. The optimal spacing function will be
logarithmic, as depicted in Figure 19.3 (which replaces Figure 2.25 for this
case). Using logarithmically spaced levels, 8 bits is indeed adequate for toll
quality speech, and since we now use only 8000 eight-bit samples per second,
our new rate is 64 Kb/s, half that of linear PCM. In order for a speech
compression scheme to be used in a communications system the sender and
receiver, who may be using completely different equipment, must agree as
to its details. For this reason precise standards must be established that
ensure that different implementations can interoperate. The ITU has defined
a number of speech compression schemes. The G.711 standard defines two
variants of logarithmic PCM, known as μ-law and A-law.
Although it is hard to see this from the A-law expression, its behavior is very
similar to that of μ-law. By convention we take A = 87.56 and as in the
μ-law case approximate the true form with 16 staircase line segments. It
is interesting that the A-law staircase has a rising segment at the origin
and thus fluctuates for near-zero inputs, while the approximated μ-law has
a horizontal segment at the origin and is thus relatively constant for very
small inputs.
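The two G.711 companding curves are easy to experiment with in their exact (continuous) form; the sketch below uses the standard formulas with μ = 255 and A = 87.56 applied to samples normalized to the range −1 to 1. The 16-segment staircase approximations actually used by the standard are not reproduced here.

    import numpy as np

    def mu_law(x, mu=255.0):
        """Continuous mu-law compression of samples normalized to [-1, 1]."""
        x = np.asarray(x, dtype=float)
        return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

    def a_law(x, A=87.56):
        """Continuous A-law compression of samples normalized to [-1, 1]."""
        x = np.asarray(x, dtype=float)
        ax = np.abs(x)
        y = np.where(ax < 1.0 / A,
                     A * ax / (1.0 + np.log(A)),
                     (1.0 + np.log(A * np.maximum(ax, 1.0 / A))) / (1.0 + np.log(A)))
        return np.sign(x) * y

Plotting the two curves on the same axes, as exercise 19.7.5 asks, shows how similar they are over most of the input range.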
EXERCISES
19.7.1 Even 640 Kb/s does not capture the entire experience of listening to a speaker
in the same room, since lip motion, facial expressions, hand gestures, and
other body language are not recorded. How important is such auxiliary infor-
mation? When do you expect this information to be most relevant? Estimate
the information rates of these other signals.
19.7.2 Explain the general form of the μ and A laws. Start with general logarithmic
compression, extend it to handle negative signal values, and finally force it
to go through the origin.
19.7.3 Test the difference between high-quality and toll-quality speech by perform-
ing a rhyme test. In a rhyme test one person speaks out-of-context words
and a second records what was heard. By using carefully chosen words, such
as lift-list, lore-more-nor, jeep-cheep, etc., you should be able to both esti-
mate the difference in accuracy between the two cases and determine which
phonemes are being confused in the toll-quality case.
19.7.4 What does μ-law (equation (19.3)) return for zero input? For maximal input?
When does y = x? Plot μ-law for 16-bit linear PCM, taking x_max = 2^15 =
32768, for various μ from 1 to 255. What is the qualitative difference between
the small and large μ cases?
19.7.5 Plot the μ-law (with μ = 255) and A-law (with A = 87.56) responses on
the same axes. By how much do they differ? Plot them together with true
logarithmic response. How much error do they introduce? Research and plot
the 16 line segment approximations. How much further error is introduced?
ŝ_n = Σ_{i=1}^{N} p_i s_{n−i}                                    (19.6)
Figure 19.4: Unquantized DPCM. The encoder predicts the next value, finds the pre-
diction error ε_n = s_n − ŝ_n and transmits this error through the communications channel
to the receiver. The receiver, imitating the transmitter, predicts the next value based on
all the values it has recovered so far. It then corrects this prediction based on the error ε_n
received.
we call the predictor a linear predictor. If the predictor works well, the
prediction error
ε_n = s_n − ŝ_n                                    (19.7)
is both of lower energy and much whiter than the original signal s_n. The
error is all we need to transmit for the receiver to be able to reconstruct the
signal, since it too can predict the next signal value based on the past values.
Of course this prediction ŝ_n is not completely accurate, but the correction ε_n
is received, and the original value easily recovered by s_n = ŝ_n + ε_n. The entire
system is depicted in Figure 19.4. We see that the encoder (linear predictor)
is present in the decoder, but that there it runs as feedback, rather than
feedforward as in the encoder itself.
The simplest DPCM system is Delta Modulation (DM). Delta modula-
tion uses only a single bit to encode the error, this bit signifying whether the
true value is above or below the predicted one. If the sampling frequency is
so much higher than required that the previous value ~~-1 itself is a good
predictor of sn, delta modulation becomes the sigma-delta converter of Sec-
tion 2.11. In a more general setting a nontrivial predictor is used, but we
still encode only the sign of the prediction error. Since delta modulation
provides no option to encode zero prediction error the decoded signal tends
to oscillate up and down where the original was relatively constant. This
annoyance can be ameliorated by the use of a post-filter, which low-pass filters
the reconstructed signal.
There is a fundamental problem with the DPCM encoders we have just
described. We assumed that the true value of the prediction error ε_n is
transferred over the channel, while in fact we can only transfer a quantized
version ε_n^Q. The very reason we perform the prediction is to save bits after
quantization. Unfortunately, this quantization may have a devastating effect
on the decoder. The problem is not just that the correction of the present
prediction is not completely accurate; the real problem is that because of
this inaccuracy the receiver never has reliable s_n with which to continue
predicting the next samples. To see this, define s_n^Q as the decoder's predicted
value corrected by the quantized error. In general, s_n^Q does not quite equal
s_n, but we predict the next sample values based on these incorrect corrected
predictions! Due to the feedback nature of the decoder’s predictor the errors
start piling up and after a short time the encoder and decoder become ‘out
of sync’.
The prediction we have been using is known as open-loop prediction,
by which we mean that we perform linear prediction of the input speech.
In order to ensure that the encoder and decoder predictors stay in sync, we
really should perform linear prediction on the speech as reconstructed by the
decoder. Unfortunately, the decoder output is not available at the encoder,
and so we need to calculate it. To perform closed-loop prediction we build an
exact copy of the entire decoder into our encoder, and use its output, rather
than the input speech, as input to the predictor. This process is diagrammed
in Figure 19.5. By ‘closing the loop’ in this fashion, the decoded speech is
precisely that expected, unless the channel introduces bit errors.
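The essence of Figure 19.5 fits in a few lines: quantize the prediction error, reconstruct exactly as the decoder will, and feed the predictor from those reconstructed samples. In the sketch below the fixed first-order predictor and the uniform quantizer step are placeholders for the blocks marked PF and Q; a real ADPCM coder makes both adaptive.

    import numpy as np

    def dpcm_encode_closed_loop(signal, step=0.05, a=0.9):
        """Closed-loop DPCM with a fixed first-order predictor s_hat[n] = a * s_rec[n-1].

        Returns the quantizer indices (what would be transmitted) together with the
        reconstructed signal, which is identical to what the decoder will produce
        as long as the channel introduces no bit errors.
        """
        indices = np.zeros(len(signal), dtype=int)
        recon = np.zeros(len(signal))
        prev = 0.0                              # last *reconstructed* sample
        for n, s in enumerate(signal):
            pred = a * prev                     # predict from reconstructed history
            q = int(round((s - pred) / step))   # quantize the prediction error
            indices[n] = q
            recon[n] = pred + q * step          # exactly what the decoder computes
            prev = recon[n]
        return indices, recon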
The international standard for 32 Kb/s toll quality digital speech is based
on Adaptive Delta-PCM (ADPCM). The ‘adaptive’ is best explained by
returning to the simple case of delta modulation. We saw above that the
DM encoder compares the speech signal value with the predicted (or simply
previous) value and reports whether this prediction is too high or too low.
How does a DM decoder work? For each input bit it takes its present esti-
mate for the speech signal value and either adds or subtracts some step size
Δ. Assuming Δ is properly chosen this strategy works well for some range
of input signal frequencies; but as seen in Figure 19.6 using a single step
Figure 19.5: Closed-loop prediction. In this figure, Q stands for quantizer, IQ inverse
quantizer, PF prediction filter. Note that the encoder contains an exact replica of the
decoder and predicts the next value based on the reconstructed speech.
Figure 19.6: The two types of errors in nonadaptive delta modulation. We superpose
the reconstructed signal on the original. If the step size is too small the reconstructed
signal can’t keep up in areas of large slope and may even completely miss peaks (as in
the higher-frequency area at the beginning of the figure). If the step size is too large the
reconstructed signal will oscillate wildly in areas where the signal is relatively constant
(as seen at the peaks of the lower-frequency area toward the end of the figure).
size cannot satisfy all frequencies. If Δ is too small the reconstructed signal
cannot keep up when the signal changes rapidly in one direction and may
even completely miss peaks (as in the higher-frequency area at the begin-
ning of the figure), a phenomenon called ‘slope overload’. If Δ is too large
the reconstructed signal will oscillate wildly when the signal is relatively
constant (as seen at the peaks of the lower-frequency area toward the end
of the figure), which is known as ‘granular noise’.
While we described the errors caused by improper step size in the context of
DM, the same phenomena occur for general DPCM. In fact the problem
is even worse. For DM the step size Δ is only used at the decoder, since
the encoder only needs to check the sign of the difference between the signal
value and its prediction. For general delta-PCM the step size is needed at the
encoder as well, since the difference must be quantized using levels spaced
Δ apart. Improper setting of the spacing between the quantization levels
causes mismatch between the digitizer and the difference signal’s dynamic
range, leading to improper quantization (see Section 2.9).
The solution is to adapt the step size to match the signal’s behavior.
In order to minimize the error we increase Δ when the signal is rapidly
increasing or decreasing, and we decrease it when the signal is more constant.
A simplistic way to implement this idea for DM is to use the bit stream itself
to determine whether the step size is too small or too large. A commonly
used version uses memory of the previous delta bit; if the present bit is
the same as the previous we multiply Δ by some constant K (K = 1.5 is
a common choice), while if the bits differ we divide by K. In addition we
constrain Δ to remain within some prespecified range, and so stop adapting
when it reaches its minimum or maximum value.
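In code the K-multiplier rule amounts to the following; K = 1.5 follows the common choice mentioned above, while the initial step size and its limits are arbitrary. Note that the adaptation depends only on the transmitted bits, so a decoder can track Δ without any side information.

    import numpy as np

    def adaptive_dm_encode(signal, step0=0.01, K=1.5, step_min=1e-4, step_max=1.0):
        """Adaptive delta modulation: one bit per sample plus the K-multiplier step rule."""
        bits = np.zeros(len(signal), dtype=int)
        recon = np.zeros(len(signal))
        est, step, prev_bit = 0.0, step0, 1
        for n, s in enumerate(signal):
            bit = 1 if s >= est else 0                     # sign of the prediction error
            step = step * K if bit == prev_bit else step / K
            step = min(max(step, step_min), step_max)      # keep the step size in range
            est += step if bit else -step
            bits[n], recon[n], prev_bit = bit, est, bit
        return bits, recon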
While efficient computationally, the above method for adapting Δ is
completely heuristic. A more general tactic is to set the step size for adap-
tive DPCM to be a given percentage of the signal's standard deviation. In
this way Δ would be small for signals that do not vary much, minimizing gran-
ular noise, but large for wildly varying signals, minimizing slope overload.
Were speech stationary over long times adaptation would not be needed,
but since the statistics of the speech signal vary widely as the phonemes
change, we need to continuously update our estimate of its variance. This
can be accomplished by collecting N samples of the input speech signal in
a buffer, computing the standard deviation, setting Δ accordingly, and only
then performing the quantization. N needs to be long enough for the vari-
ance computation to be accurate, but not so long that the signal statistics
vary appreciably over the buffer. Values of 128 (corresponding to 16 mil-
liseconds of speech at 8000 Hz) through 512 (64 milliseconds) are commonly
used.
There are two drawbacks to this method of adaptively setting the scale of
the quantizer. First, the collecting of N samples before quantization requires
introducing buffer delay; in order to avoid excessive delay we can use an
IIR filter to track the variance instead of computing it in a buffer. Second,
the decoder needs to know Δ, and so it must be sent as side information,
increasing the amount of data transferred. The overhead can be avoided by
having the decoder derive Δ, but if Δ is derived from the input signal, this
is not possible. The decoder could try to use the reconstructed speech to
find Δ, but this would not exactly match the quantization step used by the
encoder. After a while the encoder and decoder would no longer agree and
the system would break down. As you may have guessed, the solution is to
close the loop and have the encoder determine Δ using its internal decoder,
a technique called backward adaptation.
EXERCISES
19.8.1 Obtain a copy of the G.726 ADPCM standard and study the main block
diagrams for the encoder and decoder. Explain the function and connections
of the adaptive predictor, adaptive quantizer, and inverse adaptive quantizer.
Why is the standard so detailed?
19.8.2 Now study the expanded block diagram of the encoder. What is the purpose
of the blocks marked ‘adaptation speed control’ and ‘tone and transition
detector’?
19.8.3 How does the MIPS complexity of the G.726 encoder compare with that of
modern lower-rate encoders?
19.8.4 Show that the open-loop prediction results in large error because the quanti-
zation error is multiplied by the prediction gain. Show that with closed-loop
prediction this does not occur.
more important than higher ones; thus we can reduce the average (percep-
tual) error by placing the quantization levels closer together for small signal
values, and further apart for large values.
We will return to the perceptual importance later; for now we assume all
signal values to be equally important and just ask how to combine adaptivity
with nonequidistant quantization thresholds. Our objective is to lower the
average quantization error, and this can be accomplished by placing the
levels closer together where the signal values are more probable.
Rather than adapting quantization thresholds, we can adapt the mid-
points between these thresholds. We call these midpoints ‘centers’, and the
quantization thresholds are now midway between adjacent centers. It is then
obvious that classifying an input as belonging to the nearest ‘center’ is equiv-
alent to quantizing according to these thresholds. The set of all values that
are classified as closest to a given center (i.e., that lie between the two
thresholds) is called its ‘cluster’.
The reason we prefer to set centers is that there is an easily defined
criterion that differentiates between good sets of centers and poor ones,
namely mean square error (MSE). Accordingly, if we have observed N signal
values {x_n}_{n=1}^{N}, we want to place M centers {c_m}_{m=1}^{M} in such a way that we
minimize the mean square quantization error.

E = (1/N) Σ_{n=1}^{N} |x_n − c_n|²

We have used here the shorthand notation c_n to mean the center closest
to x_n.
Figure 19.7: Quantization thresholds found by the scalar quantization algorithm for
uniform and Gaussian distributed data. For both cases 1000 points were generated, and
16 centers found by running the basic scalar quantization algorithm until convergence.
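The center-and-cluster description above is essentially Lloyd's iteration, the scalar special case of the LBG algorithm mentioned in the exercises: assign every sample to its nearest center, move each center to the mean of its cluster, and repeat. A minimal sketch (initialization and stopping criteria are deliberately naive):

    import numpy as np

    def lloyd_centers(samples, n_levels, n_iter=50, rng=None):
        """Scalar Lloyd/LBG iteration: centers that (locally) minimize the MSE."""
        samples = np.asarray(samples, dtype=float)
        rng = np.random.default_rng() if rng is None else rng
        centers = rng.choice(samples, size=n_levels, replace=False)
        for _ in range(n_iter):
            # classify each sample by its nearest center (its cluster)
            idx = np.argmin(np.abs(samples[:, None] - centers[None, :]), axis=1)
            # move each center to the mean of its cluster (empty clusters left alone)
            for m in range(n_levels):
                members = samples[idx == m]
                if len(members):
                    centers[m] = members.mean()
        return np.sort(centers)

The quantization thresholds of Figure 19.7 are then simply the midpoints between adjacent sorted centers.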
EXERCISES
19.9.1 Prove the point closest to all points in a cluster is their average.
19.9.2 Generate bimodal random numbers, i.e., ones with a distribution with two
separated peaks. Determine the error for the best standard quantization. Now
run the LBG algorithm with the same number of levels and check the error
again. How much improvement did you get?
19.9.3 Generate random vectors that are distributed according to a ‘Gaussian mix-
ture’ distribution. This is done as follows. Choose M cluster centers in N-
dimensional space. For each number to be generated randomly select the
cluster, and then add to it Gaussian noise (if the noise has the same variance
for all elements then the clusters will be hyperspherical). Now run the LBG
algorithm. Change the size of the codebook. How does the error decrease
with codebook size?
19.10 SBC
The next factor-of-two can be achieved by noticing that the short time spec-
trum tends to have only a few areas with significant energy. The SubBand
Coding (SBC) technique takes advantage of this feature by dividing the
EXERCISES
19.10.1 Can we always decimate subbands according to their bandwidth? (Hint: Re-
call the ‘band-pass sampling theorem’.)
19.10.2 When dividing into equal-bandwidth bands, in which are more bits typically
needed, those with lower or higher frequencies? Is this consistent with what
happens with logarithmic division?
19.10.3 Will dividing the bandwidth into arbitrary bands adaptively matched to the
signal produce better compression?
times longer than those of the pitch frequency. Our assumption that the
pitch excitation could be modeled as a single pulse per pitch period and
otherwise zero has apparently been pushed beyond its limits. If we remove
the residual pitch period correlations the remaining error seems to be white
noise. Hence, trying to efficiently compress the error signal would seem to
be a useless exercise.
EXERCISES
19.11.1 You can find code for LPC-10e in the public domain. Encode and then decode
some recorded speech. How do you rate the quality? Can you always under-
stand what is being said? Can you identify the speaker? Are some speakers
consistently hard to understand?
19.11.2 In Residual Excited Linear Prediction (RELP) the residual is low-pass fil-
tered to about 1 KHz and then decimated to lower its bit rate. Diagram
the RELP encoder and decoder. For what bit rates do you expect RELP to
function well?
Figure 19.8: ABS CELP encoder using short- and long-term prediction. Only the essen-
tial elements are shown; CB is the codebook, PP the pitch (long-term) predictor, LPC the
short-term predictor, PW the perceptual weighting filter, and EC the error computation.
The input is used directly to find LPC coefficients and estimate the pitch and gain. The
error is then used in ABS fashion to fine tune the pitch and gain, and choose the optimal
codebook entry.
P(z) = 1 − β z^{−D}                                    (19.10)
where D is the pitch period. D may be found open loop, but for high quality
it should be found using analysis by synthesis. For unvoiced segments the
pitch predictor can be bypassed, sending the excitation directly to the LPC
predictor, or it can be retained and its delay set randomly. A rough block
diagram of a complete CELP encoder that uses this scheme is given in
Figure 19.8.
Adaptive codebooks reinforce the pitch period using a different method.
Rather than actually filtering the excitation, we use an effective excitation
composed of two contributions. One is simply the codebook, now called the
fixed codebook. To this is added the contribution of the adaptive codebook,
which is formed from the previous excitation by duplicating it at the pitch
period. This contribution is thus periodic with the pitch period and supplies
the needed pitch-rich input to the LPC synthesis filter.
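In code, building one adaptive-codebook candidate is just a periodic extension of the recent excitation memory at the candidate lag D; in a real CELP coder the lag and its gain would then be chosen by analysis by synthesis together with the fixed-codebook entry. The function below is an illustrative sketch, not taken from any particular standard.

    import numpy as np

    def adaptive_codebook_vector(past_excitation, lag, length):
        """Candidate adaptive-codebook vector: past excitation repeated at pitch lag D.

        Assumes len(past_excitation) >= lag.
        """
        past = np.asarray(past_excitation, dtype=float)
        v = np.zeros(length)
        for n in range(length):
            if n < lag:
                v[n] = past[n - lag]          # copy from the excitation memory
            else:
                v[n] = v[n - lag]             # for lags shorter than the subframe, repeat
        return v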
One last trick used by many CELP encoders is ‘post-filtering’. Just as
for ADPCM, the post-filter is appended after the decoder to improve the
EXERCISES
19.12.1 Explain why replacing LPC coefficient b_m with γ^m b_m, where 0 < γ < 1, is called
bandwidth expansion. Show that 15 Hz expansion is equivalent to γ = 0.994.
19.12.2 The G.723.1 coder when operating at the 5.3 Kb/s rate uses an algebraic
codebook that is specified by 17 bits. The codewords are of length 60 but
have no more than four nonzero elements. These nonzero elements are either
all in even positions or all in odd positions. If in even positions, their indexes
modulo 8 are all either 0, 2, 4, or 6. Thus 1 bit is required to declare whether
even or odd positions are used, the four pulse positions can be encoded using
3 bits each, and their signs using a single bit each. Write a routine that successively
generates all the legal codewords.
19.12.3 Explain how to compute the delay of an ABS CELP coder. Take into account
the buffer, lookahead, and processing delays. What are the total delays for
G.728 (frame 20 samples, backward prediction), G.729 (frame 80 samples,
forward prediction), and G.723.1 (frame 240 samples, forward prediction)?
19.12.4 Obtain a copy of the G.729 standard and study the main block diagram.
Explain the function of each block.
19.12.5 Repeat the previous exercise with the G.723.1 standard. What is the differ-
ence between the two rates? How does G.723.1 differ from G.729?
the PSTN is growing at a rate of about 5% per year, while digital com-
munications use is growing at several hundred percent a year. The amount
of data traffic exceeded that of voice sometime during the year 2000, and
hence voice over data is rapidly becoming the more important of the two
technologies.
The history of telephone-grade speech coding is a story of rate halving.
Our theoretical rate of 128 Kb/s was never used, having been reduced to
64 Kb/s by the use of logarithmic PCM, as defined in ITU standard G.711.
So the first true rate halving resulted in 32 Kb/s and was accomplished
by ADPCM, originally designated G.721. In 1990, the ADPCM variants at rates 40, 32,
24, and 16 Kb/s were merged into a single standard known as G.726. At
the same time G.727 was standardized; this ‘embedded’ ADPCM covers
these same rates, but is designed for use in packetized networks. It has the
advantage that the bits transmitted for the lower rates are subsets of those of
the higher rates; congestion that arises at intermediate nodes can be relieved
by discarding least significant bits without the need for negotiation between
the encoder and decoder.
Under 32 Kb/s the going gets harder. The G.726 standard defines 24 and
16 Kb/s rates as well, but at less than toll-quality. Various SBC coders were
developed for 16 Kb/s, either dividing the frequency range equally and us-
ing adaptive numbers of bits per channel, or using hierarchical wavelet-type
techniques to divide the range logarithmically. Although these techniques
were extremely robust and of relatively high perceived quality for the com-
putational complexity, no SBC system was standardized for telephone-grade
speech. In 1988, a coder, dubbed G.722, was standardized that encoded
wideband audio (7 KHz sampled at 16,000 samples per second, 14 bits per
sample) at 64 Kb/s. This coder divides the bandwidth from DC to 8 KHz
into two halves using QMFs and encodes each with ADPCM.
In the early 1990s, the ITU defined performance criteria for a 16 Kb/s
coder that could replace standard 32 Kb/s ADPCM. Such a coder was re-
quired to be of comparable quality to ADPCM, and with delay of less than
5 milliseconds (preferably less than 2 milliseconds). The coder, selected in
1992 and dubbed G.728, is a CELP with backward prediction, with LPC
order of 50. Such a high LPC order is permissible since with closed-loop
prediction the coefficients need not be transmitted. Its delay is 5 samples
(0.625 ms), but its computational complexity is considerably higher than
ADPCM, on the order of 30 MIPS.
The next breakthrough was the G.729 8 Kb/s CELP coder. This was ac-
cepted simultaneously with another somewhat different CELP-based coder
for 6.4 and 5.4 Kb/s. The latter was named G.723.1 (the notation G.723
having been freed up by the original merging into G.726). Why were two
different coders needed? The G.729 specification was originally intended for
toll-quality wireless applications. G.728 was rejected for this application be-
cause of its rate and high complexity. The frame size for G.729 was set at
10 ms and its lookahead at 5 ms. Due to the wireless channel, robustness
to various types of bit errors was required. The process of carefully evaluat-
ing the various competing technologies took several years. During that time
the urgent need arose for a low-bit-rate coder for videophone applications.
Here toll-quality was not an absolute must, and it was felt by many that
G.729 would not be ready in the allotted time. Thus an alternative selection
process, with more lax testing, was instigated. For this application it was de-
cided that a long 30 millisecond frame was acceptable, that a lower bit rate
was desirable, but that slightly lower quality could be accommodated. In
the end both G.729 and G.723.1 were accepted as standards simultaneously,
and turned out to be of similar complexity.
The G.729 coder was extremely high quality, but also required over 20
MIPS of processing power to run. For some applications, including ‘voice
over modem’, this was considered excessive. A modified coder, called G.729
Annex A, was developed that required about half the complexity, with al-
most negligible MOS reduction. This annex was adopted using the quick
standardization strategy of G.723.1. G.723.1 defined as an annex a standard
VAD and CNG mechanism, and G.729 soon followed suit with a similar
mechanism as its Annex B. More recently, G.729 has defined annexes for
additional bit rates, including a 6.4 Kb/s one.
At this point in time there is considerable overlap (and rivalry) between
the two standards families. G.723.1 is the default coder for the voice over
IP standard H.323, but G.729 is allowed as an option. G.729 is the default
for the ‘frame relay’ standard FRF.11, but G.723.1 is allowed there as an
option. In retrospect it is difficult to see a real need for two different coders
with similar performance.
For even lower bit rates one must decide between MIPS and MOS. On the
low MIPS low MOS front the U.S. Department of Defense initiated an effort
in 1992 to replace LPC-10e with a 2.4 Kb/s encoder with quality similar to
that of the 4.8 Kb/s CELP. After comparing many alternatives, in 1997 a
draft was published based on MELP. The excitation used in this encoder
consists of a pulse train and a uniform-distributed random noise generator
filtered by time-varying FIR filters. MELP’s quality is higher than that of
straight LPC-10 because it addresses the latter’s main weaknesses, namely
voicing determination errors and not treating partially-voiced speech.
For higher MOS but with significantly higher MIPS requirements there
EXERCISES
19.13.1 Cellular telephony networks use a different set of coders, including RPE-LTP
(GSM) and VSELP (IS-54). What are the principles behind these coders and
what are their parameters?
Bibliographical Notes
There is a plethora of books devoted to speech signal processing. The old standard
references include [210, 211], and of the newer generation we mention [66]. A rel-
atively up-to-date book on speech recognition is [204] while [176] is an interesting
text that emphasizes neural network techniques for speech recognition.
The first artificial speech synthesis device was created by Wolfgang von Kem-
pelen in 1791. The device had a bellows that supplied air to a reed, and a manually
manipulated resonance chamber. Unfortunately, the machine was not taken seri-
ously after von Kempelen’s earlier invention of a chess-playing machine had been
exposed as concealing a midget chess expert. In modern times Homer Dudley from
Bell Labs [55] was an early researcher in the field of speech production mechanisms.
Expanding on the work of Alexander Graham Bell, he analyzed the human speech
production in analogy to electronic communications systems, and built the VODER
(Voice Operation DEmonstratoR), an analog synthesizer that was demonstrated
at the San Francisco and New York World’s Fairs. An early digital vocoder is de-
scribed in [80]. In the 1980s, Dennis Klatt presented a much improved formant
synthesizer [130, 131].
The LPC model was introduced to speech processing by Atal [10] in the U.S.
and Itakura [111] in Japan. Many people were initially exposed to it in the popular
review [155] or in the chapter on LPC in [210]. The power cepstrum was introduced
in [20]; the popular DSP text [186] devotes a chapter to homomorphic processing;
and [37] is worth reading. We didn’t mention that there is a nonrecursive connection
between the LPC and LPC cepstrum coefficients [239].
Distance measures, such as the Itakura-Saito distance, are the subject of [112,
113, 110, 84]. The inverse-E filtering problem and RASTA-PLP are reviewed in
[102, 101]. The sinusoidal representation has an extensive literature; you should
start with [163, 201].
For questions of speech as a dynamical system and its fractal dimension consult
[259, 156, 172, 226]. Unfortunately, there is as yet no reference that specifies the
optimal minimal set of features.
Pitch detectors and U/V decision mechanisms are the subject of [205, 206, 121].
Similar techniques for formant tracking are to be found in [164, 230].
Once, the standard text on coding was [116], but the field has advanced tremen-
dously since then. Vector quantization is covered in a review article [85] and a text
[69], while the LBG algorithm was introduced in [149].
Postfiltering is best learnt from [35]. The old standard coders are reviewed in [23]
while the recent ones are described in [47]. For specific techniques and standards,
LPC and LPC-10: [9, 261, 121]; MELP: [170]; basic CELP: [11]; federal standard
1016: [122]; G.729 and its annexes: [231, 228, 229, 15]; G.728: [34]; G.723.1: no
comprehensive articles; waveform interpolation: [132].