Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Yannis Stylianou, Marcos Faundez-Zanuy, Anna Esposito (Eds.)
Progress in Nonlinear Speech Processing
Volume Editors
Yannis Stylianou
University of Crete
Computer Science Department
Heraklion, Crete, Greece, 71409
E-mail: [email protected]
Marcos Faundez-Zanuy
Escola Universitària Politècnica de Mataró
Barcelona, Spain
E-mail: [email protected]
Anna Esposito
Seconda Università di Napoli
Dipartimento di Psicologia
Via Vivaldi 43, 81100 Caserta, Italy
E-mail: [email protected]
ISSN 0302-9743
ISBN-10 3-540-71503-7 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-71503-0 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2007
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 12038923 06/3142 543210
Preface
The last meeting of the Management Committee of the COST Action 277:
“Nonlinear Speech Processing” was held in Heraklion, Crete, Greece, September
20–23, 2005 during the Workshop on Nonlinear Speech Processing (WNSP).
This was the last event of COST Action 277. The Action started in 2001. Dur-
ing the workshop, members of the Management Committee and invited speak-
ers presented overviews of their work during these four years (2001–2005) of
research combining linear and nonlinear approaches for processing the speech
signal. In this book, 13 contributions summarize part of this (mainly) Euro-
pean effort in this field. The aim of this book is to offer researchers working in
the domain an additional and/or alternative approach to the traditional linear
speech processing framework. For all the chapters presented
here, except Chaps. 4, 5, and 12, there is audiovisual material available at
http://www.ics.forth.gr/wnsp05/index.html, where the corresponding lectures and
PowerPoint presentations by the authors are available.
The contributions cover the following areas:
1. Speech analysis for speech synthesis, speech recognition, speech–non-speech
discrimination and voice quality assessment
2. Speaker recognition/verification from a natural or modified speech signal
3. Speech recognition
4. Speech enhancement
5. Emotional state detection
Speech Analysis
Although in many speech applications the estimation of the glottal waveform is
very useful, the estimation of this signal is not always robust and accurate. Given
a speech signal, the glottal waveform may be estimated through inverse filtering.
For certain speech processing areas like the analysis of pathologic voices, where
voice quality issues are closely related to the vocal folds activity, an accurate
inverse filtering is highly desired. The chapter by Jacqueline Walker and Peter
Murphy provides an extensive review of the estimation and analysis of the glot-
tal waveform. The presentation starts with analog inverse filtering approaches
and extends to approaches based on nonlinear least squares estimation methods.
Analysis of speech signals provides many features for efficient voice quality
assessment. Peter Murphy presents a tool based on the rahmonic analysis of
speech for detecting irregularities in synthetic and natural human voice signals.
The cepstrum is decomposed into two regions: the low quefrency and the high
quefrency. The first rahmonic in the high-quefrency region provides information
about the periodicity of a signal. In this chapter, a new measure taking into
account all rahmonics in the cepstrum is proposed, and results using synthetic
and real speech data are presented, thereby providing an additional measure alongside
the usual harmonic-to-noise ratio-related measures for voice quality assessment.
A new tool for spectral analysis of speech signals referred to as chirp group
delay is presented by Baris Bozkurt, Thierry Dutoit and Laurent Couvreur. With
this tool a certain number of group delay functions are computed in the z-plane
on circles different from the usual unit circle. Two important applications of this
tool are presented by the authors: formant tracking and a new set of features for
the automatic speech recognition task.
Two applications of the so-called neurocomputational speech and sounds pro-
cessing are presented in the chapter prepared by Jean Rouat, Stéphane Loiselle
and Ramin Pichevar. There is evidence that for both the visual and auditory
systems the sequence order of firing is crucial to perform recognition tasks (rank
order coding). In the first application a speech recognition system based on
rank order coding is presented and it is compared against a conventional HMM-
based recognizer. In the second application the acoustical source separation is
addressed where simultaneous auditory images are combined with a network of
oscillatory spiking neurons to segregate and bind auditory objects.
In the chapter presented by Maria Markaki, Michael Wohlmayer and Yannis
Stylianou a speech–non-speech classifier is developed based on modulation spec-
trograms. The information bottleneck method is used for extracting relevant
speech modulation frequencies across time and frequency dimensions, thereby
creating a set of characteristic modulation spectra for each type of sound. An
efficient and simple classifier based on the similarity of a sound to these charac-
teristic modulation spectra is presented.
The chapter by Yannis Pantazis and Yannis Stylianou deals with the auto-
matic detection of audible discontinuities in concatenative speech synthesis. Both
linear and nonlinear features are extracted at the boundaries of connected speech
segments and a list of distances is evaluated. For the evaluation purposes, results
from a subjective test for the same task have been taken into account. Among
the most promising features for this task are the amplitude and frequency mod-
ulations occurring in the spectra of the continuous speech when two perceptually
incompatible speech segments are joined. The Fisher linear discriminant appears
to achieve a high detection score.
Speaker Recognition/Verification
A review and perspectives on voice disguise and its automatic detection are
given in the chapter prepared by Patrick Perrot, Guido Aversano and Gérard
Chollet. A list of ways for modifying the quality of voice is presented along with
the different techniques proposed in the literature. A very difficult topic is that
of automatic detection of disguised voice. The authors describe a list of main
indicators that can be used for this task.
Bouchra Abboud, Hervé Bredin, Guido Aversano, and Gérard Chollet present
an overview on audio-visual identity verification tasks. Face and voice transfor-
mation techniques are reviewed for the face, speaker and talking face verification.
It is shown that modifying a rather limited amount of information is enough to
trouble state-of-the-art audiovisual identity verification systems. An explicit talking
face modeling is then proposed to overcome the weak points of these systems.
State-of-the-art systems and challenges in the text-independent speaker verifi-
cation task are presented in the chapter prepared by Dijana Petrovska-Delacrétaz,
Asma El Hannani and Gérard Chollet. Speaker variability and variability of the
transmission channel are discussed along with the possible choices of speech
parameterization and speaker models. The use of speech recognition for the
speaker verification task is also discussed, showing that a development of new
services based on speaker and speech recognition is possible.
Marcos Faundez-Zanuy and Mohamed Chetouani present an overview of non-
linear predictive models and their application in speaker recognition. Challenges
and possibilities in extracting nonlinear features towards this task are provided
along with the various strategies that one can follow for using these nonlinear
features. Both nonparametric (e.g., codebook based) and parametric approaches
(e.g., Volterra series) are described. A nonlinear extension of the well-known
linear prediction theory is provided, referred to as neural predictive coding.
Speech Recognition
Although hidden Markov models dominate the speech recognition area, the sup-
port vector machine (SVM) is a powerful tool in machine learning, and the chapter
by R. Solera-Ureña, J. Padrell-Sendra, D. Martı́n-Iglesias, A. Gallardo-Antolı́n,
C. Peláez-Moreno and F. Dı́az-de-Marı́a is an overview of the application of SVMs
to automatic speech recognition, covering isolated word recognition, connected
digit recognition and continuous speech recognition.
Speech Enhancement
Considering single and multichannel-based solutions for the speech enhancement
task, A. Hussain, M. Chetouani, S. Squartini, A. Bastari and F. Piazza present an
overview of the noise reduction approaches focusing on the additive independent
noise case. The non-Gaussian properties of the involved signals and the lack
of linearity in the related processes provide a motivation for the development
of nonlinear algorithms for the speech enhancement task. A very useful table
summarizing the advantages and drawbacks of the currently proposed nonlinear
techniques is presented at the end of the chapter.
Scientific Committee
Gérard Chollet ENST Paris, France
Thierry Dutoit FPMS, Mons, Belgium
Anna Esposito 2nd University of Napoli, Italy
Marcos Faundez-Zanuy EUPMT, Barcelona, Spain
Eric Keller University of Lausanne, Switzerland
Gernot Kubin TUG, Graz, Austria
Petros Maragos NTUA, Athens, Greece
Jean Schoentgen Université Libre de Bruxelles, Belgium
Yannis Stylianou University of Crete, Greece
Program Committee
Conference Chair Yannis Stylianou (University of Crete, Greece)
Organizing Chair Yannis Agiomyrgiannakis (University of Crete,
Greece)
Audio-Video Material Manolis Zouraris (University of Crete, Greece)
Local Arrangements Maria Markaki (University of Crete, Greece)
Sponsoring Institutions
COST Office, Brussels, Belgium
University of Crete, Heraklion, Crete, Greece
Institute of Computer Science, FORTH, Heraklion, Crete, Greece
A Review of Glottal Waveform Analysis
Jacqueline Walker and Peter Murphy
1 Introduction
the better this knowledge can be used for speaker identification (or conversely
it may lead to improved glottal de-emphasis strategies for speech recognition).
Due to non-availability of a standard, automatic GIF algorithm for use on con-
nected speech (perhaps due to a lack of knowledge of the overall voice production
process), representation and processing of the speech signal have generally side-
stepped the issue of accurate glottal inverse filtering, and pragmatic alternatives
have been implemented. However, these alternatives come at a cost, which can
necessitate the recording of considerably more data than would be required if
the dynamics of voice production were better understood and the appropriate
parameters could be extracted. For example, in existing methods for pitch mod-
ification of voiced speech, the glottal parameters are not manipulated explicitly,
e.g., in the sinusoidal model a deconvolution is performed to extract the filter
and residual error signal. The linear prediction residual signal (or the corre-
sponding harmonic structure in the spectrum) is then altered to implement the
desired pitch modification. This zero-order deconvolution ensures that the for-
mant frequencies remain unaltered during the modification process, giving rise
to a shape-invariant pitch modification. However, examining this from a voice
production viewpoint reveals two important consequences of this approach: to
a first approximation, the glottal closed phase changes and eventually overlaps
as the fundamental frequency (f0) increases, and if the glottal periods are scaled,
the spectral tilt will change [38]. The former may explain the hoarseness re-
ported in [65] after 20% modification. The solution to this problem has been
to record more data over a greater range of f0 and always modify within this
limit. Integrating better production knowledge into the system would facilitate
modification strategies over a broader range without recourse to such an exten-
sive data set. In what follows, we review the state of the art in glottal inverse
filtering and present a discussion of some of the important issues which have
not always been at the forefront of consideration by investigators. After first
establishing a framework for glottal waveform inverse filtering, the range of ap-
proaches taken by different investigators is reviewed. The discussion begins with
analog inverse filtering using electrical networks [53] and extends to the most
recent approaches which use nonlinear least squares estimation [64]. A brief re-
view of the earliest approaches shows that most of the basic characteristics of the
glottal waveform and its spectrum were established very early. There was also
interest in developing specialist equipment which could aid recovery of the wave-
form. With the introduction of the technique of linear prediction and the steady
improvement of computing power, digital signal processing techniques came to
dominate. Although parametric modelling approaches have been very successful,
alternatives to second-order statistics have not been used extensively and have
not, so far, proved very productive. As the glottal waveform is a low frequency
signal, recording conditions and phase response play a very important role in its
reconstruction. However, sufficient attention has not always been paid to these
issues. Finally, the question of identifying a good result in the reproduction of
such an elusive signal is discussed.
where A is a gain factor applied to the input p (n). In the frequency domain,
this model represents an all-pole filter applied to the input:
V(z) = \frac{A}{1 + \sum_{k=1}^{p} a_k z^{-k}} .    (2)
As is well known, using the method of least squares, this model has been
successfully applied to a wide range of signals: deterministic, random, station-
ary and non-stationary, including speech, where the method has been applied
assuming local stationarity. The linear prediction approach has been dominant
in speech due to its advantages:
1. Mathematical tractability of the error measure (least squares) used.
2. Favorable computational characteristics of the resulting formulations.
3. Wide applicability to a range of signal types.
4. Generation of a whitening filter which admits of two distinct and useful
standard input types.
5. Stability of the model.
6. Spectral estimation properties.
Applied to the acoustic wave signal, linear prediction is used to produce an all-
pole model of the system filter, V (z), which turns out to be a model of the
vocal tract and its resonances or formants. As noted above, the assumed input
to such a model is either an impulse or white noise, both of which turn out to
suit speech very well. White noise is a suitable model for the input to the vocal
tract filter in unvoiced speech and an impulse (made periodic or pseudo-periodic
by application in successive pitch periods) is a suitable model for the periodic
excitation in voiced speech. In the simplest model of voiced speech, shown in
Fig. 1, the input is the flow of air provided by the periodic opening and closing
of the glottis, represented here by a periodic impulse train p (n). The vocal tract
acts as a linear filter, v (n), resonating at specific frequencies known as formants.
Speech, s (n), is produced following radiation at the lips represented by a simple
differentiation, r (n).
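To make the model of Fig. 1 concrete, the following sketch synthesizes a crude voiced sound from a periodic impulse train p(n), an all-pole vocal tract filter v(n) and a first-difference radiation filter r(n). The sampling rate, fundamental frequency and formant values are illustrative assumptions only and are not taken from the chapter.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                          # assumed sampling rate (Hz)
f0 = 100                           # assumed fundamental frequency (Hz)
n_samples = fs // 2                # half a second of samples

# Periodic impulse train p(n): one unit impulse per pitch period
p = np.zeros(n_samples)
p[::fs // f0] = 1.0

# All-pole vocal tract v(n): resonances placed at rough formant values
# (500, 1500, 2500 Hz) with arbitrary bandwidths, purely for illustration.
a = np.array([1.0])
for fc, bw in [(500, 60), (1500, 90), (2500, 120)]:
    r = np.exp(-np.pi * bw / fs)
    a = np.convolve(a, [1.0, -2 * r * np.cos(2 * np.pi * fc / fs), r * r])

vt_out = lfilter([1.0], a, p)              # V(z) = 1/A(z) applied to p(n)

# Lip radiation r(n) modelled as a first difference, R(z) = 1 - z^-1
s = lfilter([1.0, -1.0], [1.0], vt_out)    # synthetic speech s(n)
```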
As a consequence of the least squares modelling approach, two models fit the as-
sumptions of linear prediction: the input impulse and white noise. Both of these
inputs have a flat spectrum. In other words, the inverse filter which results from
the process is a whitening filter and what remains following inverse filtering is
the modelling error or residual. The simplest glottal pulse model is the periodic
impulse train [21] as used in the LPC vocoder [78]. However, speech synthesiz-
ers and very low bit rate speech coders using only periodic impulses and white
noise as excitations have been found to be poor at producing natural sounding
speech. To improve speech quality, it has been found useful to code the residual,
for example using vector quantized codebooks, in speech coding techniques such
as CELP [68], since to the extent that the residual differs from a purely random
signal in practice, it retains information about the speech including the glottal
waveform.
The linear speech production model can be extended as shown in Fig. 2 so
that it includes two linearly separable filters [21]. The glottal excitation, p (n)
does not represent a physical signal but is simply the mathematical input to
a filter which will generate the glottal flow waveform, g (n). In this model, lip
radiation is represented by a simple differencing filter:
R(z) = 1 - z^{-1} .    (3)
and glottal inverse filtering requires solving the equation:
P(z)\,G(z) = \frac{S(z)}{A\,V(z)\,R(z)} .    (4)
To remove the radiation term, define the differentiated glottal flow waveform as
the effective driving function:
Q(z) = P(z)\,G(z)\,R(z) .    (5)
Fig. 2. The linear speech model with linearly separable source model
as shown in part (a) of Fig. 3. Now, as shown in part (b) of Fig. 3, inverse
filtering simplifies to:
Q(z) = \frac{S(z)}{A\,V(z)} .    (6)
To solve for both Q(z) and V (z) is a blind deconvolution problem. However,
during each period of voiced speech the glottis closes, g(n) = 0, providing an
impulse to the vocal tract. While the glottis is closed, the speech waveform must
be simply a decaying oscillation which is only a function of the vocal tract and
its resonances or formants [81] i.e. it represents the impulse response of the vocal
tract. Solving for the system during this closed phase should exactly capture the
vocal tract filter, V (z), which may then be used to inverse filter and recover
Q(z). G (z) may then be reconstructed by integration (equivalently by inverse
1
filtering by R(z) ). This approach is known as closed phase inverse filtering and
is the basis of most approaches to recovering the glottal flow waveform.
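A minimal sketch of closed phase inverse filtering along these lines is given below. It assumes that the closed phase boundaries have already been located by some other means (for instance from an EGG signal); the covariance-method LP estimate, the model order and the leaky integrator that stands in for 1/R(z) are illustrative choices rather than the specific algorithm of any of the papers cited here.

```python
import numpy as np
from scipy.signal import lfilter

def covariance_lpc(frame, order):
    """Covariance-method linear prediction over a short (closed phase) window."""
    N = len(frame)
    # Rows of X are [frame[n-1], ..., frame[n-order]] for n = order ... N-1
    X = np.column_stack([frame[order - k:N - k] for k in range(1, order + 1)])
    y = frame[order:N]
    a = np.linalg.lstsq(X, y, rcond=None)[0]     # predictor coefficients a_k
    return np.concatenate(([1.0], -a))           # A(z) = 1 - sum_k a_k z^-k

def closed_phase_glottal_flow(s, cp_start, cp_end, order=12):
    """s: one frame of voiced speech; (cp_start, cp_end): assumed closed phase indices."""
    A = covariance_lpc(s[cp_start:cp_end], order)   # vocal tract estimate, V(z) = 1/A(z)
    q = lfilter(A, [1.0], s)                        # Q(z): differentiated glottal flow
    g = lfilter([1.0], [1.0, -0.999], q)            # leaky integration approximating 1/R(z)
    return q, g
```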
(Fig. 3: block diagram of inverse filtering: the speech spectrum S(z) = G(z)P(z)R(z) is scaled by 1/A and passed through the inverse vocal tract filter V^{-1}(z) to give Q(z).)
estimation and a capacitor in the inverse filter network could be adjusted until
the ripple disappeared. As well as attempting to recover the glottal waveform
in the time domain, it could be modelled in the frequency domain. In [49], a
digital computer was used to perform a pitch synchronous analysis using succes-
sive approximations to find the poles and zeros present in the speech spectrum.
With the assumption of complete glottal closure, so that the glottal waveform has
discontinuous first derivatives at the endpoints of its open phase, 0 and
Tc, and a smooth open phase (i.e., the second derivative exists
and is bounded), it is then shown that the glottal waveform can be modelled by
a set of approximately equally spaced zeros σ + jω, where:
\sigma = \ln\left( \frac{-g(T_c^-)}{g(0^+)} \right)    (7)

\omega = \frac{\pi}{T_c}\,(1 \pm 2n)    (8)
Thus, the vocal tract is modelled by poles and the glottal waveform by zeros.
Despite its simplicity, this approach to finding a model of the glottal waveform
has not often been pursued since. Most pole-zero modelling techniques are ap-
plied to find vocal tract zeros, especially those occurring in nasal or consonant
sounds. Furthermore, it is often pointed out that it is not possible to unam-
biguously determine whether zeros ‘belong’ to the glottal waveform or the vocal
tract. However, there are exceptions [48].
Glottal waveform identification and inverse filtering with the aim of devel-
oping a suitable glottal waveform analog to be used in speech synthesis was
first attempted in [66]. In this work, inverse filtering was performed pitch syn-
chronously over the whole pitch period. A variety of waveforms ranging from
or impedance provided by the tube which could have been too high [32]. Mask
(or tube) loading can cause attenuation of and, more importantly, shifts in the
formants [67].
Such a model allows for more realistic modeling of speech sounds apart from
vowels, particularly nasals, fricatives and stop consonants [58]. However, estimat-
ing the parameters of a pole-zero model is a nonlinear estimation problem [47].
There are many different approaches to the estimation of a pole-zero model for
speech ranging from inverse LPC [47], iterative pre-filtering [72], [73], SEARMA
(simultaneous estimation of ARMA parameters) [58], weighted recursive least
squares (WRLS) [29], [54], [55], weighted least squares lattice [45], WRLS with
variable forgetting factor (WRLS-VFF) [18]. These methods can give very good
results but are computationally more intensive. They have the advantage that
they can easily be extended to track the time-varying characteristics of speech
[18],[54],[77], but the limited amount of data can lead to problems with conver-
gence. Parametric techniques also have stability problems when the model order
is not estimated correctly [58].
The periodic nature of voiced speech is a difficulty [55] which may be dealt
with by incorporating simultaneous estimation of the input [54], [55]. If the input
the inverse filter with respect to a glottal waveform model for the whole pitch
period [51]. Interestingly in this approach, the result is the appearance of ripple
in the source-corrected inverse filter during the closed phase of the glottal source
even for synthesized speech with zero excitation during the closed glottal phase (note
that the speech was synthesized using the Ishizaka-Flanagan model [37]). Thus,
this ripple must be an analysis artefact due to the inability of the model to ac-
count for it [51]. Improvements to the model are presented in [52],[76] and the
sixth-order Milenkovic model is used in GELP [19].
In terms of the potential applications of glottal inverse filtering, the main
difficulty with the use of glottal source models in glottal waveform estimation
arises from the influence the models may have on the ultimate shape of the result.
This is a particular problem with pathological voices. The glottal waveforms of
these voices may diverge quite a lot from the idealized glottal models. As a
result, trying to recover such a waveform using an idealized source model as
a template may give less than ideal results. A model-based approach which
partially avoids this problem is described in [64] where nonlinear least squares
estimation is used to fit the LF model to a glottal derivative waveform extracted
by closed phase filtering (where the closed phase is identified by the absence of
formant modulation). This model-fitted glottal derivative waveform is the coarse
structure. The fine structure of the waveform is then obtained by subtraction
from the inverse filtered waveform. In this way, individual characteristics useful
for speaker identification may be isolated. This approach also shows promise for
isolating the characteristics of vocal pathologies.
period, the iterative adaptive procedure may be applied pitch synchronously [7]
as shown in Fig. 5.
Comparing the results of the IAIF method with closed phase inverse filtering
shows that the IAIF approach seems to produce waveforms which have a shorter
and rounder closed phase. In [7] comparisons are made between original and
estimated waveforms for synthetic speech sounds. It is interesting to note that
pitch synchronous IAIF produces a closed phase ripple in these experiments
(when there was none in the original synthetic source waveform).
In [8] discrete all-pole modelling was used to avoid the bias given toward har-
monic frequencies in the model representation. An alternative iterative approach
is presented in [2]. The method de-emphasises the low frequency glottal infor-
mation using high-pass filtering prior to analysis. In addition to minimising the
influence of the glottal source, an expanded analysis region is provided in the
form of a pseudo-closed phase. The technique then derives an optimum vocal
tract filter function through applying the properties of minimum phase systems.
(Figures: block diagrams of the IAIF method and of its pitch synchronous variant (Fig. 5). In the IAIF diagram the high-pass filtered speech shp(n) passes through a first LPC stage and inverse filter (G1(z), V1(z)) and a second LPC stage and inverse filter (G2(z), V2(z)), followed by inverse filtering and integration to give the final glottal waveform estimate g2(n). In the pitch synchronous diagram, a first pass IAIF-1 establishes pitch synchronism for a second pass IAIF-2, which outputs g(n).)
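The sketch below follows the spirit of the two-stage structure in the block diagram. The LP orders, the windowing and the leaky integrator are assumptions made for illustration; the code is not the published implementation of [7].

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import toeplitz

def lpc(x, order):
    """Autocorrelation-method LP, returned as A(z) = [1, -a1, ..., -ap]."""
    x = x * np.hanning(len(x))
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def integrate(x, leak=0.999):
    return lfilter([1.0], [1.0, -leak], x)   # leaky integrator, approximates 1/(1 - z^-1)

def iaif_like(frame, p_vt=12, p_gl=4):
    """Rough two-stage IAIF-style estimate of the glottal waveform."""
    g1_model = lpc(frame, 1)                        # stage 1: crude glottal/tilt estimate
    s1 = lfilter(g1_model, [1.0], frame)            # remove the estimated tilt
    vt1 = lpc(s1, p_vt)                             # first vocal tract estimate V1(z)
    g1 = integrate(lfilter(vt1, [1.0], frame))      # first glottal flow estimate
    g2_model = lpc(g1, p_gl)                        # stage 2: refined glottal contribution G2(z)
    s2 = integrate(lfilter(g2_model, [1.0], frame)) # remove it and cancel radiation
    vt2 = lpc(s2, p_vt)                             # refined vocal tract estimate V2(z)
    return integrate(lfilter(vt2, [1.0], frame))    # final glottal waveform estimate g2(n)
```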
glottal waveform given an impulse train input. Doubly differentiating the speech
production model gives:
by using the Fourier series and thus performing a pitch synchronous analysis [34].
It has been demonstrated that the higher order statistics approach can recover a
system filter for speech, particularly for speech sounds such as nasals [34]. Such a
filter may be non-minimum phase and when its inverse is used to filter the speech
signal it will return a residual which is much closer to a pure pseudo-periodic pulse
train than inverse filters produced by other methods [14], [34]. In [14], the speech
input estimate generated by this approach is used in a second step of ARMA
parameter estimation by an input-output system identification method.
The properties of the cepstrum have also been exploited in speech processing.
Transformed into the cepstral domain, the convolution of input pulse train and
vocal tract filter becomes an addition of disjoint elements, allowing the sepa-
ration of the filter from the harmonic component [61]. Cepstral techniques also
have some limitations including the requirement for phase unwrapping and the
fact that the technique cannot be used (although it often is) when there are zeros
on the unit circle. In [42], various ARMA parameter estimation approaches are
applied to the vocal tract impulse response recovered from the cepstral analysis
of the speech signal [60].
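As an illustration of this separation, the sketch below lifters the low-quefrency part of the real cepstrum to obtain a smooth vocal tract log-spectrum estimate, leaving the harmonic (source) structure in the remainder; the 2 ms quefrency cut-off is an arbitrary illustrative choice.

```python
import numpy as np

def liftered_spectra(frame, fs, cutoff_ms=2.0):
    """Split the log magnitude spectrum of a frame into a smooth (vocal tract)
    part and a rapidly varying (harmonic/source) part by cepstral liftering."""
    spec = np.log(np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12)
    ceps = np.fft.irfft(spec)                    # real cepstrum of the frame
    cutoff = int(cutoff_ms * 1e-3 * fs)          # quefrency cut-off in samples
    lifter = np.zeros(len(ceps))
    lifter[:cutoff] = 1.0
    lifter[-cutoff:] = 1.0                       # keep the symmetric low-quefrency region
    envelope = np.fft.rfft(ceps * lifter).real   # smooth vocal tract log spectrum
    source = spec - envelope                     # remaining harmonic structure
    return envelope, source
```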
There are a few examples of direct glottal waveform recovery using higher
order spectral or cepstral techniques. In [80], ARMA modelling of the linear bis-
pectrum [25] was applied to speech for joint estimation of the vocal tract model
and the glottal volume velocity waveform using higher-order spectral factoriza-
tion [75] with limited success. Direct estimation from the complex cepstrum was
used in [4] based on the assumption that the glottal volume velocity waveform
may be modelled as a maximum phase system. As the complex cepstrum sepa-
rates into causal and acausal parts corresponding to the minimum and maximum
phase parts of the system model this then permits a straightforward separation
of the glottal waveform.
With a fundamental frequency varying in the range 80–250 Hz the glottal wave-
form is a low-frequency signal and so the low-frequency response, including the
phase response, of the recording equipment used is an important factor in glottal
pulse identification. However, many authors do not report in detail on this. In
[81], the following potential problems are identified: ambient noise, low-frequency
bias due to breath burst on the microphone, equipment and tape distortion of
the signal and improper A/D conversion (p. 355). The problem of ambient noise
can be overcome by ensuring suitable recording conditions. Use of special equip-
ment [67], [70] can also minimize the noise problem, but is not always possible, or
may not be relevant to the method under investigation. Paradoxically, the prob-
lem of the low-frequency bias producing a trend in the final recovered waveform
can occur when a high quality microphone and amplifier with a flat frequency
response down to 20 Hz are used. It can be overcome by high pass filtering with
a cut-off frequency no greater than half the fundamental frequency [81] or by
cancelling out the microphone transfer function [51], [79].
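A possible form of such a correction is sketched below; the filter order is arbitrary, and zero-phase filtering (filtfilt) is used so that the correction itself does not introduce further low-frequency phase distortion.

```python
from scipy.signal import butter, filtfilt

def remove_lf_bias(speech, fs, f0_hz):
    """High-pass filter a recording with a cut-off no greater than half the
    fundamental frequency, as suggested in [81]."""
    b, a = butter(2, 0.5 * f0_hz, btype='highpass', fs=fs)
    return filtfilt(b, a, speech)                # zero-phase filtering
```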
Phase distortion was a problem with analog tape recording [12], [36] and is
manifested as a characteristic ‘humped’ appearance, although doubtless there are
other causes as such waveforms are still being recovered [43]. For example, phase
distortion can result from HOS and cepstral approaches when phase errors occur
due to the need for phase unwrapping [80]. However, more modern recording
techniques especially involving the use of personal computer (PC) sound cards
can also introduce phase distortion at low frequencies which will impact on the
glottal waveform reconstruction. This is clearly demonstrated by experiments
conducted by [1] where synthetic glottal waveforms created using the LF model
were recorded through a PC sound card (Audigy2 SoundBlaster) resulting in the
characteristic humped appearance. The effect was noticeable up to 320 Hz, but
was especially pronounced at the lowest fundamental frequency (80 Hz). In all
cases, the flat closed phase was entirely lost. The correction technique proposed
is to model the frequency response of the recording system using a test signal
made of a sum of sinusoids and thus to develop a compensating filter [1].
Few researchers take the care shown in [79] who plots an example of a glottal
waveform with a widely varying baseline due to the influence of low-frequency
noise picked up by a high-quality microphone such as a Brüel & Kjær 4134
[6], [79]. To overcome this problem, it is common to high-pass filter the speech
[6], [42], [81] but according to [79] this is not sufficient as it removes the flat
part of the closed phase and causes an undershoot at glottal closure: a better
approach is to compensate by following the high pass filter by a low pass filter.
According to [81], a pole may arise at zero frequency due to a non-zero mean
in the typically short duration closed phase analysis window. It appears that in
[81] such a pole is removed from the inverse filter if it arises (and not by linear
phase high pass filtering as suggested by [79]), whereas in [79] the resulting
bias is removed by polynomial fitting to ‘specific points of known closed phase’
(presumably the flattest points). An alternative approach is to take advantage
of specialized equipment such as the Rothenberg mask [67] to allow for greater
detail of measurement of the speech signal at low frequencies. The characteristics
of the mask may then be removed by filtering during analysis [51].
Most experimenters who have reported on recording conditions have used con-
denser type microphones [6], [79], [81] with the exception of [51] who claims that
these microphones are prone to phase distortion around the formant frequen-
cies. However, available documentary information on microphone characteristics
[13], the weight of successful inverse filtering results using condenser microphone
recordings and direct comparison of results with different microphone types [15],
[64] seem to contradict this claim. Depending on the application, it will not
always be possible to apply such stringent recording conditions. For example,
Plumpe et al. [64] test a glottal flow based speaker identification on samples
from the TIMIT and NTIMIT databases. The TIMIT database is recorded with
a high-quality (Sennheiser) microphone in a quiet room while the NTIMIT data-
base represents speech of telephone-channel quality. Here it is in fact the cheaper
microphone which is suspected of causing phase distortion which shows up in
the estimated glottal flow derivatives. In other cases, the recording conditions
may not be under the control of the investigator who may be using commercially
provided data sources such as [16].
4 Evaluation of Results
One of the primary difficulties in glottal pulse identification is in the evaluation of
the resulting glottal flow waveforms. How do we know we have the ‘right answer’?
How do we even know what the ‘right answer’ looks like? There are several
approaches which can be taken. One approach is to verify the algorithm which
is being used for the glottal flow waveform recovery. Algorithms can be verified by
applying the algorithm to a simulated system which may be synthesized speech
but need not be [41], [42]. In the case of synthesized speech, the system will be a
known all-pole vocal tract model and the input will be a model for a glottal flow
waveform. The success of the algorithm can be judged by quantifying the error
between the known input waveform and the version recovered by the algorithm.
This approach is most often used as a first step in evaluating an algorithm [6],
[7], [48], [77], [80] and can only reveal the success of the algorithm in inverse
filtering a purely linear time-invariant system. Synthesized speech can also be
provided to the algorithm using a more sophisticated articulatory model [37]
which allows for source-tract interaction [51].
Once an algorithm has been verified and is being used for inverse filtering real
speech samples, there are two possible approaches to evaluating the results. One
is to compare the waveforms obtained with those obtained by other (usually
earlier) approaches. As, typically, the aim of this is to establish that the new
approach is superior, the objectivity of this approach is doubtful. This approach
can be made most objective when methods are compared using synthetic speech
and results can be compared with the original source, as in [7]. However, the
objectivity of this approach may also be suspect because the criteria used in the
comparison are often both subjective and qualitative as for example in [77] where
visual inspection seems to be the main criterion: “The WRLS-VFF method ap-
pears to agree with the expected characteristics for the glottal excitation source
such as a flat closed region and a sharp slope at closure better than the other two
methods.” (p. 392) Other examples of such comparisons are in [24] and [43]. In
many papers no comparisons are made, a stance which is not wholly unjustified
because there is not a great deal of data available to say which are the correct
glottal flow waveforms.
On the other hand, using two different methods to extract the glottal flow could
be an effective way to confirm the appearance of the waveform as correct. The ra-
tionale behind this is that if two (or more) different approaches garner the same
result then it has a greater chance of being ‘really there’. If one of the methods, at
least for experimental work, utilizes additional help such as the EGG to accurately
identify glottal closure, then that would provide additional confirmation. This ap-
proach was taken in [43] but the results, albeit similar for two approaches, are
most reminiscent of a type of waveform labelled as exhibiting phase distortion in
[81]. The same could be said about many of the results offered in [24] and [80], where
low-frequency (baseline) drift is also in evidence. Once again, if new techniques for
glottal inverse filtering produce waveforms which ‘look like’ the other waveforms
which have been produced before, then they are evaluated as better than those
which do not: examples of the latter include [4], [22].
Improved guidelines for assessing glottal waveform estimates can come from
experiments with physiologically based articulatory synthesis methods. Glottal
inverse filtering can be applied to speech produced with such models where the
models are manipulated to produce various effects. The types of glottal wave-
forms recovered can then be assessed in the light of the perturbations introduced.
An interesting example of what is possible with this idea is shown by [20] where
various degrees and types of air leakage are shown to correlate with varying
amounts of open and closed phase ripple in the derivative glottal flow and the
glottal flow itself.
An alternative approach is to apply some objective mathematical criterion. In
[23], it is shown how the evolution of the phase-plane plot of g(t) versus dg(t)/dt to
a single closed loop indicates that a periodic solution has been produced and all
resonances have been removed since resonances will appear as self-intersecting
loops on the phase-plane plot.
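A phase-plane check of this kind can be sketched with a simple numerical derivative, as below; the derivative estimator and plotting details are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

def phase_plane(g, fs):
    """Plot g(t) against dg(t)/dt; convergence to a single closed loop suggests
    that the vocal tract resonances have been removed [23]."""
    dg = np.gradient(g) * fs         # numerical derivative of the glottal flow
    plt.plot(g, dg, linewidth=0.8)
    plt.xlabel('g(t)')
    plt.ylabel('dg(t)/dt')
    plt.show()
```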
be obtained by inverse filtering by a closed phase method and it should have the
ripple (more visible on the differentiated waveform) and a flat closed phase.
However, a common result in inverse filtering is a ripple in the closed phase
of the glottal volume velocity waveform. In [79] this occurs in hoarse or breathy
speech and is assumed to show that there is air flow during the glottal closed
phase. In [79] it is shown through experiments that this small amount of air
flow does not significantly alter the inverse filter coefficients (filter pole positions
change by < 4%) and that true non-zero air flow can be captured in this way.
However, the non-zero air flow and resultant source-tract interaction may still
mean that the ‘true’ glottal volume velocity waveform is not exactly realized
[79]. A similar effect is observed when attempting to recover source waveforms
from nasal sounds. Here the strong vocal tract zeros mean that the inverse filter
is inaccurate and so a strong formant ripple appears in the closed phase [79].
Most recently, a sliding window approach to closed phase inverse filtering has
been attempted [21], [24]. Originally this approach required manual intervention
to choose the best glottal waveform estimates from those obtained in periods
of glottal closure in the speech waveform which were also identified by the op-
erator [21]. Again, this is a very subjective procedure. Such an approach may
be automated by using the maximum amplitude negative peaks in the linear
prediction residual to estimate the glottal closure, but this is nothing new [81].
The best glottal waveform estimates are also chosen automatically by choosing
the smoothest estimates [24]. The results obtained by this method were verified
by comparing with waveforms obtained using the EGG to detect glottal closure.
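One way to automate the first of these steps is sketched below: the largest negative peaks of the linear prediction residual are taken as glottal closure candidates. The peak-height and peak-spacing thresholds are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter, find_peaks

def glottal_closure_candidates(speech, lpc_poly, fs, f0_hz):
    """lpc_poly is an inverse filter A(z) = [1, -a1, ..., -ap] estimated elsewhere."""
    residual = lfilter(lpc_poly, [1.0], speech)
    min_dist = int(0.7 * fs / f0_hz)             # roughly one closure per pitch period
    peaks, _ = find_peaks(-residual, distance=min_dist,
                          height=0.3 * np.max(-residual))
    return peaks                                 # sample indices of candidate closures
```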
5 Conclusion
Although convincing results for glottal waveform characteristics are reported in
the literature from time to time, a standard fully automatic inverse filtering al-
gorithm is not yet available. An extensive review of the literature has established
that the salient features of the glottal waveform were established fairly early on,
as was the technique of choice which continues to be closed phase inverse fil-
tering. This technique has been successful because it allows the adoption of the
linear time-invariant model for both determining the filter in the source-filter
speech model and then for applying it as an inverse filter to recover the source.
Despite concern about features of recovered waveforms which may be due to in-
accuracies and oversimplifications in this model, alternative approaches have met
with limited success. ARMA modelling has limitations due to the insufficiency of
data and the ‘magic bullet’ promise of alternative statistical techniques such as
the cepstrum and higher order statistics has not delivered. Low frequency phase
response and low frequency noise have been shown to be important issues for
glottal waveform recovery (at least in some contexts such as vocal pathology, ex-
perimental studies on voice production and benchmark generation) which have
not always received due attention by researchers. However, nonlinear approaches
(with the exception of the statistical techniques mentioned already) are only just
beginning to be explored.
References
1. Akande, O. O.: Speech analysis techniques for glottal source and noise estimation
in voice signals. Ph. D. Thesis, University of Limerick (2004)
2. Akande, O. and Murphy, P. J.: Estimation of the vocal tract transfer function for
voiced speech with application to glottal wave analysis. Speech Communication,
46 (2005) 15–36
3. Akande, O., Murphy, P. J.: Improved speech analysis for glottal excited linear
predictive speech coding. Proc. Irish Signals and Systems Conference. (2004) 101–
106
4. Alkhairy, A.: An algorithm for glottal volume velocity estimation. Proc. IEEE Int.
Conf. Acoustics, Speech and Signal Processing. 1 (1999) 233–236
5. Alku, P., Vilkman, E., Laine, U. K.,: Analysis of glottal waveform in different
phonation types using the new IAIF-method. Proc. 12th Int. Congress Phonetic
Sciences, 4 (1991) 362–365
6. Alku, P.: An automatic method to estimate the time-based parameters of the
glottal pulseform. Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing.
2 (1992) 29–32
7. Alku, P.: Glottal wave analysis with pitch synchronous iterative adaptive inverse
filtering. Speech Communication. 11 (1992) 109–118
8. Alku, P., Vilkman, E.: Estimation of the glottal pulseform based on Discrete All-
Pole modeling. Proc. Int. Conf. on Spoken Language Processing. (1994) 1619-1622
9. Ananthapadmanabha, T. V., Fant, G.: Calculation of true glottal flow and its
components. STL-QPR. (1985) 1–30
10. Atal, B. S., Hanauer, S. L.: Speech analysis and synthesis by linear prediction of
the speech wave. J. Acoust. Soc. Amer. 50 (1971) 637–655
11. Bergstrom, A., Hedelin, P.: Codebook driven glottal pulse analysis. Proc. IEEE
Int. Conf. Acoustics, Speech and Signal Processing. 1 (1989) 53–56
12. Berouti, M., Childers, D., Paige, A.: Correction of tape recorder distortion. Proc.
IEEE Int. Conf. Acoustics, Speech and Signal Processing. 2 (1977) 397–400
13. Brüel & Kjær: Measurement Microphones. 2nd ed. (1994)
14. Chen, W.-T., Chi, C.-Y.: Deconvolution and vocal-tract parameter estimation of
speech signals by higher-order statistics based inverse filters. Proc. IEEE Workshop
on HOS. (1993) 51–55
15. Childers, D. G.: Glottal source modeling for voice conversion. Speech Communica-
tion. 16 (1995) 127–138
16. Childers, D. G.: Speech processing and synthesis toolboxes. Wiley: New York (2000)
17. Childers, D. G., Chieteuk, A.: Modeling the glottal volume-velocity waveform for
three voice types. J. Acoust. Soc. Amer. 97 (1995) 505–519
18. Childers, D. G., Principe, J. C., Ting, Y. T. Adaptive WRLS-VFF for Speech
Analysis. IEEE Trans. Speech and Audio Proc. 3 (1995) 209–213
19. Childers, D. G., Hu, H. T.: Speech synthesis by glottal excited linear prediction.
J. Acoust. Soc. Amer. 96 (1994) 2026-2036
20. Cranen, B., Schroeter, J.: Physiologically motivated modelling of the voice source
in articulatory analysis/synthesis. Speech Communication. 19 (1996) 1–19
21. Cummings, K. E., Clements, M. A.: Glottal Models for Digital Speech Processing:
A Historical Survey and New Results. Digital Signal Processing. 5 (1995) 21–42
22. Deng, H., Beddoes, M. P., Ward, R. K., Hodgson, M.: Estimating the Glottal
Waveform and the Vocal-Tract Filter from a Vowel Sound Signal. Proc. IEEE
Pacific Rim Conf. Communications, Computers and Signal Processing. 1 (2003)
297–300
23. Edwards, J. A., Angus, J. A. S.: Using phase-plane plots to assess glottal inverse
filtering. Electronics Letters 32 (1996) 192–193
24. Elliot, M., Clements, M.: Algorithm for automatic glottal waveform estimation
without the reliance on precise glottal closure information. Proc. IEEE Int. Conf.
Acoustics, Speech and Signal Processing. 1 (2004) 101–104
25. Erdem, A. T., Tekalp, A. M.: Linear Bispectrum of Signals and Identification of
Nonminimum Phase FIR Systems Driven by Colored Input. IEEE Trans. Signal
Processing. 40 (1992) 1469–1479
26. Fant, G. C. M.: Acoustic Theory of Speech Production. (1970) The Hague, The
Netherlands: Mouton
27. Fant, G., Liljencrants, J., Lin, Q.: A four-parameter model of glottal flow. STL-
QPR. (1985) 1–14
28. Fant, G., Lin, Q., Gobl, C.: Notes on glottal flow interaction. STL-QPR. (1985)
21–45
29. Friedlander, B.: A recursive maximum likelihood algorithm for ARMA spectral estimation. IEEE
Trans. Inform. Theory 28 (1982) 639–646
30. Fu, Q., Murphy, P.: Adaptive Inverse Filtering for High Accuracy Estimation of the
Glottal Source. Proc. NoLisp’03. (2003)
31. Fu, Q., Murphy, P. J.: Robust glottal source estimation based on joint source-filter
model optimization. IEEE Trans. Audio, Speech Lang. Proc., 14 (2006) 492–501
32. Hillman, R. E., Weinberg, B.: A new procedure for venting a reflectionless tube. J.
Acoust. Soc. Amer. 69 (1981) 1449–1451
33. Holmberg, E. R., Hillman, R. E., Perkell, J. S.: Glottal airflow and transglottal
air pressure measurements for male and female speakers in soft, normal and loud
voice. J. Acoust. Soc. Amer. 84 (1988) 511–529
34. Hinich, M. J., Shichor, E.: Bispectral Analysis of Speech. Proc. 17th Convention
of Electrical and Electronic Engineers in Israel. (1991) 357–360
35. Hinich, M. J., Wolinsky, M. A.: A test for aliasing using bispectral components. J.
Am. Stat. Assoc. 83 (1988) 499-502
36. Holmes, J. N.: Low-frequency phase distortion of speech recordings. J. Acoust. Soc.
Amer. 58 (1975) 747–749
37. Ishizaka, K., Flanagan, J. L.: Synthesis of voiced sounds from a two mass model
of the vocal cords. Bell Syst. Tech. J. 51 (1972) 1233–1268
38. Jiang, Y., Murphy, P. J.: Production based pitch modification of voiced speech.
Proc. ICSLP, (2002) 2073–2076
39. Klatt, D.: Software for a cascade/parallel formant synthesizer. J. Acoust. Soc.
Amer. 67 (1980) 971–994
40. Klatt, D., Klatt, L.: Analysis, synthesis, and perception of voice quality variations
among female and male talkers. J. Acoust. Soc. Amer. 87 (1990) 820–857
41. Konvalinka, I. S., Mataušek, M. R.: Simultaneous estimation of poles and zeros in
speech analysis and ITIF-iterative inverse filtering algorithm. IEEE Trans. Acoust.,
Speech, Signal Proc. 27 (1979) 485–492
42. Kopec, G. E., Oppenheim, A. V., Tribolet, J. M.: Speech Analysis by Homomorphic
Prediction. IEEE Trans. Acoust., Speech, Signal Proc. 25 (1977) 40–49
43. Krishnamurthy, A. K.: Glottal Source Estimation using a Sum-of-Exponentials
Model. IEEE Trans. Signal Processing. 40 (1992) 682–686
44. Krishnamurthy, A. K., Childers, D. G.: Two-channel speech analysis. IEEE Trans.
Acoust., Speech, Signal Proc. 34 (1986) 730–743
45. Lee, D. T. L., Morf, M., Friedlander, B.: Recursive least squares ladder estimation
algorithms. IEEE Trans. Acoust., Speech, Signal Processing. 29 (1981) 627–641
46. Lee, K., Park, K.: Glottal Inverse Filtering (GIF) using Closed Phase WRLS-VFF-
VT Algorithm. Proc. IEEE Region 10 Conference. 1 (1999) 646–649
47. Makhoul, J.: Linear Prediction: A Tutorial Review. Proc. IEEE. 63 (1975) 561–580
48. Mataušek, M. R., Batalov, V. S.: A new approach to the determination of the
glottal waveform. IEEE Trans. Acoust., Speech, Signal Proc. 28 (1980) 616–622
49. Mathews, M. V., Miller, J. E., David, Jr., E. E.: Pitch synchronous analysis of
voiced sounds. J. Acoust. Soc. Amer. 33 (1961) 179–186
50. Mendel, J. M.: Tutorial on Higher-Order Statistics (Spectra) in Signal Processing
and System Theory: Theoretical Results and Some Applications. Proc. IEEE. 79
(1991) 278–305
51. Milenkovic, P.: Glottal Inverse Filtering by Joint Estimation of an AR System with
a Linear Input Model. IEEE Trans. Acoust., Speech, Signal Proc. 34 (1986) 28–42
52. Milenkovic, P. H.: Voice source model for continuous control of pitch period. J.
Acoust. Soc. Amer. 93 (1993) 1087-1096
53. Miller, R. L.: Nature of the Vocal Cord Wave. J. Acoust. Soc. Amer. 31 (1959)
667-677
54. Miyanaga, Y., Miki, M., Nagai, N.: Adaptive Identification of a Time-Varying
ARMA Speech Model. IEEE Trans. Acoust., Speech, Signal Proc. 34 (1986)
423–433
55. Miyanaga, Y., Miki, N., Nagai, N., Hatori, K.: A Speech Analysis Algorithm which
eliminates the Influence of Pitch using the Model Reference Adaptive System. IEEE
Trans. Acoust., Speech, Signal Proc. 30 (1982) 88–96
56. Monsen, R. B., Engebretson, A. M.: Study of variations in the male and female
glottal wave. J. Acoust. Soc. Amer. 62 (1977) 981–993
57. Monsen, R. B., Engebretson, A. M., Vemula, N. R.: Indirect assessment of the
contribution of subglottal air pressure and vocal-fold tension to changes of funda-
mental frequency in English. J. Acoust. Soc. Amer. 64 (1978) 65–80
58. Morikawa, H., Fujisaki, H.: Adaptive Analysis of Speech based on a Pole-Zero
Representation. IEEE Trans. Acoust., Speech, Signal Proc. 30 (1982) 77–87
59. Nikias, C. L., Raghuveer, M. R.: Bispectrum Estimation:A Digital Signal Process-
ing Framework. Proc. IEEE. 75 (1987) 869–891
60. Oppenheim, A. V.: A speech analysis-synthesis system based on homomorphic
filtering. J. Acoust., Soc. Amer. 45 (1969) 458–465
61. Oppenheim, A. V., Schafer, R. W.: Discrete-Time Signal Processing. Englewood
Cliffs:London Prentice-Hall (1989)
62. Pan, R., Nikias, C. L.: The complex cepstrum of higher order cumulants and non-
minimum phase system identification. IEEE Trans. Acoust., Speech, Signal Proc.
36 (1988) 186–205
63. Parthasarathy, S., Tufts, D. W.: Excitation-Synchronous Modeling of Voiced
Speech. IEEE Trans. Acoust., Speech, Signal Proc. 35 (1987) 1241–1249
64. Plumpe, M. D., Quatieri, T. F., Reynolds, D. A.: Modeling of the Glottal Flow
Derivative Waveform with Application to Speaker Identification. IEEE Trans.
Speech and Audio Proc. 7 (1999) 569–586
65. Quatieri, T. F., McAulay, R. J.: Shape invariant time-scale and pitch modification
of speech. IEEE Trans. Signal Process., 40 (1992) 497–510
66. Rosenberg, A.: Effect of the glottal pulse shape on the quality of natural vowels.
J. Acoust. Soc. Amer. 49 (1971) 583–590
67. Rothenberg, M.: A new inverse-filtering technique for deriving the glottal air flow
waveform. J. Acoust. Soc. Amer. 53 (1973) 1632–1645
68. Schroeder, M. R., Atal, B. S.: Code-excited linear prediction (CELP): High quality
speech at very low bit rates. Proc. IEEE Int. Conf. Acoustics, Speech and Signal
Processing. 10 (1985) 937–940
69. Shanks, J. L.: Recursion filters for digital processing. Geophysics. 32 (1967) 33–51
70. Sondhi, M. M.: Measurement of the glottal waveform. J. Acoust. Soc. Amer. 57
(1975) 228–232
71. Sondhi, M. M., Resnik, J. R.: The inverse problem for the vocal tract: Numerical
methods, acoustical experiments, and speech synthesis. J. Acoust. Soc. Amer. 73
(1983) 985–1002
72. Steiglitz, K.: On the simultaneous estimation of poles and zeros in speech analysis.
IEEE Trans. Acoust., Speech, Signal Proc. 25 (1977) 194–202
73. Steiglitz, K., McBride, L. E.: A technique for the identification of linear systems.
IEEE Trans. Automat. Contr., 10 (1965) 461–464
74. Stylianou, Y.: Applying the harmonic plus noise model in concatenative speech
synthesis. IEEE Trans. Speech Audio Process., 9(2001) 21–29
75. Tekalp, A. M., Erdem, A. T.: Higher-Order Spectrum Factorization in One and
Two Dimensions with Applications in Signal Modeling and Nonminimum Phase
System Identification. IEEE Trans. Acoust., Speech, Signal Proc. 37 (1989)
1537–1549
76. Thomson, M. M.: A new method for determining the vocal tract transfer function
and its excitation from voiced speech. Proc. IEEE Int. Conf. Acoustics, Speech and
Signal Processing. 2 (1992) 23–26
77. Ting, Y., T., Childers, D. G.: Speech Analysis using the Weighted Recursive
Least Squares Algorithm with a Variable Forgetting Factor. Proc. IEEE Int. Conf.
Acoustics, Speech and Signal Processing. 1 (1990) 389–392
78. Tremain, T. E.: The government standard linear predictive coding algorithm: LPC-
10. Speech Technol. 1982 40–49
79. Veeneman, D. E., BeMent, S. L.: Automatic Glottal Inverse Filtering from Speech
and Electroglottographic Signals. IEEE Trans. Acoust., Speech, Signal Proc. 33
(1985) 369–377
80. Walker, J.: Application of the bispectrum to glottal pulse analysis. Proc. NoLisp’03.
(2003)
81. Wong, D. Y., Markel, J. D., Gray, A. H.: Least squares glottal inverse filtering
from the acoustic speech waveform. IEEE Trans. Acoust., Speech, Signal Proc. 27
(1979) 350–355
Rahmonic Analysis of Signal Regularity in Synthesized
and Human Voice
Peter J. Murphy
1 Introduction
Aperiodicity of the voice signal can result from jitter, shimmer, aspiration noise,
nonlinear phenomena, inter-period glottal waveshape differences, non-stationarity of
the vocal tract or some combination of these attributes. An index for representing the
degree of aperiodicity of the voice signal is useful for non-invasive analysis of voice
disorders. A popular measurement for extracting this information is the harmonics-to-
noise ratio (HNR); it provides an indication of the overall periodicity of the voice
signal. Specifically, it quantifies the ratio between the periodic and aperiodic compo-
nents in the signal. A reliable method for estimating the degree of abnormality in the
acoustic waveform of the human voice is important for effective evaluation and man-
agement of voice pathology.
A number of temporal and spectral based methods have been employed for HNR
estimation. However, pitch synchronous time-domain methods for HNR estimation
are problematic because of the difficulty in estimating the fundamental period for
voiced speech (especially for pathological conditions), while frequency-domain
methods encounter the problem of estimating the noise level at harmonic locations.
Cepstral techniques have been introduced to supply noise estimates at all frequency
locations in the spectrum and/or to avoid the need for accurate fundamental frequency
(f0) or harmonic estimation. These methods have employed the use of either the low
quefrency cepstral coefficients or the rahmonic amplitude corresponding to the fun-
damental frequency.
In this chapter, two existing low quefrency cepstral-based noise estimation tech-
niques [1-2] are briefly reviewed. It is shown that the techniques can provide a tradi-
tional HNR measure though they appear to have originally been developed to supply
an average of the dB HNRs calculated over specific frequency bands. In the high
quefrencies, the cepstrum comprises prominent peaks, termed rahmonics, spaced
at the fundamental period and its multiples. The first rahmonic amplitude has been
employed to assess voice quality [3] and pathological voice [4]. A specific measure of
the first rahmonic energy, termed cepstral peak prominence (CPP) has been used on
real speech by a number of investigators [3-5]. Presently it is tested on synthesis data.
In addition, a more recent method [6] employing all rahmonics in the cepstrum is
tested against the same synthesis data. The performance of both measures in terms of
separating a set of patient and normal data is also assessed.
Representative time domain HNR estimation techniques are reported in, for example,
[7] and [8]. In general, these methods calculate the harmonics-to-noise ratio by first
computing the average waveform of a single period of speech through calculating the
mean of successive periods. The noise energy is then calculated as the mean squared
difference between the average waveform and the individual periods. The ratio of the
energy of the average waveform to the average variance gives the HNR (Eq.1).
Though very simple and less computationally intensive as compared to frequency and
cepstral based approaches, time domain HNR estimation requires accurate estimation
of the beginning and end of individual periods in the underlying speech waveform.
This requires the use of complex signal processing algorithms. Moreover, the pitch
boundaries are very sensitive to phase distortion, hence a high quality recording de-
vice with good amplitude and phase response is required for collecting the speech
data.
HNR = 10 \log_{10} \left( \frac{M \sum_{n=0}^{T} s_{avg}^{2}(n)}{\sum_{n=0}^{T} \sum_{i=1}^{M} \left( s_{i}(n) - s_{avg}(n) \right)^{2}} \right)    (1)
where M is the total number of fundamental periods, s_i(n) is the ith fundamental period (of length T) and s_avg(n) is the waveform averaged over the M fundamental periods at each point n in the cycle.
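As a concrete illustration of Eq. (1), the following minimal Python sketch computes the time-domain HNR from fundamental periods that are assumed to have already been segmented and length-normalised; the difficult period-boundary estimation discussed in the text is deliberately left out.

import numpy as np

def time_domain_hnr(periods):
    # periods: (M, T) array of M consecutive fundamental periods,
    # already segmented and resampled to a common length T
    M = periods.shape[0]
    s_avg = periods.mean(axis=0)                      # s_avg(n): the mean period
    harmonic = M * np.sum(s_avg ** 2)                 # M * sum_n s_avg(n)^2
    noise = np.sum((periods - s_avg[None, :]) ** 2)   # sum_n sum_i (s_i(n) - s_avg(n))^2
    return 10.0 * np.log10(harmonic / noise)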
HNR = 10 \log_{10} \left( \frac{\sum_{i=1}^{N/2} |S_{i}|^{2}}{\sum_{i=1}^{N/2} |N_{i}|^{2}} \right)    (2)
where |Si| represents harmonic amplitudes, and |Ni| is the noise estimate.
Cepstrum-based estimation starts from the source-filter model, in which voiced speech s(t) is the convolution of the excitation e(t) with the vocal tract response v(t):

s(t) = e(t) * v(t)    (3)

so that, in the frequency domain, the magnitude spectra multiply:

|S(f)| = |E(f)| × |V(f)|    (4)
Taking the logarithm changes the multiplicative components into additive components.
log|S(f)|=log|E(f)|+log|V(f)| (5)
It is noted (Fig.1(a)) that the vocal tract contributes a slow variation, while the pe-
riodicity of the source manifests itself as a fast variation in the log spectrum. Taking
the Fourier transform of the logarithmic power spectrum yields a prominent peak
corresponding to the high frequency source component and a broader peak corre-
sponding to the low frequency formant structure. To distinguish between the fre-
quency components in a temporal waveform and the frequency in the log spectrum
the term quefrency is used to describe the log spectrum “frequency”, while the first prominent cepstral peak is termed the first rahmonic (Fig.1(b)). In present-day processing the inverse Fourier transform of the log magnitude spectrum is generally taken to represent the cepstrum of voiced speech. The real cepstrum is defined as the inverse Fourier transform of the log of the magnitude spectrum:
\hat{C}(t) = \int_{-\infty}^{\infty} \log |S(f)| \, e^{j 2 \pi f t} \, df    (6)
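In discrete time, the real cepstrum of Eq. (6) reduces to an inverse FFT of the log magnitude spectrum. The following minimal sketch illustrates this; the Hanning window and FFT length are illustrative choices, not values prescribed by the chapter.

import numpy as np

def real_cepstrum(frame, n_fft=2048):
    # inverse FFT of the log magnitude spectrum; rahmonics appear at
    # quefrencies that are integer multiples of the fundamental period
    spectrum = np.fft.fft(frame * np.hanning(len(frame)), n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)        # small offset avoids log(0)
    return np.fft.ifft(log_mag).real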
Fig. 1. Schematic representation of (a) a Fourier spectrum of a periodic speech waveform, and
(b) its corresponding cepstrum with rahmonics spaced at integer multiples of T0, the fundamen-
tal period
Fig. 2. HNR estimation using de Krom [1] cepstral baseline technique using a window length
of 1024 sample points
A modification to the de Krom technique [1] is presented in [2]. Problems with the baseline fitting procedure are highlighted, and a way to avoid these problems by calculating the energy and noise estimates at harmonic locations only is proposed. In addition, rather than comb-liftering the rahmonics, the cepstrum is low-pass filtered to provide a smoother baseline (Fig.3).
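The sketch below is only a schematic illustration, in the spirit of [2], of how a low-pass liftered cepstral baseline can serve as a noise estimate at harmonic locations for an HNR computation; the lifter cut-off and the harmonic-bin selection are simplifying assumptions, and this is not the exact algorithm of [1] or [2].

import numpy as np

def cepstral_baseline_hnr(frame, fs, f0):
    # low-pass liftering of the cepstrum gives a smooth log-spectral baseline,
    # taken here as the noise estimate at the harmonic bins only
    n = len(frame)
    spec = np.fft.fft(frame * np.hanning(n))
    log_mag = np.log10(np.abs(spec) + 1e-12)
    ceps = np.fft.ifft(log_mag).real

    cut = int(0.5 * fs / f0)               # keep quefrencies below half a period
    lifter = np.zeros(n)
    lifter[:cut] = 1.0
    lifter[-cut + 1:] = 1.0                # symmetric part of the (real) cepstrum
    baseline = np.fft.fft(ceps * lifter).real

    bins = (np.arange(f0, fs / 2.0, f0) * n / fs).round().astype(int)
    harmonic_energy = np.sum(10.0 ** (2 * log_mag[bins]))
    noise_energy = np.sum(10.0 ** (2 * baseline[bins]))
    return 10.0 * np.log10(harmonic_energy / noise_energy)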
Fig. 3. HNR estimation (Qi and Hillman [2]) cepstral baseline technique (window length 3200
points)
In a recent study [28], the basis behind, and the accuracies of, the cepstral-based HNR estimation techniques in [1] and [2] were evaluated, and a new cepstral-based HNR method was proposed and tested. HNR is overestimated in [1] and underestimated in [2], due to underestimation and overestimation of the noise baseline, respectively (as is evident in Fig.2 and Fig.3).
A new technique described in [27] employed a harmonic pre-emphasis to over-
come the problem of baseline fitting to the noise floor (Fig.4). The noise baseline
(which is equivalent to a traditional vocal tract transfer function estimate via the cep-
strum) is influenced by the glottal source excited vocal tract and by the noise excited
vocal tract. The liftered spectral baseline does not rest on the actual noise level but
interpolates the harmonic and between-harmonic estimates and, hence, resides some-
where between the noise and harmonic levels. As the window length increases, the
contribution of harmonic frequencies to the cepstral baseline estimate decreases.
However, the glottal source still provides a bias in the estimate. To remove the influ-
ence of the source, pre-emphasis is applied to the harmonics for the voiced speech
signals (i.e. noiseless signals). Noise-free harmonics are approximated through perio-
dogram averaging [28].
A pre-emphasis filter,

h(z) = 1 - 0.97 z^{-1}    (7)
is applied to these estimates in the frequency domain by multiplying each harmonic
value by the appropriate pre-emphasis factor.
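A short sketch of this step, assuming the noise-free harmonic amplitude estimates (e.g. from periodogram averaging [28]) are already available; the helper name and its arguments are illustrative.

import numpy as np

def preemphasise_harmonics(harmonic_amps, f0, fs):
    # multiply each harmonic amplitude by |H(e^{jw})| of h(z) = 1 - 0.97 z^-1
    # evaluated at that harmonic's frequency (Eq. 7)
    k = np.arange(1, len(harmonic_amps) + 1)
    w = 2.0 * np.pi * k * f0 / fs
    factor = np.abs(1.0 - 0.97 * np.exp(-1j * w))
    return np.asarray(harmonic_amps) * factor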
In the review [27] of the methods described in [1] and [2], the measures are taken at face value as estimates of HNR, but it appears that what the authors may have intended was not in fact HNR estimation in the traditional sense but rather an average of the dB HNRs at each harmonic location, which is quite different. This interpretation has a close relationship to the index extracted from the high quefrencies, as will be shown.
Cepstral estimation of irregularity in voice signals has also focussed on the high
quefrency region of the cepstrum [3, 29-31]. In particular the amplitude of the first
rahmonic is used for voice quality assessment. In [29], the height of the first cepstral
peak (first rahmonic) is used as an indicator of periodicity and the location of this
peak on the quefrency axis determines the pitch period. Objective assessment is re-
ported through using the method and a call for further studies is made. The height of
the first cepstral peak is also estimated in [3], using a normalisation procedure, in
which pitch tracking errors were not checked for, and the resulting index is found to
be a strong predictor (correlation coefficient 0.84) of breathiness. In [4] it is reported
that the main cepstral peak integrates both high frequency noise and aperiodicity
aspects of pathological voice and hence the measure shows promise for quantifying
improvements in voice quality following phonosurgery. The measure is found to be a superior indicator of voice quality when compared against other acoustic indices, and CPP has since been used by a number of researchers [3-5].
3 Method
A one second sample of voiced speech is used and the following two methodological
procedures are followed:
Fig. 5. Estimation of cepstral peak prominence (CPP). The distance from the regression line to the prominent peak indicates CPP.
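A hedged sketch of the CPP computation depicted in Fig. 5: the dB cepstrum is computed, a straight line is regressed over a quefrency span, and CPP is the height of the dominant rahmonic above that line. The dB convention, regression span and normalisation details vary across published implementations, so the choices below are illustrative only.

import numpy as np

def cepstral_peak_prominence(frame, fs, f0_range=(60.0, 330.0)):
    n = len(frame)
    spec = np.fft.fft(frame * np.hamming(n))
    ceps = np.fft.ifft(np.log10(np.abs(spec) + 1e-12)).real
    ceps_db = 20.0 * np.log10(np.abs(ceps) + 1e-12)

    lo = int(fs / f0_range[1])                    # shortest expected period (bins)
    hi = int(fs / f0_range[0])                    # longest expected period (bins)
    quef = np.arange(n) / fs                      # quefrency axis in seconds
    peak = lo + int(np.argmax(ceps_db[lo:hi]))    # dominant rahmonic location

    slope, intercept = np.polyfit(quef[lo:hi], ceps_db[lo:hi], 1)
    regression_at_peak = slope * quef[peak] + intercept
    return ceps_db[peak] - regression_at_peak     # height above the regression line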
Fig. 6. Estimation of sum of rahmonic amplitudes (SRA). Rahmonic peaks are located and
summed (one sided cepstrum up to 1024 points) to produce SRA.
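A sketch of the SRA computation outlined in Fig. 6, assuming the fundamental frequency of the frame is known; the FFT length and the search half-width around each expected rahmonic are illustrative choices rather than values fixed by the chapter.

import numpy as np

def sum_of_rahmonic_amplitudes(frame, fs, f0, n_fft=2048, max_bin=1024, half_width=3):
    # locate the cepstral peak near every integer multiple of the fundamental
    # period (one-sided cepstrum up to max_bin points) and sum the amplitudes
    spec = np.fft.fft(frame * np.hanning(len(frame)), n_fft)
    ceps = np.fft.ifft(np.log10(np.abs(spec) + 1e-12)).real

    t0 = fs / f0                                   # fundamental period in samples
    sra, k = 0.0, 1
    while k * t0 < max_bin:
        centre = int(round(k * t0))
        sra += ceps[centre - half_width: centre + half_width + 1].max()
        k += 1
    return sra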
4 Analysis
A sequence of glottal pulses (type C) [33] was used as input into a delay line digital
filter, where the filter coefficients were obtained based on area function data for the
vowel /AH/ and a reflection coefficient at the lip end of 0.71 [34]. Radiation at the lips was modelled by the first-order difference equation R(z) = 1 - z^{-1}. The sampling
frequency for the synthesis was 10 kHz.
Aperiodicity was introduced into the waveform by altering the impulse train or the glottal input directly for a 110 Hz waveform. Random shimmer was introduced by adding a random variable gain factor (of a given std. dev. ranging from 1 to 32%) to the amplitude of the pitch period impulse train prior to convolution with the glottal pulse. Jitter was introduced through scaling of the glottal periods. Two variations of jitter were implemented, cyclic jitter (1 to 6%) and random jitter (std. dev. 1 to 6%); for the former, alternate periods are equal, while in the latter the variation is random. Random additive noise was introduced by multiplying the glottal pulse by a random noise generator arranged to give signal-dependent additive noise of a user-specified variance (std. dev. 1 to 32%). Further files were also created for three levels of additive noise for signals beginning at 80 Hz and increasing, in approximately equi-spaced steps of about 60 Hz, up to 350 Hz (six fundamental frequencies in all).
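The chapter's full synthesis (Rosenberg pulse [33], vocal tract filtering [34] and signal-dependent noise) is not reproduced here; the sketch below only illustrates how random shimmer and the two jitter variants can be imposed on the excitation impulse train before convolution with a glottal pulse. Perturbation levels are given as fractional standard deviations (e.g. 0.02 for 2%), and the function name is hypothetical.

import numpy as np

def perturbed_impulse_train(f0, fs, dur, shimmer_std=0.0, jitter_std=0.0,
                            cyclic=False, rng=None):
    rng = rng or np.random.default_rng(0)
    t0 = fs / f0                               # nominal period in samples
    train = np.zeros(int(dur * fs) + 1)
    pos, k = 0.0, 0
    while pos < dur * fs:
        amp = 1.0 + shimmer_std * rng.standard_normal()   # random shimmer
        train[int(round(pos))] = amp
        if cyclic:                             # cyclic jitter: alternate periods
            dev = jitter_std if k % 2 == 0 else -jitter_std
        else:                                  # random jitter
            dev = jitter_std * rng.standard_normal()
        pos += t0 * (1.0 + dev)
        k += 1
    return train                               # convolve with a glottal pulse next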
Recordings were made of the participants phonating the sustained vowel /AH/ and
uttering the phonetically balanced sentence “Joe took father’s shoe bench out” at their
comfortable pitch and loudness level. All recordings were made using a Tandberg
audio recorder (AT 771, Audio Tutor Educational, Japan) prior to the participants
(thirteen in all) undergoing laryngovideostroboscopic (LVS, Endo-Stroboskop,
Atmos, Germany) evaluation at the outpatient’s ENT clinic in Beaumont Hospital,
Dublin.
Along with the results of the stroboscopic examination, full medical details regard-
ing the vocal pathology were taken for each patient as well as any further diagnostic
comments at the time of assessment. The audio (and video) data from the strobo-
scopic evaluation were recorded using SONY SVHS (E-180, France) cassettes.
Twelve normals were subsequently recorded under the same conditions. All re-
cordings were low-pass filtered at 4 kHz and digitized at a sampling rate of 10 kHz
into a data acquisition expansion card (Integrated Measurement Systems, PCL-814,
Southampton, UK) with 14-bit resolution and stored for subsequent analysis.
Points (1) to (7) in the second list under Method provide the outline of the SRA algorithm used for analysis. Another program implements Hillenbrand's normalisation scheme (CPP, first list under Method). Band-pass and high-pass versions (of CPP), using a 250th-order finite impulse response filter, are also coded. In the SRA analysis, a window length of 2048 points is used.
5 Results
In order to test the analysis techniques in a systematic manner they are firstly applied
to the synthetically generated voice signals. The methods are tested with six levels of
noise, jitter (both random and cyclic) and shimmer for signals at 110 Hz. The methods
are also tested with three different noise levels for six different fundamental frequen-
cies ranging from 80 Hz to 350 Hz and therefore covering the extremes of the ex-
pected vocal pitch range in modal register.
Fig. 7. Cepstral Peak Prominence (CPP) and filtered versions (CPPb, cepstrum applied to band-pass (2.5-3.5 kHz) filtered speech; CPPh, cepstrum applied to high-pass (2.5 kHz) filtered speech) vs glottal noise (110 Hz signal)
Fig. 8. Cepstral Peak Prominence (CPP) and filtered versions (CPPb, cepstrum applied to band-pass (2.5-3.5 kHz) filtered speech; CPPh, cepstrum applied to high-pass (2.5 kHz) filtered speech) vs shimmer (110 Hz signal)
Fig. 9. Cepstral Peak Prominence (CPP) and filtered versions (CPPb, cepstrum applied to band-pass (2.5-3.5 kHz) filtered speech; CPPh, cepstrum applied to high-pass (2.5 kHz) filtered speech) vs random jitter (110 Hz signal)
Fig. 10. Cepstral Peak Prominence (CPP) and filtered versions (CPPb, cepstrum applied to band-pass (2.5-3.5 kHz) filtered speech; CPPh, cepstrum applied to high-pass (2.5 kHz) filtered speech) vs cyclic jitter (110 Hz signal)
Fig. 11. Cepstral Peak Prominence (CPP) vs fundamental frequency for three levels of glottal
noise
Fig. 12. Sum of rahmonic amplitudes (SRA) vs glottal noise for 110 Hz signal (vowel /AH/)
Fig. 13. Sum of rahmonic amplitudes (SRA) vs shimmer for 110 Hz signal (vowel /AH/)
The corresponding spectrum for speech signals containing additive noise shows early harmonic structure that quickly deteriorates with increasing frequency. CPP decreases slightly as random jitter increases (Fig.9). Band-pass or high-pass filtering does not provide any new information and occasionally the results are less sensitive or less reliable (Fig.7 and Fig.9). CPP is somewhat invariant across f0, representing increased noise levels with decreased CPP values (Fig.11).
Sum of rahmonic amplitudes (SRA) is tested against the same synthesis data (six
levels of additive noise, shimmer, random jitter and cyclic jitter).
The response of SRA to the aperiodicity measures is shown in Fig.12-Fig.14. The
measure responds similarly to cyclic jitter and shimmer, i.e. it is relatively (compared to noise and random jitter) less sensitive to these aperiodicities. The SRA measure is
more responsive to additive noise and random jitter (Fig.12 and Fig.14). SRA is not invariant across f0; however, it represents increased noise levels with decreased SRA values (Fig.15).
Fig. 14. Sum of rahmonic amplitudes (SRA) vs jitter for 110 Hz signal (vowel /AH/)
Fig. 15. Sum of rahmonic amplitudes (SRA) vs fundamental frequency for three levels of glot-
tal noise (vowel /AH/)
Both CPP (including filtered versions) and SRA are evaluated against the patient
(13)/normal (12) data sets. All data were rated by two speech and language thera-
pists (individually) based on a modified GRBAS scale. Two of the normals
Fig. 16. Histogram of the estimated CPP values (a) CPP (cepstrum applied to unfiltered speech)
(b) CPPh (cepstrum applied to high-passed (2.5 kHz) filtered speech) and (c) CPPb (cepstrum
applied to band-passed (2.5-3.5 kHz) filtered speech) for normal and disordered corpus
Fig. 17. Histogram of the estimated SRA values for normal and disordered corpus
6 Discussion
The CPP and SRA measures in general decrease as noise or perturbation increases.
CPP and SRA decrease more noticeably for noise and random jitter than for shimmer
and cyclic jitter. This is explainable through considering the spectral consequences of
these underlying aperiodicities. Signals containing shimmer and cyclic jitter retain
strong harmonic structure in the spectrum whereas the higher harmonics are reduced
when random noise or random jitter is present. However, the response of CPP and
SRA to the aperiodicities is not exactly the same.
CPP appears less sensitive than SRA.
CPP differs from SRA in four distinct ways:
1. The window lengths differ.
2. Only the first rahmonic is used in the CPP evaluation.
3. The CPP results from a normalization procedure that involves taking a logarithm of the cepstral amplitudes and fitting a noise floor (regression line).
4. In calculating CPP, an average of logarithmic values is taken.
Therefore, the apparent reduced sensitivity of CPP may be conjectured to be due to any one of these four variations. However, it is hypothesized that, with a window length of 52 ms, harmonic structure should be present in the signal. Further, it is expected that the heights of all rahmonics vary in a similar fashion to that of the first rahmonic alone. The reduced sensitivity, therefore, is conjectured to be due to the fact that the normalization involves taking dB values of the first cepstral rahmonic, which has previously been stated to be representative of the mean of dB ratios (that the amplitude of the first rahmonic represents a mean of the dB HNRs - at each harmonic/between-harmonic location - in the log spectrum has been formally
validated recently [36]). Therefore, taking the logarithm a second time has com-
pressed the cepstral index, reducing its sensitivity. Band-pass and high-pass versions
of CPP did not reflect the aperiodicity increases.
It is reported [3] that a strong correlation existed between CPP and perceived
breathiness (correlation coefficient 0.84). The reduced rahmonic amplitude recorded
for breathy voices may not be entirely due to aperiodicity of the time domain wave-
form. One of the main indicators of breathiness reported in the literature is increased
first harmonic amplitude [13] along with aspiration noise (aperiodicity). Each acoustic
parameter (increased first harmonic amplitude and increased aspiration noise) has
direct (and the same) aerodynamic and physiological correlations in the form of in-
creased volume velocity and more abducted vocal folds respectively. Therefore, in the
case of breathy signals the reduced amplitude of the cepstral peak may also be due, in
part, to the increase in the amplitude of the first harmonic. To understand this, it is
observed that periodicity in the frequency domain is offset by this increase in ampli-
tude of the first harmonic, leading to a less obvious ‘separation of the log’. It is noted
that periodicity in one domain does not necessarily indicate periodicity in another. On
the contrary, a sharp peak in one domain corresponds to a more broadened (sinusoi-
dal) event in the other domain (in fact, this is the reverse of the cepstrum). Therefore,
perfect periodicity (sinusoid) can exist in the time-domain and yet no cepstral peak is
found at the expected quefrency location.
In summary, SRA provides dependable estimates of noise levels and it is also af-
fected by jitter. The CPP measure also reflects the deterioration in harmonic structure
quite well but the trend is less reliable. The filtered versions are unsuccessful in follow-
ing the trends of increased aperiodicity/reduced magnitude of acoustic index. All meth-
ods were applied to the patient data in an attempt to separate the patients from the
normals (Fig.16 and Fig.17). The CPP method shows good ability in differentiating between the patient/normal data sets, with its filtered versions showing no discriminatory ability. However, SRA gives the best overall discrimination, being statistically significant at the 5% level (one-tailed, equal-variance, two-sample Student's t-test).
7 Conclusion
Two high–quefrency acoustic indices, cepstral peak prominence (CPP) and sum of
rahmonic amplitudes (SRA) have been implemented and evaluated using synthesized
data and a set of patient and normal productions of the vowel /AH/. The reduction in
the ability to ‘separate the log’ when signals become more dominantly sinusoidal (i.e.
reduced richness of harmonics) has been proposed as a contributory factor for the
success of rahmonic amplitude(s) in differentiating between patient and normal data
sets and for the high correlation with perceived breathiness ratings. In this sense the
high quefrency cepstral measures may comprise an extra dimension not captured in a
traditional HNR estimate. SRA has been shown to be a potentially useful indicator of
vocal pathology, reflecting additive noise levels and discriminating between a set of
13 patients with varying vocal pathologies and a group of 12 ‘normals’ with statistical
significance. CPP is shown to be relatively f0-independent; however, the index appears to be compressed when compared against SRA. Band-passing the original
time-domain signals prior to extracting the cepstral coefficients results in less useful
acoustic indices of voice quality. Future work will examine the benefit of the normalisation in CPP and will explore, in a quantitative fashion, the relationship between noise and perturbation levels and the measures CPP, first rahmonic amplitude and SRA. In
addition, aperiodicities due to waveshape change, asymmetry or nonlinearities in
vocal fold dynamics and non-stationarity of the vocal tract need to be examined.
Acknowledgements
Partial support for this work is provided through an Enterprise Ireland Research In-
novation Fund 2002/RIF/037 and a Health Research Board Grant 01/1995. The au-
thor wishes to thank Professor Michael Walsh and Dr. Michael Colreavy, ENT,
Beaumont Hospital, Dublin for examination of voice patients and Dr. Kevin McGui-
gan, Royal College of Surgeons in Ireland, Dublin and Yvonne Fitzmaurice, Antonio
Hussey and Jenny Robertson, SLT, Beaumont Hospital, Dublin for participating in
this work and/or performing perceptual ratings of the voice recordings.
References
1. de Krom, G.: A cepstrum based technique for determining a harmonics-to-noise ratio in
speech signals. J. Speech Hear Res. 36 (1993) 254-266
2. Qi, Y. and Hillman, R.E.: Temporal and spectral estimations of harmonics-to-noise ratio in
human voice signals. J. Acoust. Soc. Am. 102(1) (1997) 537-543
3. Hillenbrand, J., Cleveland, R.A. and Erickson, R.L.: Acoustic correlates of breathy vocal
quality. J. Speech and Hear. Res. 37 (1994) 769-777
4. Dejonckere, P. and Wieneke, G.H.: Spectral, cepstral and aperiodicity characteristics of
pathological voice before and after phonosurgical treatment. Clinical Linguistics and Pho-
netics 8(2) (1994) 161-169
5. Herman-Ackah, Y., Michael, D.D. and Goding, Jr., G.S.: The relationship between cep-
stral peak prominence and selected parameters of dysphonia. J. Voice. 16(1) (2002) 20-27
6. Murphy, P.J.: A cepstrum-based harmonics-to-noise ratio in voice signals. Proceedings In-
ternational Conference on Spoken Language Processing, Beijing, China (2000) 672-675
7. Yumoto, E., Gould, W. J., and Baer, T.: Harmonics-to-noise ratio as an index of the degree
of hoarseness. J. Acoust. Soc. Am. 71 (1982) 1544–1549
8. Kasuya, Y.: An adaptive comb filtering method as applied to acoustic analysis of patho-
logical voice. Proceedings IEEE ICASSP, Tokyo (1986) 669–672
9. Kasuya, H., and Ando, Y. Analysis, synthesis and perception of breathy voice. In: Gauffin,
J., Hammarberg, B. (eds.): Vocal Fold Physiology: Acoustic, Perceptual and Physiologic
Aspects of Voice Mechanisms, Singular Publishing Group, San Diego (1991) 251-258
10. Imaizumi, S.: Acoustic measurement of pathological voice qualities for medical purposes.
Proceedings IEEE ICASSP, Tokyo (1986) 677–680
11. Klatt, D., Klatt, L.: Analysis, synthesis, and perception of voice quality variations among
female and male talkers. J. Acoust. Soc. Amer., 87 (1991) 820-857
12. Qi, Y.: Time normalization in voice analysis. J. Acoust. Soc. Am. 92 (1992) 1569–1576.
13. Ladefoged, P., and Antonanzas-Barroso, N.: Computer measures of breathy voice quality.
UCLA Working Papers in Phonetics 61 (1985) 79–86
14. Kitajima, K.: Quantitative evaluation of the noise level in the pathologic voice, Folia Pho-
niatr. 3 (1981) 145-148
15. Klingholtz, M., and Martin, F.: Quantitative spectral evaluation of shimmer and jitter, J.
Speech Hear. Res. 28 (1985) 169–174
16. Kasuya, H., Ogawa, S., Mashima, K., and Ebihara, S.: Normalized noise energy as an
acoustic measure to evaluate pathologic voice, J. Acoust. Soc. Am. 80 (1986) 1329–1334
17. Kasuya, H., and Endo, Y.: Acoustic analysis, conversion, and synthesis of the pathological
voice. In: Fujimura, O., Hirano, M. (eds.): Vocal Fold Physiology: Voice Quality Control,
Singular Publishing Group, San Diego, (1995) 305–320
18. Kojima, H., Gould, W. J., Lambiase, A., and Isshiki, N.: Computer analysis of hoarseness,
Acta Oto-Laryngol. 89 (1980) 547–554
19. Muta, H., Baer, T., Wagatsuma, K., Muraoka, T., and Fukuda, H.: A pitch synchronous
analysis of hoarseness in running speech. J. Acoust. Soc. Am. 84 (1988) 1292–1301
20. Hiraoka, N., Kitazoe, Y., Ueta, H., Tanaka, S., and Tanabe, M.: Harmonic intensity analy-
sis of normal and hoarse voices, J. Acoust. Soc. Am. 76 (1984) 1648–1651
21. Qi, Y., Weinberg, B., Bi, N., and Hess, W. J.: Minimizing the effect of period determina-
tion on the computation of amplitude perturbation in voice, J. Acoust. Soc. Am. 97 (1995)
2525–2532
22. Michaelis, D., Gramss, T., and Strube, H. W.: Glottal to noise excitation ratio-a new
measure for describing pathological voices, Acust. Acta Acust. 83 (1997) 700–706
23. Murphy, P.J.: Perturbation-free measurement of the harmonics-to-noise ratio in speech
signals using pitch-synchronous harmonic analysis, J. Acoust. Soc. Am. 105(5) (1999)
2866–2881
24. Manfredi, C., Iadanza, E., Dori, F. and Dubini, S.: Hoarse voice denoising for real-time
DSP implementation: continuous speech assessment. Models and analysis of vocal emis-
sions for biomedical applications:3rd International workshop, Firenze, Italy (2003)
25. Noll, A.M.: Cepstrum pitch determination. J. Acoust. Soc. Am. 41 (1967) 293-309
26. Schafer, R.W. and Rabiner, L.R.: System for automatic formant analysis of voiced speech,
J. Acoust. Soc. Am. 47 (1970) 634-648
27. Murphy, P.J. and Akande, O.: Cepstrum-based harmonics-to-noise ratio measurement in
voiced speech. In: Chollet, G., Esposito, A., Faundez-Zanuy, M., Marinaro, M. (eds.):
Nonlinear Speech Modeling and Applications. Lecture notes in Artificial Intelligence, Vol.
3445. Springer Verlag, Berlin Heidelberg New York (2005) 119-218.
28. Murphy, P.J.: Averaged modified periodogram analysis of aperiodic voice signals. Pro-
ceedings Irish Signals and Systems Conference, Dublin (2000) 266-271
29. Koike, Y.: Cepstrum analysis of pathological voices. J. Phonetics 14 (1986) 501-507
30. Koike, Y. and Kohda, J.: The effect of vocal fold surgery on the speech cepstrum. In:
Gaufin, J., Hammarberg, B. (eds.): Vocal fold physiology: Acoustic, perceptual and physi-
ologic aspects of voice mechanisms. Singular, San Diego (1991) 259-264
31. Yegnanarayana, B., d’Alessandro, C. and Darsinos, V.: An iterative algorithm for decom-
position of speech signals into periodic and aperiodic components. IEEE Trans. Speech
and Audio Processing 6(1) (1998) 1-11.
32. Awan, S. and Roy, N.: Toward the development of an objective index of dysphonia sever-
ity: A four-factor acoustic model. Clinical Linguistics and Phonetics 20(1) (2005) 35-49
33. Rosenberg, A.: Effect of glottal pulse shape on the quality of natural vowels. J. Acoust.
Soc. Am. 49 (1971) 583-590
34. Rabiner, L. and Schafer, R.: Digital processing of speech signals. Prentice Hall, Engle-
wood Cliffs, NJ (1978)
35. Isshiki, N., Takeuchi, Y.: Factor analysis of hoarseness. Studia Phonologica 5 (1970) 37-44
36. Murphy, P.: On first rahmonic amplitude in the analysis of synthesized aperiodic voice
signals. J. Acoust. Soc. Am. 120(5) (2006) 2896-2907
Spectral Analysis of Speech Signals Using Chirp Group Delay

B. Bozkurt, T. Dutoit, and L. Couvreur
Abstract. This study presents chirp group delay processing techniques for
spectral analysis of speech signals. It is known that group delay processing is
potentially very useful for spectral analysis of speech signals. However, it is
also well known that group delay processing is difficult due to large spikes that
mask the formant structure. In this chapter, we first discuss the sources of
spikes on group delay functions, namely the zeros closely located to the unit
circle. We then propose processing of chirp group delay functions, i.e. group
delay functions computed on a circle other than the unit circle in z-plane. Chirp
group delay functions can be guaranteed to be spike-free if zero locations can
be controlled. The technique we use here for that purpose is to compute the zero-phased version of the signal, for which the zeros appear very close to (or on) the unit circle. The final representation obtained is named the chirp group delay of the zero-phased version of a signal (CGDZP). We demonstrate the use of CGDZP in
two applications: formant tracking and feature extraction for automatic speech
recognition (ASR). We show that high quality formant tracking can be
performed by simply picking peaks on CGDZP and CGDZP is potentially
useful for improving ASR performance.
Keywords: Phase processing, chirp group delay, group delay, ZZT, ASR feature extraction.
1 Introduction
The magnitude spectrum has been the preferred part of the Fourier transform (FT) spectrum in most speech processing methods, although it carries only part of the available information. One of the main reasons for this is the difficulty involved in phase processing. In speech processing research, phase processing is often cited as very difficult. Especially when linear models are used in the analysis of speech signals, the phase component is to a great extent left in the residual, the component left over from linear modeling.
By its nature, the phase spectrum is in a wrapped form, and the negative first derivative of its unwrapped version, the so-called group delay function, is generally preferred since it is easier to study and process. It has been shown by Yegnanarayana
and Murthy that group delay functions are potentially very useful for spectral analysis
since vocal tract resonance peaks appear with higher resolution than in the magnitude
spectrum [1], [2], [3]. However they also report that group delay functions suffer from
having many spikes and direct processing of group delay spectra is very difficult. In
various papers, they have presented ways to indirectly compute smoothed group delay
functions using cepstral transformations and smoothing.
This paper describes a similar approach but uses a new technique and a new
representation to perform spectral processing using group delay functions. Our
previous studies [4] showed that, for most speech frames, a ‘cloud’ of zeros/roots of the polynomial computed as the z-transform of the speech samples is located around the unit circle, resulting in many spikes in the group delay function. One way to
overcome this problem is to control/modify the location of the zeros/roots and also to
use an analysis circle other than the unit circle for group delay computation, i.e. to
compute chirp group delay (CGD) functions instead. We show in two applications
that chirp group delay processing is potentially very useful for spectral analysis of
speech signals.
In Section 2, we start by explaining the difficulties in group delay processing and
the methods proposed in literature to tackle those problems. In Section 3, we present
our alternative: the chirp group delay of the zero-phased version of the signal. Sections 4 and 5 are
dedicated to applications in formant tracking and feature estimation for automatic
speech recognition (ASR), respectively. In Section 6, we present our conclusions.
For a short-term discrete-time signal {x(n)}, n=0,1,…,N-1, the group delay function
(GDF) is defined as the negative first-order derivative of the discrete-time Fourier
transform phase function θ(ω),
GDF(\omega) = -\frac{d\theta(\omega)}{d\omega}    (1)
with θ(ω)= arg(X(ω)) and X(ω) being the Fourier transform of {x(n)}. Equation (1)
can also be expressed as:

GDF(\omega) = \frac{X_{R}(\omega) Y_{R}(\omega) + X_{I}(\omega) Y_{I}(\omega)}{|X(\omega)|^{2}}    (2)

where Y(ω) is defined as the Fourier transform of {nx(n)} [5]. The notations R and I
refer to real and imaginary parts, respectively. The advantage of this representation is
that explicit phase unwrapping operation, which is generally a problematic task, is not
necessary in order to compute the group delay function.
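A minimal sketch of this unwrapping-free computation of Eq. (2), using one FFT of the frame and one of its ramp-weighted version; the small constant added to the denominator is only there to avoid division by zero at bins where |X(ω)| vanishes.

import numpy as np

def group_delay(x, n_fft=1024):
    # GDF(w) = (X_R Y_R + X_I Y_I) / |X(w)|^2, with Y the FFT of n*x(n)
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.fft(x, n_fft)
    Y = np.fft.fft(n * x, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)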
Group delay analysis is known to be difficult because many large spikes are often
present [6]. This is simply explained by noting that the term |X(ω)|² in Equation (2)
can get very small at frequencies where there exist one or more zeros of the
z-transform of the signal x(n) very close to the unit circle [7].
The z-transform of {x(n)} can be written in factored form as

X(z) = x(0)\, z^{-(N-1)} \prod_{m=1}^{N-1} (z - Z_{m})    (3)

where Z_m are the zeros/roots of the z-transform polynomial¹ X(z). The Fourier transform, which is simply the z-transform computed on the unit circle, can be expressed as:
X(\omega) = x(0)\, e^{j\omega(-N+1)} \prod_{m=1}^{N-1} \left( e^{j\omega} - Z_{m} \right)    (4)
provided that x(0) is non-zero. Each factor in Equation (4) corresponds to a vector starting at Z_m and ending at e^{jω} in the z-plane. In practice, the Fourier transform is evaluated at frequencies regularly spaced along the unit circle, the so-called frequency bins. As illustrated in Fig. 1, the vector e^{jω} − Z_m changes its orientation drastically when going from one frequency bin to the next around the root Z_m. The change
increases as the root gets closer to the unit circle. Hence the group delay function, i.e.
the rate of change in the phase spectrum, is very high at frequency bins close to a
root/zero of the z-transform polynomial. It becomes ill-defined when the zero/root
coincides with a frequency bin. This results in spikes in the group delay function that
get larger as the roots come closer to the unit circle.
Fig. 1. Geometric interpretation for spikes in the group delay function of a signal at frequency locations on the unit circle
Due to the strong link between the position of zeros of the z-transform (ZZT) and
spikes on the group delay functions, a study of the ZZT patterns obtained from the
z-transform of short-term discrete-time speech signals would be very valuable. Such a
study, however, would require the analytical computation of root locations of high-order polynomials, which is impossible in general. The problem becomes even more complicated due to windowing in the time domain, which corresponds to term-wise coefficient multiplication of the corresponding polynomials in the z-domain. For this reason, the location of the roots can be derived analytically only for some elementary signals (e.g. power series, damped sinusoids, impulse trains). For more complex signals such
¹ The formulation in Equation (3) is referred to as the ZZT (zeros of the z-transform) representation [8]. It is a direct factorisation of the polynomial (i.e. root finding), not a low degree moving average (MA) modelling for the signal.
as speech, the analytical solution becomes intractable and the roots must be estimated numerically, which then allows their locations in the z-plane to be observed. At first
sight, it was almost impossible to get an intuition about where the zeros/roots could be
located on the z-plane for a given speech signal. However, some patterns can be
inferred for speech signals and have been presented in detail in [8]. Here we provide a
short review due to space limitations.
In Fig. 2, we present the source-filter model of voiced speech in various domains: the
first row is in the time domain, the second row is in the ZZT domain and the third row
is in the log-magnitude spectrum domain. In each domain, the different contributions
of the source-filter model of speech interact with each other through some operator:
convolution (*), union (U) and addition (+), respectively.
Consider the second row of Fig. 2. The ZZT pattern for the impulse train is such
that zeros are equally spaced on the unit circle with the exception that there exist gaps
at all harmonics of the fundamental frequency, which create the harmonic peaks on
the magnitude spectrum (third row of Fig. 2).
The ZZT representation of the differential glottal flow signal (LF model [9]),
which is shown in the second column of Fig. 2, contains two groups of zeros: a group of
zeros below R=1 (inside the unit circle) and a group of zeros above R=1 (outside the
unit circle) in polar coordinates. The group of zeros inside the unit circle is due to the
return phase of the glottal flow excitation waveform and the group outside the unit
circle is due to the first phase of the LF signal.
The zeros of the vocal tract filter response are mainly located inside the unit circle
due to the decreasing exponential shape of this signal and there are gaps for the
formant locations, which create formant spectral peaks. We observe a wing-like shape
for the ZZT pattern of the vocal tract response depending on the location of the
truncation point for the time-domain response (demonstrated in movie 2 on our demo
website [10]).
It is interesting to note here that the ZZT set of speech is just the union of ZZT sets
of its three components. This is due to the fact that the convolution operation in time-
domain corresponds to multiplication of the z-transform polynomials in the z-domain. What is interesting is that the ZZT of each component appears in a different area of the z-plane and affects the magnitude spectrum in relation to its distance from the unit circle. The closest zeros to the unit circle are the impulse train zeros and they
cause the spectral dips on the magnitude spectrum, which give rise to harmonic peaks.
Vocal tract zeros are the second closest set and the zero-gaps due to formants
contribute to the magnitude spectrum with formant peaks on the spectral envelope.
Differential glottal flow ZZT are further away from the unit circle and their
contribution to the magnitude spectrum is rather vague and distributed along the
frequency axis.
We have shown that most of the roots are located close to the unit circle (as if there were a ‘zero-cloud’ around the unit circle) and that they can be further decomposed into contributions of the impulse train, glottal flow and vocal tract filter components in the z-plane.
Fig. 2. ZZT and source-filter model of speech. Note that magnitude spectra added are in dB.
Fig. 3. Effects of zeros on the Fourier Transform of a signal: (a) Hanning windowed real speech
frame (phoneme /a/ in the word “party”), (b) zeros in polar coordinates, (c) magnitude spectrum
and (d) group delay function. The zeros close to the unit circle are superimposed on spectrum
plots to show the link between the zeros and the dips/spikes on the plots.
Yegnanarayana and Murthy proposed various methods [1], [2], [3] to remove these
spikes and perform formant tracking from the spike-free group delay function. The
common steps involved in their methods are: obtaining the magnitude spectrum for a
² It should be noted here that the ZZT of real speech signals hardly looks like the ZZT plot in Fig. 2. This is mainly due to windowing, window synchronization problems and the mismatches of the source-filter model with the actual speech signal. However, with appropriate windowing, similar patterns are obtained for speech signals [8].
short-time windowed speech frame, smoothing the magnitude spectrum via cepstral
liftering, and computing a smooth minimum-phase group delay from this representation through the cepstrum. Note that the resulting group delay function is a kind of smoothed
magnitude spectrum but the advantage of this representation compared to the
magnitude spectrum is that the formant peaks appear with better resolution [2]. The
authors propose formant tracking algorithms by picking peaks on this representation.
Other recent studies address the spike problem and propose group delay based
features: the modified group delay function (MODGDF) [7] and the product spectrum
(PS) [11]. These representations are obtained by modifying one term in the group
delay computation, which is supposed to be the main source of spikes in the group
delay function. These rather recent representations are further discussed in the
following sections.
MODGDF(\omega) = \frac{\tau_{p}(\omega)}{|\tau_{p}(\omega)|} \left| \tau_{p}(\omega) \right|^{\alpha}    (5)

with

\tau_{p}(\omega) = \frac{X_{R}(\omega) Y_{R}(\omega) + X_{I}(\omega) Y_{I}(\omega)}{|S(\omega)|^{2\gamma}}    (6)
³ In all tests/plots of this study, we have set the parameters as in [17], namely α=0.4 and γ=0.9.
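A hedged sketch of Equations (5)-(6): S(ω) is assumed here to be a cepstrally smoothed magnitude spectrum, and the lifter length below is an illustrative choice rather than a value taken from [7] or [17]; α and γ are set to the values quoted in footnote 3.

import numpy as np

def modgdf(x, n_fft=1024, alpha=0.4, gamma=0.9, n_lifter=8):
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.fft(x, n_fft)
    Y = np.fft.fft(n * x, n_fft)

    # cepstrally smoothed spectrum for the denominator of Eq. (6) (assumption)
    ceps = np.fft.ifft(np.log(np.abs(X) + 1e-12)).real
    ceps[n_lifter:-n_lifter] = 0.0
    S = np.exp(np.fft.fft(ceps).real)

    tau_p = (X.real * Y.real + X.imag * Y.imag) / (S ** (2.0 * gamma) + 1e-12)
    return np.sign(tau_p) * np.abs(tau_p) ** alpha     # Eq. (5)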
\tilde{X}(\omega) = X(z)\big|_{z=\rho e^{j\omega}} = \sum_{n=0}^{N-1} x(n) \left( \rho e^{j\omega} \right)^{-n} = \left| \tilde{X}(\omega) \right| e^{j\tilde{\theta}(\omega)}    (8)
where ρ is the radius of the analysis circle. Similarly to Equation (1), the chirp group
delay function CGD(ω) is defined as:
CGD(\omega) = -\frac{d\tilde{\theta}(\omega)}{d\omega}    (9)
Interestingly enough, a fast Fourier transform (FFT) can be used to compute X̃(ω) by re-writing Equation (8) as:
\tilde{X}(\omega) = \sum_{n=0}^{N-1} \left( x(n) \rho^{-n} \right) \left( e^{j\omega} \right)^{-n} = \sum_{n=0}^{N-1} \tilde{x}(n) \left( e^{j\omega} \right)^{-n}    (10)

i.e. as the FFT of the exponentially weighted sequence x̃(n) = x(n)ρ^{-n}.
⁴ The first presentation of the chirp group delay can be found in [22]. Initially the term “differential phase spectrum” was used; we then switched to “chirp group delay”, which is more accurate.
⁵ Our Matlab code for CGDZP computation is shared on [23].
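Equation (10) suggests a direct implementation: weight the sequence by ρ^(-n) and reuse the two-FFT group delay identity of Section 2. The sketch below also includes the zero-phasing step (inverse FFT of the magnitude spectrum) described in the text; it is a sketch under these assumptions and not necessarily identical to the authors' Matlab implementation referenced in footnote 5.

import numpy as np

def chirp_group_delay(x, rho=1.12, n_fft=2048):
    # group delay evaluated on the circle of radius rho: apply the rho**(-n)
    # weighting of Eq. (10), then reuse the unit-circle group delay identity
    x = np.asarray(x, dtype=float)
    n = np.arange(n_fft)
    w = np.zeros(n_fft)
    w[:min(len(x), n_fft)] = x[:n_fft]
    w = w * rho ** (-n.astype(float))
    X = np.fft.fft(w)
    Y = np.fft.fft(n * w)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)

def cgdzp(frame, rho=1.12, n_fft=2048):
    # zero-phased version of the frame: inverse FFT of its magnitude spectrum,
    # which places the zeros of the z-transform on (or very near) the unit circle
    zero_phased = np.fft.ifft(np.abs(np.fft.fft(frame, n_fft))).real
    return chirp_group_delay(zero_phased, rho=rho, n_fft=n_fft)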
Fig. 4. A 30 ms speech frame and its group delay function. The frame example is extracted from the noise-free utterance “mah_4625” of test set A of the AURORA-2 speech database [13] and corresponds to the vowel /i/ in the word “six”.
Fig. 5. Power spectrum (PowerS) and group delay representations (Product Spectrum, Modified Group Delay Function and Chirp Group Delay of Zero-Phased signal) for the speech signal frame in Fig. 3
Fig. 6. Spectrogram plots of a noise-free utterance. Only the first half of the file, “mah_4625a”,
that contains the digit utterance “4 6” is presented.
In Fig. 6, we also present spectrogram-like plots obtained using the same group
delay representations as well as the classical power spectrum. The formant tracks can
be well observed on all of the spectrograms except for MODGDF, and PS is again
very close to PowerS (as in Fig. 1 of [11]). These observations suggest that the
CGDZP has some potential for formant tracking and ASR feature extraction. In the
following sections we present our tests in these two applications.
Automatic tracking of the acoustic resonance frequencies of the vocal tract filter, the formant frequencies, has been an important speech analysis problem for a long time. Many algorithms have been proposed and they vary widely; the reader is referred to [8] for a list of some of the approaches. Moderate to good quality formant tracking has been achieved with most of the recently proposed techniques, and further improvement is necessary to increase the robustness of applications based on formant tracking.
Fig. 7. Formant tracking algorithm using peak picking on chirp group delay computed outside
the unit circle
Formant tracking can simply be achieved by tracking the peaks of the CGDZP of
constant frame size/shift windowed speech frames. The flow chart of the proposed
method is presented in Fig. 7. Note that the radius ρ of the analysis circle is
continuously adapted to properly track all formants.
Fig. 9. Formant tracking example: male speech, 1SX9.wav file from [14] corresponding to text
“where were you while we were away”. Upper plot: CGDZP-based formant tracker. Lower
plot: Praat and WaveSurfer formant trackers.
The smoothness and the resolution of the peaks of the chirp group delay vary with the radius of the analysis circle. Given a fixed number of formants to track, the ρ value is iteratively optimized by decrementing/incrementing it in steps of 0.01: if the number of peaks picked is higher than the number of formants to be tracked, ρ is incremented, and vice versa. For tracking five formants with a 30 ms frame-size analysis (at a 16000 Hz sampling frequency), the optimum value found for most of the examples is ρ=1.12, so this value is taken as the initial value for the iterative procedure. In Fig. 8, we present the histogram of results for the iteration of ρ on 2865 speech frames (1453 male, 1412 female). The initial value is set to ρ=1.12 and the limits of the iteration are [1.05, 1.25], where the number of formants is fixed to five. This figure shows that ρ=1.12 is an appropriate value for tracking five formant peaks. To reduce the computational time of the algorithm, one can remove the iteration block and set ρ=1.12 for a frame size of 30 ms with a 16000 Hz sampling frequency.
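The following hedged sketch assembles the procedure of Fig. 7 around the cgdzp() helper sketched earlier; the simple local-maximum test and the iteration safeguard are illustrative simplifications of the peak picking and ρ iteration described above.

import numpy as np

def track_formants(frame, n_formants=5, rho0=1.12, step=0.01,
                   rho_lim=(1.05, 1.25), fs=16000, n_fft=2048):
    rho = rho0
    for _ in range(40):                                   # safeguard on the search
        cgd = cgdzp(frame, rho=rho, n_fft=n_fft)[: n_fft // 2]
        peaks = np.where((cgd[1:-1] > cgd[:-2]) & (cgd[1:-1] > cgd[2:]))[0] + 1
        if len(peaks) == n_formants:
            break
        # too many peaks -> move the analysis circle further out, and vice versa
        rho = rho + step if len(peaks) > n_formants else rho - step
        rho = min(max(rho, rho_lim[0]), rho_lim[1])
    return peaks * fs / n_fft                             # formant frequencies in Hz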
4.2 Tests
is included in their procedures, which is not the case for the CGDZP based formant
tracker. The fact that high quality formant tracking can be performed on CGDZP with
a simple algorithm, as presented here, shows that CGDZP reveals clearly formant
peaks and can potentially be used in various other applications.
In this section, we investigate the possibility of using a group delay representation for ASR feature extraction. The idea that comes naturally to mind is to replace the power spectrum by a group delay function in an MFCC-like feature extraction algorithm. Two recent studies have already considered this approach using the modified group delay function [17] or the product spectrum [11], which are reviewed in Section 2. Here, we compare these two representations, as well as the standard power-based MFCC, with our proposed CGDZP. In Table 1, we list all the feature extractions that are considered in the ASR experiments in the next section.
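The chapter does not spell out the exact CGDZP-CC pipeline, so the following is only one plausible arrangement: the power spectrum of an MFCC-like front end is replaced by the CGDZP (computed with the cgdzp() sketch given earlier) and a DCT yields cepstral-like coefficients; a mel filterbank stage and delta features, which a complete front end would normally include, are omitted for brevity.

import numpy as np
from scipy.fftpack import dct

def cgdzp_cc(frame, n_coeffs=13, rho=1.12, n_fft=512):
    # CGDZP of the frame, followed by a DCT to decorrelate the representation
    cgd = cgdzp(frame, rho=rho, n_fft=n_fft)[: n_fft // 2]
    return dct(cgd, type=2, norm='ortho')[:n_coeffs]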
The ASR system considered in this work is based on the STRUT toolkit [18]. It relies on the hybrid Multi-Layer Perceptron / Hidden Markov Model (MLP/HMM) technology [19], where the phonemes of the language under consideration are modeled by HMMs whose observation state probabilities are estimated as the outputs of an MLP. Such an acoustic model is trained beforehand in a supervised fashion on a large database of phonetically segmented speech material and is naturally dependent on the nature of the extracted acoustic features. Therefore, an acoustic model was built for every feature extraction of Table 1.
The AURORA-2 database [13] was used in this work. It consists of connected
English digit utterances sampled at 8kHz. More exactly, we used the clean training
set, which contains 8440 noise-free utterances spoken by 110 male and female
speakers, for building our acoustic models. These models were evaluated on the test
set A. It has 4004 different noise-free utterances spoken by 104 other speakers. It also
contains the same utterances corrupted by four types of real-world noises (subway,
babble, car, exhibition hall) at various signal-to-noise ratios (SNR) ranging from
20dB to -5dB. During the recognition experiments, the decoder was constrained by a
lexicon reduced to the English digits and no grammar was applied.
Table 2 gives the word error rates (WER) for the ASR system tested with the
feature extractions described in Table 1. Errors are counted in terms of word
substitutions, deletions and insertions, and error rates are averaged over all noise
types. In Table 3, the results are also provided when combining MFCC feature
extraction with the features used in Table 2. The combination is simply performed by
taking a weighted geometric average of the output probabilities of the combined acoustic models:

p_{12} = p_{1}^{\lambda} \, p_{2}^{1-\lambda}

where p12, p1 and p2 denote the combined probability and the probabilities provided by the two combined acoustic models, respectively. The combination parameter λ takes its value in the range (0,1) and is optimized for every combination.
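A small sketch of this combination rule; the per-frame renormalisation is an added convenience not stated in the text, and p1 and p2 are assumed to be (frames x classes) arrays of MLP output probabilities.

import numpy as np

def combine_posteriors(p1, p2, lam):
    # weighted geometric average of the two models' output probabilities
    p12 = (p1 ** lam) * (p2 ** (1.0 - lam))
    return p12 / p12.sum(axis=-1, keepdims=True)   # renormalisation (assumption)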
The results show that CGDZP-CC outperforms the other representations. It should be noted, however, that for the computation of MODGDF-CC we used the same parameter
values as in [17], while the authors mention that the parameters need to be tuned for
the specific system and data. For this reason our comparison is limited in the sense
that only one set of parameters is used for comparison with MODGDF-CC.
One of our main targets in these experiments was to test whether a phase/group
delay representation carries complementary information to that of the power spectrum
in the framework of feature extraction for ASR systems. The word error rates for CGDZP-CC combined with MFCC are in all cases lower than the MFCC-only results in the first row of Table 2, except for the extreme noise setting SNR=-5dB. This demonstrates that CGDZP-CC feature extraction does indeed provide
complementary information to MFCC and is potentially useful for improving ASR
performance.
Table 1. Methods for ASR feature extraction based on the power spectrum or a group delay function. All methods are applied on 30 ms-long frames, shifted every 10 ms.

Table 2. ASR performance for various feature extractions on the AURORA-2 task. Results are given in terms of word error rate (WER) in percent.

Table 3. ASR performance for features combined with MFCC on the AURORA-2 task. Results are given in terms of word error rate (WER) in percent.

Noise robustness is definitely a sensitive issue, and neither MFCC nor the group delay representations are effectively robust to additive noise. The degradation of the recognition performance in the presence of noise is primarily due to the mismatch between the training conditions (clean speech) and the test conditions (noisy speech). Several approaches can be adopted to reduce these acoustic discrepancies [20], [21].
First, the speech signal can be captured with as little noise as possible by using spatially selective microphones or arrays of microphones. Further techniques can be applied to enhance the speech signal. These techniques concentrate on enhancing the amplitude spectrum and can hardly be generalized to the phase spectrum. Other techniques aim at adapting the acoustic models to noisy conditions and could clearly be used for group delay representations.
6 Conclusion
In this chapter, we have presented the CGDZP representation and demonstrated its
successful utilization in two speech processing problems: formant tracking and
feature extraction for speech recognition.
The main motivation for processing chirp group delay (CGD) of speech frames
computed on circles other than the unit circle is to get rid of spikes created by zeros of
their z-transform (ZZT) close to the unit circle, which actually mask formant peaks on
classical group delay functions. By both manipulating the zeros/roots (ZZT) by zero-
phasing and adjusting the analysis circle radius for CGD computation, we can
guarantee a certain distance between the zeros and the analysis circle. The choice of the circle radius is definitely a sensitive issue and its value should be tuned for every application. When the analysis circle is set too far away from the areas of the z-plane where zeros are located, the resolution of formant peaks becomes poorer (i.e. the CGD gets too smooth). As the analysis circle gets closer to those areas, the resolution of formant peaks gets higher, but so does the risk of having spurious spikes. For this reason, in our experiments with formant tracking and ASR feature estimation, we first searched for an optimum value for the radius (by incrementing the radius by 0.1 in the range [0.9, 2] and checking the system performance) and found that ρ=1.12 appears to be a good choice.
We have shown that formant tracking can be performed successfully by just picking peaks in the CGDZP representation. In addition, the speech recognition test results are
very promising and should be confirmed on more challenging speech recognition
tasks. Although the tests applied were limited, they are sufficient to demonstrate that
CGDZP is potentially a very useful function for spectral processing of speech signals.
References
1. Yegnanarayana, B., Duncan, G., Murthy, H. A.: Improving formant extraction from
speech using minimum-phase group delay spectra. Proc. of European Signal Processing
Conference (EUSIPCO). vol. 1, Sep. 5–8, Grenoble, France, (1988) 447–450
2. Murthy, H. A., Murthy, K. V., Yegnanarayana, B.: Formant extraction from phase using
weighted group delay function. Electronics Letters. vol. 25, no. 23, (1989) 1609–1611.
3. Murthy, H. A., Yegnanarayana, B.: Formant extraction from group delay function. Speech
Communication. vol. 10, no. 3, (1991) 209–221.
4. Bozkurt, B., Doval, B., D'Alessandro, C., Dutoit, T.: Appropriate windowing for group
delay analysis and roots of z-transform of speech signals. Proc. of European Signal
Processing Conference (EUSIPCO), Sep. 6–10, Vienna, Austria, (2004).
5. Oppenheim, A. V., Schafer, R. W., Buck, J. R.: Discrete-Time Signal Processing. Second
edition, Prentice-Hall, (1999).
6. Yegnanarayana, B., Saikia, D. K., Krishnan, T. R.: Significance of group delay functions
in signal reconstruction from spectral magnitude or phase. IEEE Trans. on Acoustics,
Speech and Signal Processing. vol. 32, no. 3, (1984) 610–623.
7. Hegde, R. M., Murthy H. A., Gadde, V. R.: The modified group delay feature: A new
spectral representation of speech. Proc. of International Conference on Spoken Language
Processing (ICSLP), Oct. 4–8, Jeju Island, Korea, (2004)
8. Bozkurt, B.: New spectral methods for analysis of source/filter characteristics of speech
signals. PhD Thesis, Faculté Polytechnique De Mons, Presses universitaires de Louvain,
ISBN: 2-87463-013-6, (2006).
9. Fant, G.: The LF-model revisited. Transformation and frequency domain analysis. Speech
Trans. Lab.Q.Rep., Royal Inst. of Tech. Stockholm, vol. 2-3, (1995) 121-156
10. Demo Page for Zeros of the Z-Transform (ZZT) Representation:
http://tcts.fpms.ac.be/demos/zzt.
11. Zhu, D., Paliwal, K. K.: Product of power spectrum and group delay function for speech
recognition. Proc. of International Conference on Acoustics, Speech and Signal Processing
(ICASSP), May 17–21, Montreal, Canada, (2004) 125–128
12. Rabiner, L. R., Schafer R. W., Rader, C. M.: The chirp z-transform algorithm and its
application. Bell System Tech. J. vol. 48, no. 5, (1969) 1249–1292
13. Hirsch, H. G., Pearce, D.: The AURORA experimental framework for the performance
evaluation of speech recognition Systems under noisy conditions. Proc. of ASR 2000, Sep.
18–20, Paris, France, (2000)
14. http://www.ldc.upenn.edu/readme_files/timit.readme.html
15. http://www.praat.org
16. http://www.speech.kth.se/wavesurfer/
17. Hegde, R. M., Murthy H. A., Gadde, V. R.: Continuous speech recognition using joint
features derived from the modified group delay function and MFCC. Proc. of International
Conference on Spoken Language Processing (ICSLP), Oct. 4–8, Jeju Island, Korea, (2004)
18. Boite, J.-M., Couvreur, L., Dupont, S., Ris, C.: Speech Training and Recognition Unified Tool (STRUT), http://tcts.fpms.ac.be/asr/project/strut
19. Bourlard, H., Morgan, N.: Connectionist Speech Recognition: A Hybrid Approach,
Kluwer Academic Publisher, (1994)
20. Gong, Y.: Speech recognition in noisy environments: a survey. Speech Communication.
vol. 16, no. 3, (1995) 261–291
21. Junqua, J. C.: Robust Speech Processing in Embedded Systems and PC Applications,
Kluwer Academic Publishers, (2000)
22. Bozkurt, B., Dutoit, T.: Mixed-phase speech modeling and formant estimation, using
differential phase spectrums. Proc. of ISCA ITRW VOQUAL. Aug, 21–24, (2003)
23. Introduction page for Chirp Group Delay processing:
http://tcts.fpms.ac.be/demos/zzt/cgd.html
24. Bozkurt, B., Doval, B., d’Alessandro, C., Dutoit, T.: Zeros of z-transform representation
with application to source-filter separation in speech. IEEE Signal Processing Letters.
vol. 12, no. 4, (2005) 344–347
25. Fant, G.: Acoustic Theory of Speech Production, Mouton and Co. Netherlands, (1960)
Towards Neurocomputational Speech and Sound Processing

(This work has been funded by NSERC and Université de Sherbrooke. S. Loiselle has been funded by FQRNT of Québec for the year 2006.)
1 Introduction
Speech recognition and separation of mixed signals is an important problem with
many applications in the context of audio processing. It can be used to assist
a robot in segregating multiple speakers, to ease the automatic transcription of
video via the audio tracks, to separate musical instruments before automatic transcription, to clean the signal and perform speech recognition, etc. The ideal
instrumental set-up is based on the use of an array of microphones during recording in order to obtain many audio channels. In that situation, very good separation can be obtained.
In many situations, only one channel is available to the audio engineer and no training data are readily available. The automatic separation, segregation of the sources and speech recognition are then much more difficult.
From the scientific literature, most of the proposed monophonic systems perform reasonably well on specific signals (generally voiced speech), but fail to efficiently segregate and recognize a broad range of signals if there are not sufficient training data to estimate the feature distributions.
In the present paper we present two speech processing systems based on an auditory-motivated approach. One system is a prototype speech recognizer that uses the sequence of spikes, as a recognition feature, in a network of spiking neurons. When compared with a Hidden Markov Model recognizer, preliminary results show that the approach has great potential for the recognition of transients and for situations where insufficient data are available to train the speech recognizer.
The second system shows that separation and segregation of simultaneous audio
sources can be performed by using the temporal correlation paradigm to group
auditory objects coming from the same source.
We know that, for source separation and speech recognition, an efficient and
suitable signal representation has first to be found. Ideally the representation
should be adapted to the problem at hand and need not be the same for both
systems (as the separation and recognition cues are different). We briefly review
some of the most important common features that should be preserved and
enhanced when analysing the speech signals.
Voiced Speech. With at least two interfering speakers and voiced speech, it
is observed that the separation is relatively easy – speakers with different fun-
damental frequencies – since spectral representations or auditory images exhibit
different regions with structures dominated by different pitch trajectories. Hence,
amplitude modulations of cochlear filter outputs (or modulation spectrograms)
are discriminative. In situations where speakers have similar pitches, the separa-
tion is difficult and features, like the phase, should be preserved by the analysis to
increase the discrimination. Most conventional source segregation or recognition
systems use an analysis that is effective, as long as speech segments under the
analysis window are relatively stationary and stable. Furthermore, these systems
need a sufficient amount of data to be trained.
with conventional speech analysis that are usually based on spectral and time-
averaged features. It is very likely that this kind of system would fail. Hence,
adaptive (or at least multi-features) and dynamic signal analysis is required.
So, Which Signal Representation? The question is still open, but the signal
representation should enhance speech structured features that ease discrimina-
tion and differentiation.
Based on the literature on auditory perception, it is possible to argue that the
auditory system generates a multi–dimensional spatio-temporal representation
of the one dimensional acoustic signal. One view of this can be found in the
remarkable work by Shamma and his team [1,2] with a recent work in source
separation [3].
In the first half of the paper (speech recognition application) we propose a
simple time sequence of 2D auditory images (cochlear channels – neuron thresh-
olds) or we use the time sequence of Shamma’s 3D multiscale analysis (tonotopic
frequency – temporal rate – modulation of the auditory spectrum). In the second
half of the paper we use 2D auditory image (AM modulation and spectrum of
cochlear envelopes) sequences and we treat them as videos.
2 Perceptive Approach
From physiology we know that the auditory system extracts simultaneous features
from the underlying signal, giving rise to simultaneous multiple representations of
speech. We also learn that fast and slow efferences can selectively enhance speech
representations in relation to the auditory environment. This is in contrast with
most conventional speech processing systems that use a systematic analysis 1 that
is effective only when speech segments under the analysis window are relatively
stationary and stable.
Psychology observes and attempts to explain the auditory sensations by pro-
posing models of hearing. The interaction between sounds and their perception
1
A systematic analysis extracts the same features independently of the signal context.
Frame by frame extraction of Mel Frequency Cepstrum Coefficients (MFCC) is an
example of a systematic analysis.
Towards Neurocomputational Speech and Sound Processing 61
In the cochlea inner and outer hair cells establish synapses with efferent and affer-
ent fibres. The efferent projections to the inner hair cells synapse on the afferent
connection, suggesting a modulation of the afferent information by the efferent
system. In contrast, other efferent fibres project directly to the outer hair
cells, suggesting a direct control of the outer hair cells by the efferences. It has also
been observed that all afferent fibres (from inner and outer hair cells) project directly
into the cochlear nucleus. The cochlear nucleus has a layered structure that preserves the
tonotopic frequency organisation, and one finds there very different neurons that respond to various
features 2 .
It is clear from physiology that multiple and simultaneous representations of
the same input signal are observed in the cochlear nucleus [7] [8]. In the remaining
parts of the paper, we call these representations, auditory images.
VanRullen et al. [9] discuss the importance of the relative timing in neuronal
responses and have shown that the coding of the Rank Order of spikes can
explain the fast responses that are observed in the human somatosensory system.
Along the same line, Natschläger and Maass [10] have technically shown that
information about the result of the computation is already present in the current
neural network state long before the complete spatio-temporal input patterns
have been received by the neural network. This suggests that neural networks
use the temporal order of the first spikes yielding ultra-rapid computation in
accordance with the physiology [11,9]. As an example, we can cite the work by
DeWeese et al. [12], where the authors observe that transient responses in the
auditory cortex of the rat can be described as a binary
process, rather than as a highly variable Poisson process. Once again, these
results suggest that the spike timing is crucial. As the Rank Order Coding is
one of the potential neural codes that respects these spike timing constraints, we
explore here a possible way of integration in the context of speech recognition.
On the other hand, we explore, in the context of source separation, the use
of the Temporal Correlation to compute dynamical spatio–temporal correlation
between features obtained with a bank of cochlear filters. This time the coding is
made through the synchronization of neurons. Neurons that fire simultaneously
will characterize the same sound source. The implementation we made is based
on Computational Auditory Scene Analysis, which derives from the work of
A. Bregman [13]. In our work we were interested in pushing ahead the study of
2
Onset, chopper, primary–like, etc.
Rank Order Coding has been proposed by Simon Thorpe and his team from
CERCO, Toulouse to explain the impressive performance of our visual sys-
tem [14,15]. The information is distributed through a large population of neurons
and is represented by the relative timing of spikes in a single wave of action potentials.
The quantity of information that can be transmitted by this type of code in-
creases with the number of neurons in the population. For a relatively large
number of neurons, the code transmission power can satisfy the needs of any
visual task [14]. There are advantages in using the relative order and not the
exact spike latency: the strategy is easier to implement, the system is less subject
to changes in intensity of the stimulus and the information is available as soon
as the first spike is generated.
Fig. 1. Example of time evolution of the potential from the leaky IF neuron model
with a constant input
The Leaky Integrate and Fire Neuronal Model. Our simple Integrate-and-
Fire neuron model has four parameters: adaptive threshold, leaky current, resting
potential and reset potential. At the beginning of the simulation, the internal
potential is at the resting potential. Throughout the simulation, the neuron
integrates the input to determine the neuron’s internal potential evolution. Once
the internal potential is sufficient (it reaches the threshold), a spike is generated.
The internal potential is then set at the reset potential (which is below the resting
potential) and the threshold is increased. The threshold adaptation and the reset
potential are used to avoid generating multiple spikes in a short time window.
However, with time the threshold will slowly decrease to its original value as long
as the internal potential remains below it. Furthermore, without excitation the internal
potential also slowly decreases to its resting value. A similar evolution applies
when the internal potential is below the resting value where it will slowly increase
with time to reach the resting potential. Thus, a neuron which is continuously
excited will have difficulty in generating new spikes. On the other hand, a neuron
that is not excited or that has a low excitation level will be more sensitive to an
increase in its excitation.
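A minimal numerical sketch of such a neuron is given below (Python/NumPy). All time constants, potentials and the constant-input example are illustrative placeholders, not the parameter values used by the authors.

import numpy as np

def lif_adaptive_threshold(inputs, dt=1.0, v_rest=0.0, v_reset=-1.0,
                           v_thresh0=1.0, tau_v=20.0, tau_thresh=50.0,
                           thresh_jump=0.5):
    """Leaky integrate-and-fire neuron with an adaptive threshold.

    The membrane potential leaks back towards the resting value, the
    threshold jumps after every spike and relaxes back to its original
    value, and the potential is reset below rest after a spike.
    Returns the list of time steps at which spikes were emitted.
    """
    v, thresh, spikes = v_rest, v_thresh0, []
    for n, x in enumerate(inputs):
        # integrate the input while leaking towards the resting potential
        v += dt * (-(v - v_rest) / tau_v + x)
        # the threshold slowly returns to its original value
        thresh += dt * (v_thresh0 - thresh) / tau_thresh
        if v >= thresh:
            spikes.append(n)        # a spike is generated
            v = v_reset             # reset below the resting potential
            thresh += thresh_jump   # raise the threshold (adaptation)
    return spikes

# Example: a constant input; the threshold adaptation spaces out the spikes.
print(lif_adaptive_threshold(np.full(200, 0.2)))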
Fig. 2. Illustration of spike sequence generation on a French digit ’un’ [Ẽ] with the
cochlear channel analysis and threshold neurons. (a) The signal is first filtered through
a cochlear gammatone filter-bank. It is then rectified and compressed with a square-
root law. For each channel, three threshold neurons are used with different threshold
values. If the amplitude in the channel reaches one of the neuron’s threshold, a spike
is generated. After firing, a neuron becomes inactive for the remaining time of the
stimulus. (b) White stars represent spikes. The x-axis is the time samples (sampling
frequency of 16 kHz) and the y-axis shows the filter-bank channels. Center frequencies
of channels 1 and 20 are respectively equal to 8000 and 100 Hz.
3.6 Learning and Recognition Modules
The same training and recognition procedure has been used with the simple
thresholding and leaky integrate and fire neuronal models. During training a
template is generated for each reference word. It is a sequence of the first N (most
likely) cochlear ”channel/threshold” numbers to produce a spike. For the French digit ”UN” ([Ẽ]),
Table 1. First four columns: Sequence with the spike order of the first 20 cochlear
channels/threshold numbers to produce a spike; Last columns: Generated weights ki –
illustration for the cochlear channels analysis module with threshold neurons. Channels
1 to 5 and channels greater than 19 are not taken into account (threshold neurons firing
too late).
ki = (N − i) + 1, (1)
where i is the rank in the firing order: the ”channel/threshold” that fires first is given
the highest weight. We keep going downwards until we reach the N th ”chan-
nel/threshold”. Since there is more than one neuron per channel, more than
one weight can be given to a single channel (see Table 1). The procedure is
independent of the neuronal model.
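As an illustration, the template construction just described can be sketched as follows (Python; the dictionary representation and the variable names are ours, not the authors' implementation).

def build_template(spike_order, N=20):
    """Template for one reference word: the first N "channel/threshold"
    units, in firing order, weighted by k_i = (N - i) + 1 (Eq. 1).
    A unit is a (channel, threshold) pair, so one channel may receive
    several weights through its different threshold neurons.
    """
    weights = {}
    for i, unit in enumerate(spike_order[:N], start=1):
        weights[unit] = (N - i) + 1   # earliest unit gets the highest weight
    return weights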
During recognition, for each isolated word test Tj , a sequence of ”chan-
nel/thresholds” Sj (n) (which contains the ”channel/thresholds” numbers that
generated each corresponding spike) is generated. From this sequence, we keep
the first N ”channel/thresholds” to be compared with each existing template.
During comparison with a given template, a scalar product is computed between
the weight vector K of that template and the weighted rank order vector of the
test Tj . That weighted rank order vector ROj is obtained from the sequence
Sj (n) :
\[
RO_j = \begin{bmatrix}
I^{\,[\text{rank of the "channel/threshold" number } 1 \text{ from } S_j(n)]\,-\,1}\\
\vdots\\
I^{\,[\text{rank of the "channel/threshold" number } i \text{ from } S_j(n)]\,-\,1}\\
\vdots\\
I^{\,[\text{rank of the "channel/threshold" number } N \text{ from } S_j(n)]\,-\,1}
\end{bmatrix}
\]
where I stands for a constant inhibition factor (lower than or equal to one).
This gives a similarity measure between the test and the reference template
that depends on the ”channel/threshold” rank in the sequence (Eq. 2) [18].
\[
\text{Similarity} = \sum_{i=1}^{N} k_i \, I^{\,(\text{rank of spike } i)\,-\,1} \tag{2}
\]
Finally, the template with the highest similarity with the test Tj is selected
from the dictionary. The maximum similarity, with an inhibition factor of 0.9,
would be computed as follows: 20 × 0.9^0 + 19 × 0.9^1 + 18 × 0.9^2 + ... + 2 × 0.9^18 +
1 × 0.9^19. It corresponds to the situation where the test is equal to the reference.
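A possible transcription of this matching rule into Python is sketched below; template_weights is the dictionary produced during training and test_spike_order the firing order observed on the test word (both names are ours).

def similarity(template_weights, test_spike_order, N=20, inhibition=0.9):
    """Rank-order similarity of Eq. (2): each "channel/threshold" unit of
    the test contributes the template weight of that unit, attenuated by
    inhibition**(rank - 1), where rank is its firing position in the test.
    """
    score = 0.0
    for rank, unit in enumerate(test_spike_order[:N], start=1):
        score += template_weights.get(unit, 0.0) * inhibition ** (rank - 1)
    return score

With N = 20, an inhibition factor of 0.9 and a test sequence identical to its template, this sketch indeed returns the maximum similarity quoted above.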
3.7 Experiments
Speech Database. We performed a proof-of-concept test of speech recognition
using in-house speech databases made of 10 French digits spoken by 5 men and
4 women and of the French vowels spoken by the same 5 men and 5 women
(including the 4 women from the digits database). Each speaker pronounced ten
times the same digits (from 0 to 9) and vowels. The speakers were presented
with random sequences of digits and vowels to be read. The speech was recorded
at 16 kHz using a microphone headset.
Training and Recognition. For each digit (or vowel), two reference models are
used for the recognizer (one pronunciation for each sex). For each digit (or vowel),
for each sex and for each pronunciation inside the same sex group, a preliminary
comparison between the same digits (or vowels) is performed using the very
Recognition of vowels. Recognition of the 5 French vowels [a@ioy] has been made
with three recognizers: the conventional reference HMM, one ROC prototype
with the cochlear analysis and the threshold neuron model, and another ROC
prototype with the complete auditory model with LIAF neurons. The average
recognition rates are reported in Table 2. The HMM and the Cochlear–Threshold
recognizers have comparable performance while the complete model (Complete-
LIAF) is better. The Cochlear–Threshold recognizer does relatively well (recall
that it uses only one spike per channel/threshold neuron) even if it uses only the
first signal frames. In contrast, the HMM and the Complete-LIAF systems
3
For example, the model of digit 1 for the male speakers is obtained as:
For j = 1 to 50 do (each digit 1 pronounced by a male – 50 pronunciations):
1. Compute the similarity with the other digits (49 similarities)
2. Compute the average
The pronunciation with the highest average similarity will be the model for digit
”un” [Ẽ] pronounced by the male speakers.
4
The same reference pronunciation has been used for each digit (or vowel).
5
Approximately 130 parameters for each Markovian Model: Mean and Variance of a
twelve dimensional vector for each state, 5 states and transitions between states.
Table 2. Averaged recognition rates on the five French vowels [a@ioy] for the HMM
speech recognizer and the Cochlear–Threshold and Complete-LIAF speech recognizers
use the full vowel signal. However, since the ROC is used with the Complete-
LIAF systems, the first part of the vowel remains the most important for this
prototype.
French digit recognition. In this paragraph we further push the evaluation of the
simple Cochlear–Threshold model on the 10 French digits. We respectively report
in Tables 3 and 4 the results for the very simple system (Cochlear–Threshold) and
those from the HMM recognizer. Each model column gives the number of recognized
pronunciations; each row is the pronounced digit to be recognized.
With the Cochlear–Threshold system, the best score is obtained for digit
”un” [Ẽ] (usually short and difficult to be recognized with conventional speech
recognizers). On the other hand, digits beginning with similar fricatives (like
”cinq” [sẼk], ”six” [sis] and ”sept” [sEt]) or plosives (”trois” [tRwa ] and ”quatre”
[katR]) are often confused. Since the neurons in this prototype fire only once,
the emphasis is made on the first frames of the signal. Therefore, it is coherent
that the Cochlear–Threshold system bases the recognition on the consonants of
the digits (except digit 1 that begins with the vowel).
The best score for the MFCC-HMM (Table 4) is obtained with digit 5 (a rela-
tively long digit) and the worst with 1 (the best with our prototype) and 8 [ɥit].
Table 3. Confusion table and recognition for each pronunciation of the ten French dig-
its – Cochlear filter analysis combined with the one time threshold neuron (Cochlear–
Threshold system)
Models %
Digits 1 2 3 4 5 6 7 8 9 0 65
1 (”un”) 84 1 4 1 93
2 (”deux”) 69 2 1 3 13 2 76
3 (”trois”) 10 58 18 1 1 2 64
4 (”quatre”) 22 68 75
5 (”cinq”) 11 1 2 42 21 6 7 46
6 (”six”) 1 68 13 2 1 5 75
7 (”sept”) 1 1 2 11 49 12 1 1 12 13
8 (”huit”) 1 68 16 5 75
9 (”neuf”) 4 9 54 23 60
0 (”zéro”) 15 2 3 3 9 61 67
Table 4. Confusion table and recognition for each pronunciation of the ten French
digits. MFCC analysis with HMM s.
Models %
Digits 1 2 3 4 5 6 7 8 9 0 52
1 (”un”) 15 1 20 28 16 16
2 (”deux”) 55 11 1 9 14 61
3 (”trois”) 42 48 46
4 (”quatre”) 33 32 10 15 36
5 (”cinq”) 2 81 1 1 5 90
6 (”six”) 13 72 1 4 80
7 (”sept”) 40 8 30 12 33
8 (”huit”) 31 5 8 5 38 3 5
9 (”neuf”) 4 10 19 3 44 10 49
0 (”zéro”) 90 100
It is clear that digit model 8 is not correctly trained and that the HMM speech
recognizer did not have enough data to correctly estimate the HMM parameters
(approximately 130 parameters for each Markovian reference model).
3.9 Discussion
Both ROC based speech recognizer prototypes do surprisingly well when com-
pared with a state of the art HMM recognizer.
The Complete–LIAF system uses a shorter test signal (the spike sequence
length being limited to N spikes) than the HMM system but without the sta-
tionary assumption of the MFCC analysis and yields better results on the vowels
when the training set is limited (only 1 reference speaker for each sex). The simple
Cochlear–Threshold system has results comparable to the HMM on stationary
signals (vowels) and much better results on the consonants of the digits. Clearly,
one reference speaker per sex with only one occurrence is not sufficient to train
the HMM. Furthermore, the transient cues from the consonants are spread out
with the stationary MFCC analysis.
Apart from the fact that the Rank Order Coding scheme could be a viable
approach to the recognition it is important to notice that with only one spike
per neuronal model (reported results with the Cochlear–Threshold system) the
recognition is still promising. It is interesting to link our preliminary results with
the arguments of Thorpe and colleagues [11] [18]. They argue that first spike
relative latencies provide a fast and efficient code of sensory stimuli and that
natural vision scene reconstructions can be obtained with very short durations.
If such a coding is occurring in the auditory system, the study conducted here
could be a good starting point in the design of a speech recognizer.
The statistical HMM speech recognizer and our ROC prototypes are com-
plementary. The ROC prototypes seem to be robust for the recognition of transients and unvoiced
consonants (which is known to be difficult for statistically based speech
recognizers) while the HMM based systems are known to be very robust for sta-
tionary voiced speech segments. Also, an important aspect to consider is that
the HMM technology dates from the middle of the seventies and plenty of good
training algorithms are available which is not yet the case for the ROC. For now,
a mixed speech recognizer, that relies i) on a Perceptive & ROC approach for
the transients and ii) on the MFCC & HMM approach for the voiced segments
could be viable.
While the HMM recognizer performance can be improved by increasing the
training set, the performance of our prototypes could benefit from various pre-
or post-processing schemes that are currently under investigation.
In this second half of the paper we present a work that explores the feasibility of
sound source separation without any a priori knowledge of the interfering sources.
In other words we begin to pave the road towards answering the question: ”How far can
we go in the separation of sound sources without integrating any prior knowledge
of the sources?”
[(S1, F1), (S2, F2)] or [(S1, F2), (S2, F1)]? Three solutions to this problem are
proposed in the literature:
– The most straightforward solution is the hierarchical coding of the infor-
mation. One neuron is triggered when the stimulus (S1, F1) is present and
another one turned on when the input (S1, F2) is applied and so on, for all
possible combinations [20]. The problem with this approach is an exponen-
tial increase in the number of neurons with an increase in the number of
classes and a lack of autonomy for new (not previously seen) classes.
– Another solution is the use of attentional models [21]. In this method, atten-
tion is focused on one of the elements in the stimulus, ignoring the others.
When the classification of this element is completed, it is dismissed and other
elements in the input are analyzed.
– The third solution is the temporal correlation. We use that approach as
suggested by Milner [22] and Malsburg [23,19] who observed that synchrony
is a crucial feature to bind neurons associated to similar characteristics.
Objects belonging to the same entity are bound together in time. In this
framework, synchronization between different impulse (or spiking) neurons
and desynchronization among different regions perform the binding.
[Fig. 3 block diagram: Sound Mixture → Analysis filterbank → Envelope FFT / FFT → CAM and CSM auditory maps (frequency × channels) → Switch → Channel-binding spiking network → Generated masks → Synthesis filterbank → Separated signals]
Fig. 3. Source Separation System. Two auditory images are simultaneously generated
– a Cochleotopic/AMtopic map (CAM) and a Cochleotopic/Spectrotopic map (CSM)
– in two different paths. Based on the neural synchrony of the lower layer neurons,
a binary mask is generated to mute – in time and across channels – the synthesis
filterbank’s channels that do not belong to the desired source. The first layer of the
neural layer segregates the auditory objects on the CAM/CSM maps and the second
layer binds the channels that are dominated by the same stream. For now, the user
decides which representation (CAM or CSM) is being presented to the neural network.
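The masking stage of the figure can be pictured with the minimal sketch below (Python/NumPy). It assumes that the analysis filterbank outputs and the binary mask produced by the binding layer are already available, and a plain summation of the masked channels stands in for the proper synthesis filterbank.

import numpy as np

def apply_binary_mask(channel_signals, mask, frame_len):
    """Mute filterbank channels, frame by frame, according to a binary mask.

    channel_signals: array (n_channels, n_samples), analysis filterbank
                     outputs for the mixture.
    mask:            array (n_channels, n_frames), 1 where a cell is
                     dominated by the desired source, 0 otherwise.
    """
    n_channels, n_samples = channel_signals.shape
    masked = np.zeros_like(channel_signals)
    for f in range(mask.shape[1]):
        start, stop = f * frame_len, min((f + 1) * frame_len, n_samples)
        masked[:, start:stop] = channel_signals[:, start:stop] * mask[:, f:f + 1]
    return masked.sum(axis=0)   # crude resynthesis of the selected source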
Fig. 4. The spectrogram of the mixture of an utterance, a tone, and a telephone ring
Fig. 5. The spectrogram of the extracted utterance. The telephone and the tone have
been suppressed, while the voice has been preserved.
4.4 Discussion
We have presented a system that comprises a perceptive analysis to extract
multiple and simultaneous features to be processed by an unsupervised neural
network. There is no need to tune the neural network when changing the
nature of the signal. Furthermore, there is no training or recognition phase.
Even with a crude approximation such as binary masking with non-overlapping and
independent time windows 8 , we obtain relatively good synthesis intelligibility.
5 Conclusion
Starting in the mid ’80s, auditory models were already proposed and tested
on corrupted speech but with limited success because of pattern recognizers’
inabilities to exploit the rich time-structured information generated by these
models. Spiking neural networks open doors to new systems with a stronger
integration between analysis and recognition.
For now, the presented systems are relatively limited and have been tested
on limited corpus. Many improvements should be made before considering an
extensive use of these approaches in real situations.
Still, the experimental results led us to the conclusion that computational
neuroscience in combination with speech processing offers a strong potential for
unsupervised and dynamic speech processing systems.
Acknowledgments
Many thanks to Simon Thorpe and Daniel Pressnitzer for receiving S. Loiselle
during his 2003 summer session in CERCO, Toulouse. The authors would also
like to thank DeLiang Wang and Guoning Hu for fruitful discussions on oscilla-
tory neurons, and Christian Feldbauer and Gernot Kubin for discussions on filterbanks
and software exchanges.
References
1. Shihab Shamma. Physiological foundations of temporal integration in the percep-
tion of speech. Journal of Phonetics, 31:495–501, 2003.
2. Dmitry N. Zotkin, Taishih Chi, Shihab A. Shamma, and Ramani Duraiswami. Neu-
romimetic sound representation for percept detection and manipulation. EURASIP
Journal on Applied Signal Processing, Special Issue on Anthropomorphic Process-
ing of Audio and Speech:1350–1364, June 2005.
3. Mounya Elhilali and Shihab Shamma. A biologically-inspired approach to the
cocktail party problem. In ICASSP, volume V, pages 637–640, 2006.
4. Special Issue. Speech annotation and corpus tools. Speech Communication Journal,
33(1–2), Jan 2001.
5. http://www.elda.fr, 2004.
6. B. L. Karlsen, G. J. Brown, M. Cooke, M. Crawford, P. Green, and S. Renals.
Analysis of a Multi-Simultaneous-Speaker Corpus. L. Erlbaum, 1998.
8
Binary masks create artifacts by placing zeros in the spectrum where the interfering
source was. The absence of energy at these locations can be heard (perception of an
absent signal, or musical noise).
30. Jean Rouat and Ramin Pichevar. Source separation with one ear: Proposition for
an anthropomorphic approach. EURASIP Journal on Applied Signal Processing,
Special Issue on Anthropomorphic Processing of Audio and Speech:1365–1373,
June 2005.
31. Ramin Pichevar and Jean Rouat. A Quantitative Evaluation of a Bio-Inspired
Sound Source Segregation Technique Based for two and three Source Mixtures. In
G. Chollet, A. Esposito, M. Faundez-Zanuy, and M. Marinaro, editors, Advances
in Nonlinear Speech Modeling and Applications, volume 3445 of Lecture Notes in
Computer Science, pages 430–434. Springer Verlag, 2005.
32. Ramin Pichevar. http://www-edu.gel.usherbrooke.ca/pichevar/Demos.htm
33. J. Rouat. http://www.gel.usherb.ca/rouat
Extraction of Speech-Relevant Information from
Modulation Spectrograms
1 Introduction
One of the most technically challenging issues in speech recognition is to handle
additive (background) noises and convolutive (e.g. due to microphone and data
acquisition line) noises, in a changing acoustic environment. The performance
of most speech recognition systems degrades when these two types of noise
corrupt the speech signal simultaneously. General methods for signal separation or
enhancement require multiple sensors. For a monaural (one-microphone) signal,
intrinsic properties of speech or interference must be considered [1].
The auditory system of humans and animals can efficiently extract the behav-
iorally relevant information embedded in natural acoustic environments. Evolu-
tionary adaptation of the neural computations and representations has probably
facilitated the detection of such signals with low SNR over natural, coherently
fluctuating background noises [2,3].
The neural representation of sound undergoes a sequence of substantial trans-
formations going up to the primary auditory cortex (A1) via the midbrain and
thalamus. The representation of the physical structure of simple sounds seems
to be ”degraded” [4,5]. However, certain features, such as spectral shape infor-
mation, are greatly enhanced [6]. Auditory cortex maintains a complex repre-
sentation of the sounds, which is sensitive to temporal [5,7] and spectral [8,9]
context over timescales of seconds and minutes [10]. Auditory neuroethologists
have discovered pulse-echo tuned neurons in the bat [11], song-selective neu-
rons in songbirds [12], and call-selective neurons in primates [13]. It has been argued
[14] that the statistical analysis of natural sounds - vocalizations, in particular
- could reveal the neural basis of acoustical perception. Insights in the auditory
processing then, could be exploited in engineering applications for efficient sound
while modulation frequencies which distinguish speech from other sounds are
preserved (and estimated). A simple thresholding classifier, referred to as the
relevant response ratio, is proposed, measuring the similarity of sounds to the
compact modulation spectra.
The auditory model of Shamma et al. [17] is briefly presented in the next
section. In Section 3 we describe the information-theoretic principle and the sequen-
tial information bottleneck procedure applied to auditory features. Finally, we
present some preliminary evaluations of these ideas in Section 4.
to I(t, y) is minimal, because its features are mostly irrelevant in this case.
Therefore, we don’t even have to estimate the responses at these locations of the
modulation spectrum. This implies an important reduction in computational
load, still keeping the maximally informative features with respect to the task of
speech-nonspeech discrimination. To find out the identity of the remaining two
clusters, we compute:
\[
p(t, y) = \sum_{c} p(c, y)\, p(t|c) \tag{5}
\]
\[
p(t) = \sum_{y} p(t, y) \tag{6}
\]
\[
p(y|t) = \frac{p(t, y)}{p(t)} \tag{7}
\]
The cluster that maximizes the likelihood p(y1 |t) contains all relevant features
for y1 ; the other, for y2 . We hence denote the first cluster as t1 and the latter
as t2 . The typical pattern (3-dimensional distribution) of features relevant for y1
is given by p(c|t = t1 ), while for y2 is given by p(c|t = t2 ). According to Bayes
rule, these are defined as:
\[
p(c|t = t_j) = \frac{p(t = t_j|c)\, p(c)}{p(t = t_j)}, \quad j = 1, 2 \tag{8}
\]
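A literal transcription of Eqs. (5)-(8) into NumPy could look as follows; p_cy stands for the joint distribution p(c, y) over feature locations and targets, and p_t_given_c for the cluster assignments delivered by the sequential IB procedure (both arrays, and their names, are assumptions of this sketch).

import numpy as np

def identify_clusters(p_cy, p_t_given_c):
    """Compute p(y|t) and p(c|t) from the IB cluster assignments.

    p_cy:        array (n_features, n_targets), joint distribution p(c, y).
    p_t_given_c: array (n_features, n_clusters), p(t|c).
    """
    p_ty = p_t_given_c.T @ p_cy               # Eq. (5): p(t, y)
    p_t = p_ty.sum(axis=1, keepdims=True)     # Eq. (6): p(t)
    p_y_given_t = p_ty / p_t                  # Eq. (7): p(y|t)

    p_c = p_cy.sum(axis=1)                    # marginal p(c)
    p_ct = p_t_given_c * p_c[:, None]         # p(t|c) p(c)
    p_c_given_t = p_ct / p_ct.sum(axis=0)     # Eq. (8): p(c|t)
    return p_y_given_t, p_c_given_t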
Figure 1 presents an example of the relevant modulation spectrum of each so-
und ensemble, speech and non-speech (music, animal sounds and various noises).
Fig. 1. p(c|t = t1 ) for non-speech (a) and p(c|t = t2 ) for speech class (b). Cluster t1
holds 37.5% and t2 holds 24.7% of all responses. The remaining 37.8% are irrelevant.
to the same increase in frequency band. The harmonic structure due to voiced
speech segments is mainly depicted at the higher spectral modulations (2-6 cy-
cles/octave). Scales lower than 2 cycles/octave represent the spectral envelope
or formants [29]. Temporal modulations in speech spectrograms are the spec-
tral components of the time trajectory of spectral envelope of speech. They are
dominated by the syllabic rate of speech, typically close to 4 Hz, whereas most
relevant temporal modulations are below 8 Hz in the figure. It can also be no-
ticed that the lower frequencies - between 330 and 1047 Hz - are more prominent
than higher ones, in accordance with the analysis in [30], due to the dominance of
voice pitch over these lower frequency bands [29].
The non-speech class consists of quite dissimilar sounds - natural and artificial ones.
Therefore, its modulation spectrum has a rather ”flat” structure, mostly reflecting
points in the modulation spectrum not occupied by speech: rates lower than 2 Hz in
combination with frequencies lower than 330 Hz and scales less than 1 cycle/octave;
the frequency-scale distribution does not have any structure as in the case of speech.
Knowledge of such compact modulation patterns allows us to classify new
incoming sounds based on the similarity of their cortical-like representation (the
feature tensor Z) to the typical pattern p(c|t = t1 ) or p(c|t = t2 ). We assess
the similarity (or correlation) of Z to p(c|t = t1 ) or p(c|t = t2 ), by their inner
(tensor) product (a compact one dimensional feature). We propose the ratio of
both similarity measures, denoted as relevant response ratio:
\[
R(\hat{Z}) = \frac{\langle \hat{Z},\, p(c|t = t_2) \rangle}{\langle \hat{Z},\, p(c|t = t_1) \rangle} \gtrless \lambda \tag{9}
\]
together with a predefined threshold, λ, for an effective classification of sounds.
Large values of R give strong indications towards target y2 , small values toward y1 .
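In code, the classification rule reduces to two inner (tensor) products and a comparison, for instance as below; the threshold value is only a placeholder and, as discussed next, has to be tuned to the SNR condition.

import numpy as np

def relevant_response_ratio(Z_hat, p_c_speech, p_c_nonspeech):
    """Eq. (9): ratio of the inner products between the (normalized)
    cortical feature tensor and the speech / non-speech relevant patterns."""
    num = np.tensordot(Z_hat, p_c_speech, axes=Z_hat.ndim)
    den = np.tensordot(Z_hat, p_c_nonspeech, axes=Z_hat.ndim)
    return num / den

def is_speech(Z_hat, p_c_speech, p_c_nonspeech, threshold=1.0):
    """Classify as speech when R exceeds the (SNR-dependent) threshold."""
    return relevant_response_ratio(Z_hat, p_c_speech, p_c_nonspeech) > threshold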
[Fig. 2: histograms of the relevant response ratio R for speech and non-speech examples under 20 dB, 10 dB, 0 dB and −10 dB SNR conditions]
We calculate the relevant response ratio R for all training examples and noise
conditions. Figure 2 shows the histograms of R computed on speech and non-
speech examples. It is important to note that the histograms form two dis-
tinct clusters, with a small degree of overlap. For the purpose of classification, a
threshold has to be defined such that any sound whose corresponding relevant
response ratio R is above this threshold is classified as speech, otherwise as non-
speech. Obviously, this threshold is highly dependent on the SNR condition under
which the features are extracted. This is especially true for low SNR conditions
(0dB, -10dB).
[Fig. 3: indexing of concatenated speech/non-speech segments; panels (a) and (b) show, for two SNR conditions, the speech/non-speech flag obtained by thresholding R together with the reference signal over time]
(500ms), such that within one such frame speech and nonspeech events might
be concatenated. From each of these frames, a feature tensor Z holding the
cortical responses is extracted. Figure 3 shows the indexing of the concatenated
speech/nonspeech segments using the relevant response ratio with a threshold for
these two different SNR conditions.
4 Conclusions
Classical methods of dimensionality reduction seek the optimal projections to
represent the data in a low - dimensional space. Dimensions are discarded based
on the relative magnitude of the corresponding singular values, without testing
if these could be useful for classification. In contrast, an information theoretic
approach enables the selection of a reduced set of auditory features which are
maximally informative with respect to the target - speech or non-speech class in this
case. A simple thresholding classifier, built upon these reduced representations,
could exhibit good performance with a reduced computational load.
The method could be tailored to other recognition tasks involving speech attributes,
such as speech or speaker recognition. We propose to use perceptual grouping
cues, i.e., sufficiently prominent differences along any auditory dimension, which
eventually segregate an audio signal from other sounds - even in cocktail party
settings [31]. A sound source - a speaker, e.g. - could be identified within a time
frame of some hundreds of ms, by the characteristic statistical structure of his
voice (estimated using the IB method). The dynamic segregation of the same signal
could proceed using unsupervised clustering and Kalman prediction as in [32].
Hermansky has argued in [30] that Automatic Speech Recognition (ASR)
systems should take into account the fact that our auditory system processes
syllable-length segments of sounds (about 200 ms). Analogously, ASR recognizers
shouldn’t rely on short (tens of ms) segments for phoneme classification, since
phoneme-relevant information is asymmetrically spread in time, with most of the
supporting information found between 20 and 80 ms beyond the current frame.
This is also reflected in the prominent rates in speech modulation spectrum [30].
References
1. BA Pearlmutter, H Asari, and AM Zador. Sparse representations for the cocktail-
party problem. unpublished, 2004.
2. H Barlow. Possible principles underlying the transformation of sensory messages,
pages 217–234. MIT, Cambridge, MA, 1961.
3. I Nelken, Y Rotman, and O Bar Yosef. Responses of auditory-cortex neurons to
structural features of natural sounds. Nature, 397:154–7, 1999.
4. PX Joris, CE Schreiner, and A Rees. Neural processing of amplitude-modulated
sounds. J Physiol, 5:257–273, 2004.
5. N Ulanovsky, L Las, and I Nelken. Processing of low-probability sounds by cortical
neurons. Nature Neurosci., 6:391–8, 2003.
6. L Las, E Stern, and I Nelken. Representation of tone in fluctuating maskers in the
ascending auditory system. J Neurosci, 25(6):1503–1513, 2005.
1 Introduction
Fig. 1. Flow diagram for measuring the discontinuity of two successive speech units
Since the purpose of unit selection is to locate segments (units) that will make
the synthetic speech sound natural, much effort has been devoted to finding
the relation between objective distance measures and perceptual impressions.
When searching for an objective distance measure that is able to predict perceptual
discontinuities or to measure the variations of allophones, a subjective
measure needs to be obtained. For this purpose listeners are asked to decide on the
existence of a discontinuity or to judge speech quality, which may be evaluated for
intelligibility, naturalness, voice pleasantness, liveness, friendliness, etc. Because
of the expected variability in the human responses, the Mean Opinion Score (MOS)
is usually used to determine the quality of the synthetic speech.
\[
e(c_i, d_i) = \sum_{i=1}^{N} \sum_{k = c_{i-1}}^{c_i} P[k]\, (k - d_i)^2
\]
where P [k] is the power spectrum, ci represents the bounds (bandwidth) and di
denotes the centers of the formants. N determines the number of centroids, which
also depends on the sampling frequency. For example, if the sampling frequency
is equal to 16 kHz, four centroids are evaluated [13].
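A direct transcription of this measure might read as follows (Python/NumPy); P is the power spectrum of one frame, bounds holds the band edges c_0, ..., c_N and centers the centroids d_1, ..., d_N (all names are ours).

import numpy as np

def centroid_error(P, bounds, centers):
    """Weighted squared distance of spectral bins to their band centroids:
    e = sum_i sum_{k = c_{i-1}}^{c_i} P[k] (k - d_i)^2."""
    err = 0.0
    for i, d in enumerate(centers, start=1):
        k = np.arange(bounds[i - 1], bounds[i] + 1)   # bins of band i
        err += np.sum(P[k] * (k - d) ** 2)
    return err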
3.2.2 Bispectrum
Speech features obtained by linear prediction analysis as well as by Fourier analy-
sis are determined from the amplitude or power spectrum. Thus, the phase infor-
mation of the speech signal is neglected. However, phase information has been
proven to play an important role in speech naturalness and signal quality in
general. Furthermore, the higher order information is ignored since the power
spectrum is only determined by second order statistics. If speech were a Gaussian
process, then the second order statistics would suffice for a complete description.
However, evidence appears to indicate that in general, speech is non-Gaussian. To
take into account phase information as well as higher order statistics, the bispectrum
as well as the Wigner-Ville transform and the modified Mellin transform were tested by
Chen et al. [8]. In this paper the bispectrum (D7) was also tested, since it has been
shown [8] that it provides high correlation scores.
The bispectrum is defined as the 2-D Fourier transform of the 2-lag autocorrelation
function:
\[
S_{3x}(f_1, f_2) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} C_{3x}(k, l)\, e^{-j 2\pi f_1 k}\, e^{-j 2\pi f_2 l} \tag{1}
\]
where
\[
C_{3x}(k, l) = \sum_{n=-\infty}^{\infty} x^{*}[n]\, x[n+k]\, x[n+l] \tag{2}
\]
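For a finite frame, a naive estimate of Eqs. (1)-(2) can be sketched as follows; the truncation to a maximum lag and the plain double loop are our own simplifications for clarity, not an efficient implementation.

import numpy as np

def third_order_moment(x, max_lag):
    """C_3x(k, l) of Eq. (2), summed over a finite frame, for |k|, |l| <= max_lag.
    The frame x is real, so the conjugate in Eq. (2) is omitted."""
    n = len(x)
    lags = range(-max_lag, max_lag + 1)
    C = np.zeros((2 * max_lag + 1, 2 * max_lag + 1))
    for i, k in enumerate(lags):
        for j, l in enumerate(lags):
            idx = np.arange(max(0, -k, -l), min(n, n - k, n - l))
            C[i, j] = np.sum(x[idx] * x[idx + k] * x[idx + l])
    return C

def bispectrum(x, max_lag=32):
    """Direct bispectrum estimate of Eq. (1): 2-D FFT of the lag function."""
    return np.fft.fft2(third_order_moment(np.asarray(x, dtype=float), max_lag))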
\[
h[n] = \sum_{k=-L(n_i)}^{L(n_i)} A_k[n]\, e^{j 2\pi k f_0(n_i)(n - n_i)} \tag{3}
\]
where L(ni ) and f0 (ni ) denote the number of harmonics and the fundamental
frequency respectively, at n = ni , while
where ak (ni ) and bk (ni ) are assumed to be complex numbers which denote the
amplitude of the k th harmonic and the first derivative (slope) respectively.
b. AM&FM Decomposition
The second set of features is based on a technique which tries to decompose
speech signals into Amplitude Modulated (AM) and Frequency Modulated (FM)
components (D9). Teager [16], [17], in his work on nonlinear modeling of speech
production, has used the nonlinear operator known as Teager-Kaiser energy
operator:
\[
\Psi\{x[n]\} = x^2[n] - x[n-1]\, x[n+1] \tag{5}
\]
on the speech signal x[n]. Based on this operator, Maragos et al. [20] have developed
the Discrete Energy Separation Algorithm (DESA) for separating an AM-FM
modulated signal into its components. An AM-FM modulated signal has the
form
x[n] = a[n]cos(Ω[n])
where Ω[n] is the instantaneous frequency and a[n] is the instantaneous ampli-
tude.
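The sketch below implements the Teager-Kaiser operator of Eq. (5) and one member of the DESA family (a DESA-2 style separation); it is an illustration under our own simplifications and not necessarily the exact variant used for this feature set.

import numpy as np

def teager(x):
    """Teager-Kaiser energy operator, Psi{x[n]} = x[n]^2 - x[n-1] x[n+1] (Eq. 5).
    One sample is lost at each border."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """Estimate the instantaneous frequency (rad/sample) and amplitude
    envelope of an AM-FM signal x[n] = a[n] cos(phase[n]), following a
    DESA-2 style energy separation."""
    x = np.asarray(x, dtype=float)
    psi_x = teager(x)                     # Psi{x[n]}
    y = x[2:] - x[:-2]                    # symmetric difference x[n+1] - x[n-1]
    psi_y = teager(y)                     # Psi{x[n+1] - x[n-1]}
    m = len(psi_y)
    psi_x = psi_x[1:1 + m]                # align both sequences on the same n
    omega = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x + 1e-12), -1.0, 1.0))
    amplitude = 2.0 * psi_x / np.sqrt(np.maximum(psi_y, 1e-12))
    return omega, amplitude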
4 Distance Measures
After the computation of features at the concatenated segments, their closeness
should somehow be determined. As measures one can use metrics,
similarity measures and discriminant functions. Here, the following distance mea-
sures were tested:
(a) l1 or absolute difference
(b) l2 or Euclidean distance
(c) Kullback-Leibler divergence
(d) Mahalanobis Distance
(e) Fisher’s linear discriminant
(f) Linear regression
Absolute and Euclidean distance are metrics that belong to the same family.
Their difference lies in the fact that the Euclidean distance amplifies differences in
specific parameters of the feature vector more than the absolute distance does.
Kullback-Leibler (KL) divergence as well as Mahalanobis distance come from
statistics. Mahalanobis distance is similar to Euclidean with each parameter of
the feature vector being divided by its variance. KL divergence is used to measure
the distance between two probability distributions.
A symmetric version of KL divergence was used to measure the distance be-
tween two spectral envelopes and is given by,
\[
D_{KL}(P, Q) = \int \left( P(\omega) - Q(\omega) \right) \log \frac{P(\omega)}{Q(\omega)}\, d\omega \tag{6}
\]
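Numerically, this symmetric divergence can be approximated on spectral envelopes sampled on a common frequency grid, for instance as below (the normalization to unit sum is our assumption, so that the envelopes behave like densities).

import numpy as np

def symmetric_kl(P, Q, eps=1e-12):
    """Discrete approximation of Eq. (6) between two sampled spectral envelopes."""
    P = np.asarray(P, dtype=float) + eps
    Q = np.asarray(Q, dtype=float) + eps
    P, Q = P / P.sum(), Q / Q.sum()
    return np.sum((P - Q) * np.log(P / Q))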
\[
y = w^{T} x \tag{7}
\]
and a corresponding set of N samples y1 ,...,yN that is divided into the subsets Y0
and Y1 . This is equivalent to forming a hyperplane in d-space which is orthogonal
to w (Fig. 3).
Since Fisher’s linear discriminant projects feature vectors to a line, it can also
be viewed as an operator (FLD) which is defined by
\[
FLD\{x\} = \sum_{i=1}^{d} w_i x_i \tag{8}
\]
where wi are the elements of w. If xi are real positive numbers, this is a kind
of weighted version of the l1 norm (the weights can be negative numbers). According
to this method, features which are on different scales can now be combined or
compared.
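A minimal sketch of estimating such weights from labelled training data and applying them as a weighted score is given below; the two-class closed form w proportional to S_W^{-1}(m_1 - m_0) is the standard Fisher solution, while the variable names and data layout are ours.

import numpy as np

def fisher_weights(X0, X1):
    """Fisher linear discriminant direction for two classes.
    X0, X1: arrays (n_samples, d) of feature-difference vectors for the
    'smooth' and 'discontinuous' joins respectively."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) \
       + np.cov(X1, rowvar=False) * (len(X1) - 1)   # within-class scatter
    return np.linalg.solve(Sw, m1 - m0)

def fld_score(w, x):
    """FLD{x} = sum_i w_i x_i (Eq. 8), a data-driven weighted combination
    of the per-parameter differences, usable as a join cost."""
    return float(np.dot(w, x))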
In another study, by Vepa and King [25], a linear dynamical model (Kalman
filter) on LSF trajectories has been used for the computation of the join cost in unit
selection speech synthesis. The model, after training, could be used to measure
how well concatenated speech segments join together. The objective join cost
is based on the error between model predictions and actual observations. The linear
dynamical model was also used for smoothing the LSF coefficients, reducing
the audible discontinuities. An advantage of this method is that the degree and
extent of the smoothing is controlled by the model parameters which are learned
from natural speech.
In this section we briefly present the database used for comparing all the pre-
viously reported methods and features as well as the listening experiment that
was conducted. A more detailed description can be found in Klabbers et al. [24].
It is worth noting that since the same database has already been used on the
same task, useful conclusions may be reached.
Five subjects with backgrounds in psycho-acoustics or phonetics participated
in the listening experiment. The material was composed of 1449 Ci V Cj stimuli,
which were constructed by concatenating diphones Ci V and V Cj excised from
nonsense words of the form C@CV C@ (where C = consonant, V = vowel ∈ {/a/,
/i/, /u/} and @ = schwa). The recordings were made by a semi-professional
female speaker. Speech signals have been sampled at 16 kHz.
Preliminary tests showed that discontinuities and other effects in the sur-
rounding consonants would overshadow the effects in the vowel. Hence the sur-
rounding consonants were removed. In addition, the duration of the vowels was
normalized to 200 ms and the signal power of the second diphone was scaled to
equalize the level of both diphones in the boundary. The stimuli were randomized
and the subjects were instructed to ignore the vowel quality and focus on the
diphone transition. Listeners’ task was to make a binary decision about whether
the transition was smooth (0) or discontinuous (1). The experiment was divided
into six blocks, presented in three hourly sessions with a short break between
two blocks. A transition was marked as discontinuous when the majority of the
subjects (3 or more out of 5) perceived it as such.
7 Results
perceived as discontinuous by the listeners. Then the detection rate for that
measure, y, is computed as:
\[
P_D(\gamma) = \int_{\gamma}^{\infty} p(y|1)\, dy \tag{9}
\]
which means that the false alarm rate was set to 5%.
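With a finite labelled test set, the threshold gamma and the detection rate can be estimated empirically, for instance as in the sketch below (the array names are ours).

import numpy as np

def detection_rate_at_false_alarm(scores_smooth, scores_discontinuous,
                                  false_alarm=0.05):
    """Empirical counterpart of Eq. (9): pick gamma so that only a fraction
    `false_alarm` of the perceptually smooth joins exceed it, then report
    the fraction of discontinuous joins detected above gamma."""
    gamma = np.quantile(np.asarray(scores_smooth, dtype=float), 1.0 - false_alarm)
    return float(np.mean(np.asarray(scores_discontinuous, dtype=float) > gamma))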
In Table 1, the detection rates of the various distance measures are presented. For the
statistical methods, such as the Fisher linear discriminant and linear regression, the
training was done on 80% of the database, while the testing was done on
the remaining 20% of the database. Note also that the evaluation is independent
of the phonemes of the database while most previous studies were phoneme
specific. Phoneme specific approaches [7] [24] [8] provide better results compared
to phoneme independent approaches [11]. This is expected since in the former
case the search space is smaller compared to the space generated in the phoneme
independent analysis case. However, even for these phoneme specific approaches
the prediction score cannot be considered to be sufficiently high.
In the table, the feature sets are represented with numbers (D1, D2, ...),
while the letters (a, b, ...) following the feature set correspond to the distance.
For example, D3d means that LSF coefficients have been used along with the
Mahalanobis distance. It is obvious from the table that no speech represen-
tation passed a 50% detection rate. The spectrum evaluated from the FFT (D1), from
LPC coefficients (D2) and the bispectrum (D7) gave low detection rates. LSFs and
MFCCs combined with Fisher’s linear discriminant performed well. The same conclu-
sion can be drawn for the nonlinear harmonic model and the AM&FM decomposition.
The latter gave the best detection rate, 49.40%. Linear regression gave detection
rates close to Fisher’s linear discriminant, as was expected. These results show
clearly that a lot of work remains to be done despite the considerable effort of
many researchers in searching for an optimal distance and feature representation.
From the above it is obvious that, using a weighted distance, the detection rates
are improved independently of the features. This is explained by the fact that the
weights are trained on the same database. Moreover, these data-driven weights
can boost some particular parameters of the feature vector and eliminate some
others.
References
1. Robert E. Donovan. Trainable Speech Synthesis. PhD thesis, Cambridge University,
Engineering Department, 1996.
2. A. Hunt and A. Black. Unit selection in a concatenative speech synthesis sys-
tem using large speech database. Proc. IEEE Int. Conf. Acoust., Speech, Signal
Processing, pages 373–376, 1996.
3. W. N. Campbell and A. Black. Prosody and the selection of source units for
concatenative synthesis. In R. Van Santen, R.Sproat, J.Hirschberg, and J.Olive,
editors, Progress in Speech Synthesis, pages 279–292. Springer Verlag, 1996.
4. M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal. The AT&T
Next-Gen TTS System. 137th meeting of the Acoustical Society of America, 1999.
http://www.research.att.com/projects/tts.
5. Thierry Dutoit. An Introduction to Text-to-Speech Synthesis. Kluwer Academic
Publishers, 1997.
6. T. R. Barnwell S. R. Quackenbush and M. A. Clements. Objective Measures of
Speech Quality. Prentice Hall, 1988.
7. J. Wouters and M. Macon. Perceptual evaluation of distance measures for concate-
native speech synthesis. International Conference on Spoken Language Processing
ICSLP 98, pages 2747–2750, 1998.
8. J.-D. Chen and N. Campbell. Objective distance measures for assessing concate-
native speech synthesis. EuroSpeech99, pages 611–614, 1999.
9. Jerome R. Bellegarda. A novel discontinuity metric for unit selection text-to-speech
synthesis. 5th ISCA Speech Synthesis Workshop, pages 133–138, 2004.
10. E. Klabbers and R. Veldhuis. On the reduction of concatenation artefacts in di-
phone synthesis. International Conference on Spoken Language Processing ICSLP
98, pages 1983–1986, 1998.
11. Y. Stylianou and A. Syrdal. Perceptual and objective detection of discontinuities
in concatenative speech synthesis. Proc. IEEE Int. Conf. Acoust., Speech, Signal
Processing, 2001.
12. Robert E. Donovan. A new distance measure for costing spectral discontinuities
in concatenative speech synthesis. The 4th ISCA Tutorial and Research Workshop
on Speech Synthesis, 2001.
13. J. Vepa, S. King and P. Taylor. Objective distance measures for spectral disconti-
nuities in concatenative speech synthesis. ICSLP 2002, pages 2605–2608, 2002.
14. F. K. Soong and B. H. Juang. Line spectrum pairs and speech data compression.
ICASSP, pages 1.10.1–1.10.4, 1984.
15. L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall,
1993.
16. H. M. Teager. Some observations on oral air flow during phonation. IEEE Trans.
Acoust., Speech, Signal Processing, Oct 1980.
17. H. M. Teager and S. M. Teager. Evidence for nonlinear sound production mecha-
nism in the vocal tract. Speech Production and Speech Modelling, 55, Jul 1990.
18. Y. Pantazis, Y. Stylianou and E. Klabbers. Discontinuity detection in concatenated
speech synthesis based on nonlinear analysis. InterSpeech2005, pages 2817–2820,
2005.
19. Yannis Stylianou. Harmonic plus Noise Models for Speech, combined with Statis-
tical Methods, for Speech and Speaker Modification. PhD thesis, Ecole Nationale
Supérieure des Télécommunications, 1996.
20. P. Maragos, J. Kaiser and T. Quatieri. On separating amplitude from frequency
modulations using energy operators. Proc. IEEE Int. Conf. Acoust., Speech, Signal
Processing, Mar 1992.
21. H. Kawai and M. Tsuzaki. Acoustic measures vs. phonetic measures as predictors
of audible discontinuity in concatenative synthesis. ICSLP, 2002.
22. A. K. Syrdal and A. D. Conkie. Data-driven perceptually based join cost. 5th
ISCA Speech Synthesis Workshop, pages 49–54, 2004.
23. J. Wouters and M. W. Macon. Unit fusion for concatenative speech synthesis.
ICSLP, Oct 2000.
24. E. Klabbers and R. Veldhuis. Reducing audible spectral discontinuities. IEEE
Transactions on Speech and Audio Processing, 9:39–51, Jan 2001.
25. J. Vepa and S. King. Kalman-filter based join cost for unit selection speech
synthesis. Eurospeech, Sep 2003.
Voice Disguise and Automatic Detection:
Review and Perspectives
Abstract. This study focuses on the question of voice disguise and its detection.
Voice disguise is considered as a deliberate action of the speaker who wants to
falsify or to conceal his identity; the problem of voice alteration caused by
channel distortion is not presented in this work. A large range of options are
open to a speaker to change his voice and to trick a human ear or an automatic
system. A voice can be transformed by electronic scrambling or more simply by
exploiting intra-speaker variability: modification of pitch, modification of the
position of the articulators such as the lips or tongue, which affects the formant
frequencies. The proposed work is divided into three parts: the first one is a
classification of the different options available for changing one’s voice, the
second one presents a review of the different techniques in the literature and the
third one describes the main indicators proposed in the literature to distinguish a
disguised voice from the original voice, and proposes some perspectives based
on disordered and emotional speech.
1 Introduction
In the field of disguised voices, different studies have been carried out on some
specific features and some specific kinds of disguise. Voice disguise is the purposeful
change of perceived age, gender, identity, personality of a person. It can be realized
mechanically by using some particular means to disturb the speech production
system, or electronically by changing the sound before it gets to the listener. Several
application areas are concerned with voice disguise: forensic science, entertainment, speech
synthesis, speech coding. In the case of forensic science, a study [26] reveals that
voice disguise is used in 52% of the cases where the offender uses his voice
and supposes that he is being recorded. The challenge is to find indicators to detect the type
of disguise and if possible the original voice. Research into voice disguise could also
provide some possibilities to personalize a voice in the development of modern
synthesisers. The question of voice disguise includes the voice transformation, the
voice conversion and the alteration of the voice by mechanical means. The principle of
disguise consists in modifying the voice of one person to sound different (mechanical
alteration or transformation), or sound like another person (conversion). There are a
number of different features to be mapped like voice quality, timing characteristics
and pitch register. The voice modality cannot be compared to digital fingerprints or DNA
in terms of robustness. This is the reason why it is interesting to study this modality in
order to understand the invariable and variable features of the voice. So, the aim of this
paper is, after a classification of voice disguise, to present the different approaches in
the literature and, last, to indicate some directions of research towards automatic
detection.
The difficulty in classifying the voice disguise is to determine what a normal voice is.
Some people speak naturally with a creaky voice, while some others with a hoarse
voice. A disguise is applied when there is a deliberate will to transform one’s voice to
imitate someone or just to change the sound. Distinguishing electronic and non-
electronic alteration appears to be a good method of classification.
                         Electronic                                    Non-electronic
Voice conversion         Vector quantization; LMR (Linear              Imitation
                         Multivariate Regression); GMM (Gaussian
                         Mixture Model); indexation in a client
                         memory; …
Voice transformation     Electronic device; voice changer software     Mechanic alteration;
                                                                       prosody alteration
In the first category (electronic), there are two kinds of disguise: conversion and
transformation. The first one consists in transforming a source speaker voice to sound
like a target speaker voice, and the second one in modifying electronically some
specific parameters like the frequency, the speaking rate and so on, in order to change
the sound of the voice. There are many free software tools which offer to modify the
register of one’s own voice [45].
In the second category (non-electronic), we consider as voice conversion the
imitation by a professional impersonator and as voice transformation the different
possibilities to change one or more parameters of the voice as described below. The
impostor can use a mechanical system like a pen in the mouth for instance, or simply
use a natural modification of his voice by adopting a foreign accent for example.
Some features that could change are presented in table 2 as proposed in [33].
Voice Disguise and Automatic Detection: Review and Perspectives 103
The techniques used for both categories are detailed, with references, in Section 3.
unlimited number of new voices. This offers an easy way to disguise your voice in
voice chat, PC Phone, online gambling, voice mail, voice message and so on.
The best way to change the voice consists in modifying the prosody by using a
modification of the pitch or of the duration of the segments. There are some methods
able to modify the time scale or the frequency range independently. The change of the
time scale offers the possibility to alter the duration of the signal without changing the
frequency properties. The change of the frequency range is the opposite, that is to say
modifying the pitch of a sound without changing its duration, while keeping the
position of the formants.
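As an illustration of these two independent operations, off-the-shelf tools can already perform them; the snippet below uses librosa's phase-vocoder based effects (the file name is a placeholder). Note that this simple pitch-shifting method also moves the formants, unlike the formant-preserving modification described above.

import librosa
import soundfile as sf

# Load a voice recording (placeholder file name).
y, sr = librosa.load("voice.wav", sr=None)

# Time-scale modification: 1.3x faster, frequency content unchanged.
y_fast = librosa.effects.time_stretch(y, rate=1.3)

# Pitch-scale modification: raise the pitch by 4 semitones, same duration
# (with this method the formants are shifted together with the pitch).
y_high = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)

sf.write("voice_fast.wav", y_fast, sr)
sf.write("voice_high.wav", y_high, sr)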
A technique which gives some interesting results is TD-PSOLA (Time Domain
Pitch Synchronous Overlap and Add) [30] [12]. This method proposes a flexible
creation and modification of high quality speech. Using PSOLA, prosodic changes are
easily performed. The following figure describes this technique:
There are many techniques which have been developed in voice conversion because
of the extended field of applications: speech synthesis, interpreted telephony or very
low bit rate speech coding. Voice conversion is the process of transforming the
characteristics of speech uttered by a source speaker, such that a listener would
believe the speech was pronounced by a target speaker. Different techniques are
possible for a voice conversion:
Fig. 2. Training step (block diagram: source and target features, DTW alignment, conversion function)
The historical technique was proposed in [1], which described the means to
elaborate a mapping codebook between a source and a target voice. The
determination of the codebook is based on vector quantization. The main problem
of this method is the question of discontinuities when moving from one codebook entry to
another over time.
[Block diagram of the conversion step: source speech, analysis, features, conversion function F, features, synthesis, converted speech]
then organized by vector quantization into 64 different classes. The training data is
thus automatically labelled, using symbols that correspond to the above classes. A set
of HMMs (Hidden Markov Models) is then trained on this data, providing a stochastic
model for each class. The result of the ALISP (Automatic Language Independent
Speech Processing) training is an inventory of client speech segments, divided into 64
classes according to a codebook of 64 symbols. The encoding phase consists in
replacing the segments of the source voice, recognized by the HMMs, by their equivalents from
the target voice. The results (Fig. 4) on an automatic speaker recognition system are
presented below, where we notice a significant degradation of the recognition
performance.
Fig. 4. DET curves (miss probability, in %) showing the degradation of the speaker recognition performance under ALISP-based voice forgery
Imitation is a well-known case of voice conversion, most often used in the field of
advertising. Zetterholm [38] and Mejvaldova [28] studied the different techniques
used by a professional imitator to impersonate voices. The principle is based on
the impersonation of some specific characteristics of a target voice linked to prosody,
pitch register, voice quality, dialect or speaking style. It is impossible for an
impersonator to imitate the whole voice register of the target speaker, but some specific
parameters are enough to disturb the recognition. E. Zetterholm has demonstrated that
the impersonator adapted his fundamental frequency and the position of his formants
to the target voice. In [6], a study presents the difficulties encountered by an automatic
recognition system when confronted with impersonation. A professional impersonator
imitated two target speakers: Blomberg et al. show the impersonator's adjustments
of the formant positions.
This category of voice disguise includes the alteration of the voice by using a mechanical
system, like a pen in the mouth or a handkerchief over the mouth, or by changing
the prosody (dialect, accent or pitch register, to obtain a low- or high-frequency voice
modification), in order to trick identity perception. Modifications of the speech
rate, the use of deliberate pauses, the removal of extraneous syllables between
words, the removal of vocal fry and the heightening of prosodic variation
(intonation and stress) are the most common voice transformations.
In this section we are interested in deliberate changes of the voice prosody,
that is to say the supra-segmental characteristics of the signal. Several parameters
can influence the prosody of an individual, such as:
- the physiological characteristics of the vocal tract (linked to the age and the
gender)
- the emotion
- the social origin (accent)
- region origin (dialect)
- etc …
In theory, a subject could modify all of those parameters to disguise his voice by
changing the prosody. A classification of prosody is provided in [4], where three
components could be proposed (Table 3).
In practice, the prosodic features most easily used to change a voice are the acoustic
and the perceptual parameters. Some of them can be easily measured.
second formant as compared to the same vowels produced by the same speaker using
modal phonation.
In an interesting work [20], the authors analyse the effect of three kinds of
disguise (falsetto speech, lowered voice pitch and pinched nose) on the performance
of an automatic speaker recognition system. The experiment is limited to the
estimation of the performance degradation when the suspect is known to be the
speaker of the disguised speech. The results depend on the reference
population. If it contains speech data exhibiting the same type of disguise, the
influence on performance is marginal. On the contrary, if the reference population
is assembled from normal speech only, the effects have important consequences on the
performance of the system for the three cases of disguise. These different works
reveal the lack of a global study on voice disguise; global in the sense of the number
of disguises and the number of features studied, but also in the sense of the technical
approaches used.
In order to complete these different works on voice disguise, we have decided to
examine the different parameters studied in the case of pathological voices [17] and
in the analysis of emotion [2] [15] [35], and to determine whether we could apply these works
to our study. In addition, we propose to compare the classification of these features to
the classification of the features used in automatic speaker or speech recognition, the
MFCC (Mel Frequency Cepstral Coefficients), applied to voice disguise.
(Figure: overview of the proposed system; feature extraction followed by pattern classification, separating the normal voice from disguised voices n°1 to n.)
Our aim is to compare two kinds of approaches: the first one is based on the
principle of an automatic speaker recognition system, by classification of cepstral
coefficients (MFCC, Mel Frequency Cepstral Coefficients); the second one concerns the
classification of each kind of disguise based on prosodic parameters. In this
second approach we plan to estimate boundaries between the disguises (inter-disguise),
but also to analyse the main components of each kind of disguise (intra-disguise).
The principle of this method is to build a Universal Background Model, on the one
hand for all disguises, in order to detect whether a voice is disguised or not, and on the other
hand for each kind of disguise, in order to classify a speech sample into one of the
disguises studied. The first step consists in extracting the features; in this
phase we choose to use the MFCC. The mel frequency cepstral coefficients (MFCC) are among the
most important features used in various kinds of speech applications. These
features represent the spectral contour of the signal and are computed from short-time
frames of the speech signal. A Fourier transform is applied to calculate the
magnitude spectrum, which is then integrated using a mel-spaced filterbank. This
operation transforms the spectrum of each frame into a succession of coefficients which
characterize the energy in each frequency band of the mel range. A DCT is
applied to the logarithm of those coefficients.
In addition to the MFCC, we calculate the first and second derivatives of the
MFCC in order to take into account dynamic information, which captures the
temporal structure of the speech signal, for instance the phenomenon of
co-articulation.
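As a concrete illustration, a minimal sketch of such a front end with librosa follows; the analysis parameters used by the authors are not specified, so the values below are assumptions.

# MFCC features with first and second derivatives (13 + 13 + 13 = 39).
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=None)          # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
d1 = librosa.feature.delta(mfcc, order=1)            # dynamic information
d2 = librosa.feature.delta(mfcc, order=2)            # acceleration
features = np.vstack([mfcc, d1, d2]).T               # (n_frames, 39)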
The second main step of this approach is the creation of a universal background
model, trained for all disguises and for each kind of disguise. The purpose of the first
UBM (all disguises) is to discriminate a disguised voice from a normal voice. This
UBM is built using statistical modelling based on GMMs (Gaussian Mixture Models), trained
on the entire disguised-voice corpus. A Gaussian mixture density is a weighted sum of
M component densities:
$$ f(x \mid \mathrm{disgMod}) = \sum_{j=1}^{M} g_j \, N(x, \mu_j, \Sigma_j) $$

where
- $x$ is a $D$-dimensional vector resulting from the feature extraction,
- $N(x, \mu_j, \Sigma_j)$, $j = 1, \dots, M$, are the component densities,
- $g_j$ are the mixture weights,
- disgMod is the disguise model.
The different parameters ($g_j$, $\mu_j$, $\Sigma_j$) of the UBM are estimated by the EM
(Expectation–Maximization) algorithm. This algorithm is a general technique for
maximum likelihood estimation (MLE). The first step (E: expectation) consists in
evaluating the likelihood of the data given the parameters of the last iteration, and the
second step (M: maximization) consists in maximizing this quantity with respect to the
parameters.
The same principle is used to estimate the specific UBM of each kind of disguise.
The last step of this approach is the verification phase, which consists in determining whether
an unknown recording comes from a normal voice or from a disguised voice, and from which
kind of disguise, by calculating a likelihood ratio between the competing hypotheses.
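A minimal sketch of this detection step is given below, assuming feature matrices of shape (n_frames, dim) and using scikit-learn's EM-trained Gaussian mixtures; the model sizes and the threshold are illustrative, not values prescribed by the chapter.

# GMM-based disguise detection by log-likelihood ratio.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=64):
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=200)
    return gmm.fit(features)

def disguise_llr(test_feats, gmm_disguise, gmm_normal):
    # score_samples returns per-frame log-likelihoods; averaging them gives a
    # duration-normalized score for the whole recording.
    return (gmm_disguise.score_samples(test_feats).mean()
            - gmm_normal.score_samples(test_feats).mean())

# Decision: accept the "disguised" hypothesis if the ratio exceeds a
# threshold tuned on development data.
# is_disguised = disguise_llr(x, gmm_disguise, gmm_normal) > threshold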
The second objective consists in detecting the disguise by studying some specific
prosodic parameters. Perceptually, it is easy to notice a difference between an original
voice and a voice transformed by such a technique, but the question is which main
parameters have been changed. We can expect each kind of disguise to have some
specific parameters that change.
We are using PRAAT [8] for a statistical study of the following features.
F0 features:
- min, max, mean, standard deviation;
- jitter, which measures the cycle-to-cycle instability of the fundamental frequency:

$$ \mathrm{Jitter} = \frac{\frac{1}{n-1}\sum_{i=1}^{n-1} |F0_{i+1} - F0_i|}{F0_{mean}} $$
Energy features:
- mean energy $E_0$, expressed in decibels as $i_{dB} = 20 \log_{10}(E(x)/E_0)$;
- energy proportion in 5 frequency bands (0–1 kHz, 1–2 kHz, 2–3 kHz, 3–4 kHz, 4–5 kHz);
- shimmer, which measures the cycle-to-cycle instability of the amplitude:

$$ \mathrm{Shimmer} = \frac{\frac{1}{n-1}\sum_{i=1}^{n-1} |E_{i+1} - E_i|}{E_0} $$
Rhythm features:
- speaking rate, based on the number of phonemes per second;
- voiced/unvoiced regions;
- ratio of speech time to pause time.
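The jitter and shimmer measures above translate directly into code; a sketch, assuming per-cycle F0 and energy arrays have already been extracted (e.g. with PRAAT), is:

# Jitter and shimmer as defined by the two formulas above.
import numpy as np

def jitter(f0):
    # mean absolute cycle-to-cycle F0 difference, relative to the mean F0
    return np.abs(np.diff(f0)).mean() / f0.mean()

def shimmer(e, e0=None):
    # mean absolute cycle-to-cycle energy difference, relative to E0
    e0 = e.mean() if e0 is None else e0
    return np.abs(np.diff(e)).mean() / e0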
After this extraction phase, the aim is to organize and classify the different features
in order to find inter-disguise boundaries.
Inter-disguise
Contrary to the first approach, where we used the MFCC, which have the property of
being uncorrelated, a first task in this approach consists in extracting the main
information from the global distribution of the coefficients before classification.
Different methods of data analysis and classification will be used in order to select the
main features which influence a disguise and reduce the dimension via a PCA
(Principal Component Analysis) for instance. The use of PCA allows the number of
variables in a multivariate data set to be reduced, whilst retaining as much as possible
of the variation present in the data set. This reduction is achieved by taking p
variables X1, X2,…, Xp and finding the combinations of these to produce principal
components (PCs) PC1, PC2,…, PCp, which are uncorrelated. These PCs are also
termed eigenvectors. PCs are ordered so that PC1 exhibits the greatest amount of the
variation, PC2 exhibits the second greatest amount of the variation, PC3 exhibits the
third greatest amount of the variation, and so on. The aim is to be able to describe the
original number of variables (X variables) as a smaller number of new variables
(PCs).
We will also apply an LDA (Linear Discriminant Analysis) in order to increase the
class separation, by optimizing the ratio between the inter-class and intra-class variances.
After this step of data description and organization, we will structure our database by
applying supervised classification methods based on training.
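A sketch of this processing chain (PCA for dimensionality reduction followed by LDA) is given below; the variable names and the number of retained components are assumptions.

# PCA followed by LDA on prosodic feature vectors X with disguise labels y.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def project_features(X, y, n_pcs=10):
    pca = PCA(n_components=n_pcs).fit(X)
    X_pca = pca.transform(X)                       # uncorrelated principal components
    lda = LinearDiscriminantAnalysis().fit(X_pca, y)
    return lda.transform(X_pca), pca, lda          # discriminant projection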
The classification step planned is based on GMM modelling, as previously
explained. The principle is to build a GMM model for each kind of disguise in a
training phase, and to compute as a distance the likelihood
between a test vector and the model.
Intra-disguise
From the same set of prosodic features, a study has been planned in order to evaluate the main
characteristics of each disguise. The idea is to extract the most
significant components for each disguise and the influence of some specific features.
The algorithm used is a PCA.
Finally, we investigated a measurement of how the vocalic triangle [39] [40] [41]
moves between a normal voice and a disguise n°k voice. The study, based on a
limited corpus of disguised voices, provides interesting results, as shown in the
corresponding figure.
This vocalic triangle is based on a corpus of twenty people who pronounced the
different French vowels in five disguise voices (including the normal voice).
The aim of the PCA and of the vocalic triangle comparison is to find
specific clues for each kind of disguise.
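The vocalic triangle is typically characterized by the first two formants of the vowels; a rough sketch of estimating F1 and F2 from an LPC analysis (a common recipe, not necessarily the authors' exact procedure; the frame, model order and frequency cut-off are assumptions) is:

# Crude F1/F2 estimation from the roots of an LPC polynomial.
import numpy as np
import librosa

def first_two_formants(frame, sr, order=10):
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    a = librosa.lpc(frame, order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                 # keep one root of each pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    freqs = freqs[freqs > 90]                         # discard near-DC roots
    return freqs[0], freqs[1]                         # rough F1, F2 estimates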
References
1. M. Abe, S. Nakamura, K. Shikano, H. Kuwabara, “Voice conversion through vector
quantization,” Proc. ICASSP 88, New-York, 1988
2. N. Amir, “Classifying emotions in speech: a comparison of methods” in Proceedings
EUROSPEECH 2001, Scandinavia
3. G. Baudoin, J. Cernocky, F. El Chami, M. Charbit, G. Chollet, D. Petrovska-Delacretaz.
“Advances in Very Low Bit Rate Speech Coding using Recognition and Synthesis
Techniques,” Proc. Of the 5th Text, Speech and Dialog workshop, TSD 2002, Brno, Czech
Republic, pp. 269-276, 2002
4. F. Beaugendre, “Modèle de l’intonation pour la synthèse de la parole,” in Fondements et
perspectives en traitement automatique de la parole, Aupelf-Uref (ed.), 1995
5. F. Bimbot, G. Chollet, P. Deleglise, C. Montacié, “Temporal Decomposition and
Acoustic-phonetic Decoding of Speech,” Proc. ICASSP 88, New-York, pp. 445-448, 1988
6. M. Blomberg, Daniel Elenius, E. Zetterholm, “Speaker verification scores and acoustics
analysis of a professional impersonator,” Proc. FONETIK 2004
7. R. Blouet, C. Mokbel, G. Chollet, “BECARS: a free software for speaker recognition,”
ODYSSEY 2004, Toledo, 2004
8. P. Boersma, D. Weenink, “PRAAT: doing phonetics by computer. http://www.praat.org”
9. O. Cappe, Y. Stylianou, E Moulines, “Statistical methods for voice quality
transformation,” Proc. of EUROSPEECH 95, Madrid, 1995
32. P. Perrot, G. Aversano, R. Blouet, M. Charbit, G. Chollet, “Voice forgery using ALISP,”
Proc. ICASSP 2005, Philadelphia, 2005
33. R. Rodman, Speaker Recognition of disguised voices: a program for research, consortium
on Speech Technology Conference on Speaker by man and machine: direction for forensic
applications, Ankara, Turkey, COST 250, 1998
34. H. Valbret, E. Moulines, J.P. Tubach, “Voice trans-formation using PSOLA
technique” Proc. ICASSP 92, San Francisco, 1992
35. I. Shafran, M. Mohri, “A comparison of classifiers for detecting emotion from speech,”
Proc. ICASSP 2005, Philadelphia, 2005
36. Y. Stylianou, O. Cappe, “A system for voice conversion based on probabilistic
classification and a harmonic plus noise model,” Proc ICASSP 98, Seattle, WA, pp. 281-
284, 1998
37. Y. Stylianou, O. Cappe, E. Moulines, “Continuous probabilistic transform for voice
conversion,” IEEE Trans. Speech and Audio Processing, 6(2):131-142, March 1998.
38. Zetterholm, E. Voice Imitation. A phonetic study of perceptual illusions and acoustic
success. Dissertation. Department of Linguistics and Phonetics, Lund University. 2003
39. Rostolland D., 1982a, "Acoustic features of shouted voice", Acustica 50, pp. 118-125.
40. Rostolland D., 1982b, "Phonetic structure of shouted voice", Acustica 51, pp. 80-89.
41. Rostolland D., 1985, "Intelligibility of shouted voice", Acustica 57, pp. 103-121
42. Abboud B. Bredin H, Aversano G., Chollet G. “Audio visual forgery in identity
verification” Workshop on Nonlinear Speech Processing, Heraklion, Crete, 20-23 Sept.
2005
43. B.S. Atal, “Automatic speaker recognition based on pitch contours”, Journal of Acoustical
Society of America, 52:1687-1697, 1972.
44. J. Zalewski, W. Maljewski, H. Hollien, “Cross correlation between long-term speech
spectra as a criterion for speaker identification,” Acustica 34:20-24, 1975
45. http://www.zdnet.fr/telecharger/windows/fiche/0,39021313,11009007s,00.htm
Audio-visual Identity Verification:
An Introductory Overview
1 Introduction
Identity verification based on talking-face biometrics is getting more and more
attention. As a matter of fact, a talking face offers many features with which to achieve
robust identity verification: it includes speaker verification, face recognition and their
multimodal combination. Moreover, whereas iris or fingerprint biometrics might
appear intrusive and need user collaboration, a talking-face identity verification
system is not intrusive and can even operate without the user noticing it.
Though it can be very robust thanks to the complementarity of speaker and
face recognition, these two modalities also share a common weakness: an impostor
can easily record the voice or photograph the face of his/her target without
him/her noticing it, and thus fool a talking-face system with very little effort.
Moreover, a higher-effort impostor might perform both voice conversion and face
animation, producing an impersonation that is even more difficult to detect,
or use them to mask his/her identity (see Voice Disguise and Automatic Detection:
Review and Perspectives by Perrot et al. in this volume). We will show that
explicit talking-face modeling (i.e., the coupled modeling of synchronous acoustic and visual
features) is an effective way to overcome these weaknesses.
The remainder of this chapter is organized as follows. After a short review of
the most prominent methods used for speaker verification and face recognition,
low- and high-effort forgery scenarios are described, including simple replay attacks,
voice conversion and face animation. Finally, the particular task of replay attacks
detection is addressed, based on the detection of a lack of synchrony between
voice and lip motion.
$$ p(x \mid \lambda) = \sum_{i=1}^{N} \frac{w_i}{\sqrt{(2\pi)^d\,|\Gamma_i|}} \exp\left(-\frac{1}{2}(x-\mu_i)^T \Gamma_i^{-1}(x-\mu_i)\right) \qquad (1) $$
Many databases are available to the research community to help evaluate multi-
modal biometric verification algorithms, such as BANCA [17], BT-DAVID [18],
XM2VTS [19] and BIOMET [20]. Different protocols have been defined for
evaluating biometric systems on each of these databases, but they share the
assumption that impostor attacks are zero-effort attacks. For example, in the
particular framework of the BANCA database, each subject records one client
access and one impostor access per session. However, the only difference between
the two is the particular message that the client utters—their name and address
in the first case; the target's name and address in the second. Thus the impersonation
takes place without any knowledge of the target's face, age, or voice.
These zero-effort impostor attacks are unrealistic—only a fool would attempt to
imitate a person without knowing anything about them. In this work we adopt
more realistic scenarios in which the impostor has more information about the
target.
Natural talking-face synthesis is a very challenging task: to be considered natural,
a synthetic face has to be photo-realistic and reproduce the subtle texture and shape
variations that are vital to talking-face representation and recognition.
Many modeling techniques exist which achieve various degrees of realism
and flexibility. The first class of techniques uses 3D meshes to model the face
shape [24,25]. To obtain natural appearance a 3D scan image of the subject face
is texture-mapped on the 3D parameterized deformable model.
An alternative approach is based on morphing between 2D images. This tech-
nique produces photo-realistic images of new shapes by performing
interpolation between previously seen shapes and is successfully combined with
geometric 3D transformations to create realistic facial models from photos and
construct smooth transitions between different facial expressions [26]. Using the
same technique, multi-dimensional deformable models [27] can generate inter-
mediate video-realistic mouth movements of a talking face from a small set of
manually selected mouth samples. Morphing is also used in the context of the
video-rewrite to change the identity of a talking face [28].
Face Tracking. It has already been shown that the active appearance model
[30] is a powerful tool for object synthesis and tracking. It uses Principal Com-
ponent Analysis (PCA) to model both shape and texture variations seen in a
training set of visual objects. After computing the mean shape s̄ and aligning all
shapes from the training set by means of a Procrustes analysis, the statistical
shape model is given by (2)
si = s̄ + Φs bsi (2)
where si is the synthesized shape, Φs is a truncated matrix describing the princi-
pal modes of shape variations in the training set and bsi is a vector that controls
the synthesized shape.
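As an illustration (a sketch, not the authors' implementation), the shape model of Eq. (2) can be built by PCA on Procrustes-aligned landmark shapes; the array shapes and number of modes below are assumptions.

# Statistical shape model: mean shape plus principal modes of variation.
import numpy as np
from sklearn.decomposition import PCA

def build_shape_model(shapes, n_modes=10):
    # shapes: (n_samples, 2 * n_landmarks) aligned landmark coordinates
    pca = PCA(n_components=n_modes).fit(shapes)
    s_bar = pca.mean_                     # mean shape
    phi_s = pca.components_.T             # principal modes of shape variation
    return s_bar, phi_s

def synthesize_shape(s_bar, phi_s, b_s):
    # Eq. (2): s_i = s_bar + Phi_s * b_s
    return s_bar + phi_s @ b_s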
It is then possible to warp textures from the training set of faces onto the
mean shape s̄ in order to obtain shape-free textures. Similarly, after computing
the mean shape-free texture t̄ and normalizing all textures from the training
set relatively to t̄ by scaling and offset of the luminance values, the statistical
texture model is given by (3)
ti = t̄ + Φt bti (3)
ti = t̄ + Qt ci (5)
where Qs and Qt are truncated matrices describing the principal modes of com-
bined appearance variations in the training set, and ci is a vector of appearance
parameters simultaneously controlling both shape and texture.
Given the parameter vector ci , the corresponding shape si and shape-free
texture ti can be computed respectively using (4) and (5). The reconstructed
shape-free texture is then warped onto the reconstructed shape in order to obtain
the full appearance. Displacing each mode of the mean appearance vector c̄
changes both the texture and the shape of the coded synthetic faces.
Furthermore, in order to allow pose displacement of the model, it is necessary
to add to the appearance parameter vector ci a pose parameter vector pi allowing
control of scale, orientation and position of the synthesized faces.
Since a pair formed by an appearance parameter vector c and a pose parameter vector
p represents a face, the active appearance model can automatically adjust those
parameters to a target face by minimizing a residual image r(c, p), which is the
texture difference between the synthesized face and the corresponding region of
the image it covers, as shown in (6) and (7).
In the following, the appearance and pose parameters obtained by this opti-
mization procedure will be denoted respectively as cop and pop .
$$ c_{op} = \arg\min |r[(c + \delta c), p]|^2 \qquad (6) $$

$$ R_t = \left(\frac{\partial r}{\partial p}^{T} \frac{\partial r}{\partial p}\right)^{-1} \frac{\partial r}{\partial p}^{T} \qquad (11) $$
These linear relationships are then used to determine the optimal appearance
and pose vectors cop and pop using a gradient descent algorithm [31].
Hence, adapting the active appearance model to each frame of a video sequence
showing a speaking face makes it possible to track the facial movements of the face, as
shown in Fig. 1. The experiments are conducted on the BANCA database which
Face Animation. Lip motion is defined by the position of the MPEG-4 compatible
feature points on each frame of the tested sequence. A set of 18 feature points
were selected at key positions on the outer and inner lip contours, as shown in Fig. 2.
This motion can be injected into any target image showing an unknown face using
the following procedure.
Fig. 2. MPEG-4 compatible feature points located at the inner and outer contours of
the training lips
Fig. 3. Appearance model adaptation and automatic feature point placement on the
target face
Fig. 4. Lip motion cloning from the driving sequence (left) to animate the static target
image (right)
First, the lip pixels on the target image are detected using the lip localization
method described in Sec. 3.4, as shown in Fig. 1. Then, starting from this rough
position, an appearance model is initialized and adapted to the target lips using
the gradient descent algorithm. This procedure makes it possible to automatically place the
feature points at the correct positions on the target lips, as shown in Fig. 3. An
artificial motion is then obtained by displacing these feature points to match
each frame of the driving sequence. A Delaunay triangulation coupled with a
piecewise affine transform is used to interpolate pixel colour values. An example
of lip motion cloning from a driving sequence onto an unknown target face is shown
in Fig. 4.
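A sketch of this warping step with scikit-image (which builds the Delaunay triangulation internally) follows; the landmark arrays and the handling of the image border are assumptions, and in practice fixed anchor points around the face keep the rest of the image unchanged.

# Piecewise affine warp moving the target lip landmarks to driving positions.
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def clone_lip_frame(target_image, target_pts, driving_pts):
    # target_pts:  (N, 2) landmark (x, y) positions in the target image
    # driving_pts: (N, 2) desired positions taken from the driving frame
    tform = PiecewiseAffineTransform()
    # warp() expects a map from output coordinates to input coordinates,
    # so the transform is estimated from driving positions to target positions.
    tform.estimate(driving_pts, target_pts)
    return warp(target_image, tform, preserve_range=True)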
Visual Speech. Raw pixels are the visual equivalent of the audio raw energy. In
[35] and [40], the intensity of gray-level pixels is used as is. In [21], their sum over
the whole region of interest is computed, leading to a one-dimensional feature.
Holistic methods consider and process the region of interest as a whole source
of information. In [39], a two-dimensional discrete cosine transform (DCT) is
applied on the region of interest. In [44], the authors perform a projection of
the region of interest on vectors resulting from a principal component analysis
(PCA): they call the principal components eigenlips. Lip-shape methods consider
and process the lips as a deformable object from which geometrical features can be
derived, such as the height, width and openness of the mouth, the position of the lip corners,
etc. Mouth width, mouth height and lip protrusion are computed in [45]. In
[46,47], a deformable template composed of several polynomial curves follows
the lip contours: it allows the computation of the mouth width, height and area.
In [41], the lip shape is summarized with a one-dimensional feature: the ratio of
lip height to lip width.
4.2 Measures
Once audio-visual speech features are extracted, the next step is to measure their
degree of correspondence. The following paragraphs overview what is proposed
in the literature.
$$ R(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y} \qquad (12) $$
In [35], the authors compute the Pearson product-moment coefficient between
the average acoustic energy X and the value Y of the pixels of the video to
determine which area of the video is most correlated with the audio. This makes it possible
to decide which of two people appearing in a video is talking.
In information theory, the mutual information MI(X, Y) of two random variables
X and Y is a quantity that measures the mutual dependence of the two
variables. In the case where X and Y are discrete random variables, it is defined as
in equation (13):

$$ MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \qquad (13) $$
In [35,51,39,40], the mutual information is used to locate the pixels in the video
which are most likely to correspond to the audio signal: the face of the person
who is speaking clearly corresponds to these pixels.
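Both measures are easy to compute on paired per-frame signals; a sketch is given below (the mutual information is estimated from a joint histogram, and the number of bins is an arbitrary choice).

# Pearson correlation (Eq. 12) and histogram-based MI estimate (Eq. 13).
import numpy as np

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def mutual_information(x, y, bins=16):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                      # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())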
$$ p(z \mid \lambda) = \sum_{i=1}^{N} w_i \, N(z; \mu_i, \Gamma_i) \qquad (16) $$

$$ C_\lambda(X, Y) = \frac{1}{T} \sum_{t=1}^{T} p([x_t, y_t] \mid \lambda) \qquad (17) $$
The authors of [39] propose to model audio-visual speech with Hidden Markov
Models (HMMs). Two speech recognizers are trained: a classical audio-only
recognizer [52], and an audiovisual speech recognizer as described in [34]. Given
a sequence of audiovisual samples ([xt , yt ], t ∈ [1, T ]), the audio-only system gives
a word hypothesis W. Then, using the HMM of the audiovisual system, what
the authors call a measure of plausibility P(X, Y) is computed as follows:
5 Conclusion
can also be used by impostors. This is the case of replay attacks, which can
be detected by measuring a possible lack of synchrony between voice and lip
motion.
Acknowledgments
This work was partially supported by the European Commission through our
participation to the SecurePhone project (http://www.secure-phone.info/)
and the BioSecure Network of Excellence (http://www.biosecure.info/)
References
1. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker Verification using Adapted
Gaussian Mixture Models. Digital Signal Processing 10 (2000) 19 – 41
2. Dempster, A., Laird, N., Rubin, D.: Maximum Likelihood from Incomplete Data
via the EM Algorithm. J. of Royal Statistical Society 39(1) (1977) 1 – 22
3. Blouet, R., Mokbel, C., Mokbel, H., Sanchez, E., Chollet, G.: BECARS: a Free
Software for Speaker Verification. In: ODYSSEY 2004. (2004) 145 – 148
4. Mokbel, C.: Online Adaptation of HMMs to Real-Life Conditions: A Unified Frame-
work. In: IEEE Transactions on Speech and Audio Processing. Volume 9. (2001)
342 – 357
5. Brunelli, R., Poggio, T.: Face recognition: Features versus templates. IEEE Trans.
on Pattern Analysis and Machine Intelligence 15(10) (1993) 1042–1052
6. Wiskott, L., Fellous, J.M., Krüger, N., von der Malsburg, C.: Face recognition
by elastic bunch graph matching. In: Intl. Conference on Computer Analysis of
Images and Patterns. Number 1296, Heidelberg, Springer-Verlag (1997) 456–463
7. Abboud, B., Davoine, F., Dang, M.: Expressive face recognition and synthesis. In:
IEEE CVPR workshop on Computer Vision and Pattern Recognition for Human
Computer Interaction, Madison, U.S.A. (2003)
8. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuro-
science 3(1) (1991) 71–86
9. Moghaddam, B., Pentland, A.: Beyond euclidean eigenspaces: Bayesian matching
for visual recognition. In: Face Recognition: From Theories to Applications, Berlin,
Springer-Verlag (1998)
10. Li, S., Lu, J.: Face recognition using the nearest feature line method. IEEE
Transactions on Neural Networks 10 (1999) 439–443
11. Vapnik, V. In: Statistical Learning Theory. Wiley (1998)
12. Bartlett, M.S., Littlewort, G., Fasel, I., Movellan, J.R.: Real time face detection
and facial expression recognition: Development and applications to human com-
puter interaction. In: IEEE CVPR workshop on Computer Vision and Pattern
Recognition for Human Computer Interaction, Madison, U.S.A. (2003)
13. Heisele, B., Ho, P., J.Wu, Poggio, T.: Face recognition: Component-based versus
global approaches. In: Computer Vision and Image Understanding. Volume 91.
(2003) 6–21
14. Padgett, C., Cottrell, G., Adolphs, R.: Categorical perception in facial emotion
classification. In: Proceedings of the Eighteenth Annual Cognitive Science Confer-
ence., San Diego, CA (1996) 249–253
15. Lien, J., Zlochower, A., Cohn, J., Li, C., Kanade, T.: Automatically recognizing
facial expressions in the spatio temporal domain. In: Proceedings of the Workshop
on Perceptual User Interfaces, Alberta, Canada (1997)
16. Bredin, H., Dehak, N., Chollet, G.: GMM-based SVM for Face Recognition. In-
ternational Conference on Pattern Recognition (2006)
17. Bailly-Baillière, E., Bengio, S., Bimbot, F., Hamouz, M., Kittler, J., Mariéthoz, J.,
Matas, J., Messer, K., Popovici, V., Porée, F., Ruiz, B., Thiran, J.P.: The BANCA
Database and Evaluation Protocol. In: Lecture Notes in Computer Science. Volume
2688. (2003) 625 – 638
18. BT-DAVID: (http://eegalilee.swan.ac.uk/)
19. Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTSDB: The Ex-
tended M2VTS Database. Audio- and Video-Based Biometric Person Authentica-
tion (1999) 72 – 77
20. Garcia-Salicetti, S., Beumier, C., Chollet, G., Dorizzi, B., Jardins, J.L., Lunter,
J., Ni, Y., Petrovska-Delacretaz, D.: BIOMET: a Multimodal Person Authentica-
tion Database including Face, Voice, Fingerprint, Hand and Signature Modalities.
Audio- and Video-Based Biometric Person Authentication (2003) 845 – 853
21. Bredin, H., Miguel, A., Witten, I.H., Chollet, G.: Detecting Replay Attacks in
Audiovisual Identity Verification. In: IEEE International Conference on Acoustics,
Speech, and Signal Processing. (2006)
22. Stylianou, Y., Cappé, O., Moulines, E.: Statistical Methods for Voice Quality
Transformation. European Conference on Speech Communication and Technology
(1995)
23. Perrot, P., Aversano, G., Chollet, G., Charbit, M.: Voice Forgery Using ALISP:
Indexation in a Client Memory. In: ICASSP 2005. (2005)
24. Romdhani, S., Vetter, T.: Efficient, robust and accurate fitting of a 3D morphable
model. In: IEEE Intl. Conference on Computer Vision, Nice, France (2003)
25. Terzopoulos, D., Waters, K.: Analysis and synthesis of facial image sequences using
physical and anatomical models. IEEE Trans. on Pattern Analysis and Machine
Intelligence 15(6) (1993) 569–579
26. Pighin, F., Hecker, J., Lischinski, D., Szeliski, R., Salesin, D.: Synthesizing realistic
facial expressions from photographs. In: Siggraph proceedings. (1998) 75–84
27. Ezzat, T., Geiger, G., Poggio, T.: Trainable videorealistic speech animation. In:
ACM Siggraph, San Antonio, Texas (2002)
28. Bregler, C., Covel, M., Slaney, M.: Video rewrite: Driving visual speech with audio.
In: Siggraph proceedings. (1997) 353–360
29. Ahlberg, J.: An active model for facial feature tracking. EURASIP Journal on
applied signal processing 6 (2002) 566–571
30. Abboud, B., Davoine, F., Dang, M.: Facial expression recognition and synthesis
based on an appearance model. Signal Processing: Image Communication 10(8)
(2004) 723–740
31. Cootes, T., Edwards, G., Taylor, C.: Active appearance models. IEEE Trans. on
Pattern Analysis and Machine Intelligence 23(6) (2001) 681–685
32. Bailly-Bailliere, E., Bengio, S., Bimbot, F., Hamouz, M., Kittler, J., Mariethoz, J.,
Matas, J., Messer, K., Popovici, V., Pore, F., Ruiz, B., Thiran, J.P.: The BANCA
database and evaluation protocol. In: 4th International Conference on Audio- and
Video-Based Biometric Person Authentication, AVBPA, Springer-Verlag (2003)
33. Bredin, H., Chollet, G.: Measuring Audio and Visual Speech Synchrony: Methods
and Applications. In: International Conference on Visual Information Engineering.
(2006)
34. Potamianos, G., Neti, C., Luettin, J., Matthews, I.: 10. In: Audio-Visual Automatic
Speech Recognition: An Overview. MIT Press (2004)
35. Hershey, J., Movellan, J.: Audio-Vision: Using Audio-Visual Synchrony to Locate
Sounds. Neural Information Processing Systems (1999)
36. Fisher, J.W., Darell, T.: Speaker Association With Signal-Level Audiovisual Fu-
sion. IEEE Transactions on Multimedia 6(3) (2004) 406–413
37. Slaney, M., Covell, M.: FaceSync: A Linear Operator for Measuring Synchroniza-
tion of Video Facial Images and Audio Tracks. Neural Information Processing
Society 13 (2000)
38. Cutler, R., Davis, L.: Look Who’s Talking: Speaker Detection using Video and
Audio Correlation. International Conference on Multimedia and Expo (2000) 1589–
1592
39. Nock, H., Iyengar, G., Neti, C.: Assessing Face and Speech Consistency for Mono-
logue Detection in Video. Multimedia’02 (2002) 303–306
40. Iyengar, G., Nock, H., Neti, C.: Audio-Visual Synchrony for Detection of Mono-
logues in Video Archives. International Conference on Acoustics, Speech, and
Signal Processing (2003) 329–332
41. Chetty, G., Wagner, M.: “Liveness” Verification in Audio-Video Authentication.
Australian International Conference on Speech Science and Technology (2004) 358–
363
42. Sugamura, N., Itakura, F.: Speech Analysis and Synthesis Methods developed at
ECL in NTT–From LPC to LSP. Speech Communications 5(2) (1986) 199–215
43. Yehia, H., Rubin, P., Vatikiotis-Bateson, E.: Quantitative Association of Vocal-
Tract and Facial Behavior. Speech Communication (28) (1998) 23–43
44. Bregler, C., Konig, Y.: ”Eigenlips” for Robust Speech Recognition. International
Conference on Acoustics, Speech, and Signal Processing 2 (1994) 19–22
45. Goecke, R., Millar, B.: Statistical Analysis of the Relationship between Audio
and Video Speech Parameters for Australian English. International Conference on
Audio-Visual Speech Processing (2003)
46. Eveno, N., Besacier, L.: Co-Inertia Analysis for ”Liveness” Test in Audio-Visual
Biometrics. International Symposium on Image and Signal Processing Analysis
(2005) 257–261
47. Eveno, N., Besacier, L.: A Speaker Independent Liveness Test for Audio-Video
Biometrics. 9th European Conference on Speech Communication and Technology
(2005)
48. Chibelushi, C.C., Mason, J.S., Deravi, F.: Integrated Person Identification Using
Voice and Facial Features. IEE Colloquium on Image Processing for Security
Applications (4) (1997) 1–5
49. Smaragdis, P., Casey, M.: Audio/Visual Independent Components. International
Symposium on Independent Component Analysis and Blind Signal Separation
(2003) 709–714
50. Dolédec, S., Chessel, D.: Co-Inertia Analysis: an Alternative Method for Studying
Species-Environment Relationships. Freshwater Biology 31 (1994) 277–294
51. Fisher, J.W., Darrell, T., Freeman, W.T., Viola, P.: Learning Joint Statistical
Models for Audio-Visual Fusion and Segregation. Advances in Neural Information
Processing Systems (2001)
52. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition. In: IEEE. Volume 77. (1989) 257–286
53. Bengio, S.: An Asynchronous Hidden Markov Model for Audio-Visual Speech
Recognition. Advances in Neural Information Processing Systems (2003)
Text-Independent Speaker Verification:
State of the Art and Challenges
1 Introduction
Speaker verification consists of verifying a person’s claimed identity. It is a sub-
field of the more general speaker recognition field. In this chapter an overview of
recent developments in text-independent speaker verification is presented, with
a special focus on emerging techniques using high-level information. High-level
information could be obtained using well known speech recognition methods, or
using data-driven speech segmentation techniques.
Gaussian Mixture Models (GMM) are widely used statistical models for text-
independent speaker verification experiments. They have the advantage of being
well-understood statistical models, of being computationally inexpensive, and
Supported by the Swiss National Fund for Scientific Research, No. 200020-108024
and NoE BioSecure, IST-2002-507634.
testing phase, the same words or sentences as those pronounced during training,
(b) text-prompted, when the speaker must pronounce, during the test,
words and sentences imposed by the system, or (c) text-independent, when the
speaker can speak freely during the training (enrollment) and testing phases.
The choice of text dependence is related to the foreseen application. There
are different classes of applications. They can be classified in various families,
such as on-site access control applications, remote applications, forensic appli-
cations or applications where only telephone speech data of the target speaker
is available. As pointed out in [62] text-dependent (also called fixed-text) sys-
tems have a much higher level of performance then text-independent (also called
free-text) systems. Text-independent systems are needed in forensic and surveil-
lance applications where the user is not cooperative and often not aware of the
task. This chapter is focused on recent developments in text-independent speaker
verification experiments.
– speech parametrization,
– speaker modeling, and
– decision making.
The performance of the overall system is dependent on each of the steps listed
above. The choice of the feature extraction method and classifier depends on
the constraints of the planned application. In the following section the speech
parametrization methods commonly used to extract relevant parameters from
speech data are summarized.
Most of the speech parametrizations used for speaker recognition systems rely
on a cepstral representation of speech. In the following, Linear Prediction
Coding and filter-bank cepstral parameters are summarized.
$$ s(t) = \sum_{i=1}^{p} a_i \, s(t-i) + e(t) \qquad (1) $$
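A sketch of estimating the prediction coefficients $a_i$ of Eq. (1) by the autocorrelation method is given below; the window type and model order are assumptions.

# Linear prediction coefficients for one analysis frame (autocorrelation method).
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, p=12):
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation r[0], r[1], ...
    a = solve_toeplitz(r[:p], r[1:p + 1])              # solve the normal equations
    return a        # s(t) is predicted as sum_i a[i-1] * s(t - i)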
Another crucial step in any speaker verification system is the selection of the
frames conveying speaker information. The problem is to separate the incoming
signal into speech portions versus silence and background noise. One way to sep-
arate speech from non-speech or background noise is to compute a bi-Gaussian
model of the log-energy coefficients (see for example [57]). Normally, the useful
frames will belong to the Gaussian with the highest mean. A threshold dependent
on the parameters of this Gaussian could then be used for the selection of the
frames.
Another possibility to select the useful frames is to use Hidden Markov Models.
Such models are implemented in current speech recognizers, and can be used to
separate speech from non-speech and noisy data. This method is used for the
data-driven experiments described in the case study experiments in Sect. 7.3.
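A sketch of the bi-Gaussian frame selection described above follows; for brevity, frames are simply assigned to the higher-mean component rather than thresholded on its parameters.

# Speech/non-speech frame selection from a two-component GMM on log-energy.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_speech_frames(log_energy):
    e = np.asarray(log_energy).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2).fit(e)
    speech_comp = int(np.argmax(gmm.means_.ravel()))   # Gaussian with the highest mean
    return gmm.predict(e) == speech_comp               # boolean mask of useful frames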
Once the feature vectors are extracted and the irrelevant frames corresponding
to non-speech or noisy data are removed, the remaining data could be used
for extracting speaker specific information. This step is denoted as training (or
enrollment) step of the recognition system and the associated speech data used
to build the speaker model are called training (or enrollment) data. During the
test phase, the same feature vectors are extracted from the waveform of a test
utterance and matched against the relevant model.
As the focus of this chapter is on text-independent speaker verification, techniques
based on measuring distances between two patterns are only briefly mentioned,
with more emphasis given to statistical modeling techniques. The “distance-based”
methods were mainly used in systems developed at the beginning of the speaker
recognition history. In the early period (1930–1980), methods like template matching,
Dynamic Time Warping and Vector Quantization were used and evaluated on small
databases recorded in “clean conditions”, with little speech data and few speakers.
With the evolution of storage and computing capacities, methods using statistical
models, such as Gaussian Mixture Models, as well as Neural Networks, began to be
more widely used. Such methods are needed in order to cope with the more realistic
and difficult conditions of the continuously changing telecommunication environments.
They cover the period between 1980 and 2003. Recent developments combine the
well-established statistical models with methods that extract high-level information
for speaker characterization.
There are several ways to build speaker models using statistical methods.
They can be divided into two distinct categories: generative and discriminative
models. Generative models include mainly Gaussian Mixture Models (GMM)
and Hidden Markov Models (HMM). These methods are probability density
estimators that model the acoustic feature vectors. The discriminative models
are optimized to minimize the error on a set of training samples of the target
and non-target (impostors) classes. They include Multilayer Perceptrons (MLP)
and Support Vector Machines (SVM).
$$ p(x_t \mid \lambda) = \sum_{i=1}^{M} w_i \, N(x_t, \mu_i, \Sigma_i) \qquad (2) $$

$$ \log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(x_t \mid \lambda) \qquad (3) $$
where p(xt |λ) is computed from (2). log p(xt |λ) is used instead of p(xt |λ) in
order to avoid numerical precision problems. Note that the average log-likelihood
value is used to normalize out duration effects.
For speaker recognition applications, each state of the HMMs may represent
phones or other longer units of speech. Temporal information is encoded by
moving from one state to another according to the allowed transitions. In this case
the HMM method consists in determining for each speaker the best alignment
between the sequence of test speech vectors and the Hidden Markov Model
associated with the pronounced word or phrase. The probabilities of the HMM
process are accumulated to obtain the utterance likelihood, in similar fashion to
the GMM.
2
Usually, the feature vectors are assumed independent, so only diagonal covariance
matrices are used.
could be mentioned. Despite their discriminative power, MLPs present some
disadvantages. The main ones are that their optimal configuration is not easy to
select and that a lot of data is needed for the training and cross-validation steps.
That could explain why MLPs are not widely used for the speaker verification
task, as can be observed in recent NIST speaker recognition evaluations.
4 Decision Making
The speaker verification task consists in deciding whether the speech data collected
during the test phase belongs to the claimed identity or not. Given a speech
segment X and a claimed identity S, the speaker verification system should
choose one of the following hypotheses:
choose one of the following hypothesis:
HS : X is pronounced by S
HS̄ : X is not pronounced by S
The decision between the two hypotheses is usually based on a likelihood ratio
given by:

$$ \Lambda(X) = \frac{p(X \mid H_S)}{p(X \mid H_{\bar S})} \;\; \begin{cases} > \Theta & \text{accept } H_S \\ < \Theta & \text{accept } H_{\bar S} \end{cases} \qquad (4) $$
where p(X|HS ) and p(X|HS̄ ) are the probability density functions (also called
likelihoods) associated with the speaker S and the non-speakers S̄, respectively. Θ is
the threshold used to accept or reject HS .
In practice, HS is represented by a model λS , estimated using the training
speech from the hypothesized speaker S, and HS̄ is represented by a model λS̄ ,
estimated using a set of other speakers, that cover as much as possible the space
of the alternative hypothesis. There are two main approaches to select this set
of other speakers.
The first one consists of choosing, for each speaker S, a set of speakers S̄1 , ..., S̄N ,
called a cohort [85]. In this case each speaker will have a corresponding non-speaker
model.
A more practical approach consists of having a unique or gender dependent
set of speakers representing the non-speakers, where a single (or gender depen-
dent) non-speaker model is trained for all speakers [82]. This model is usually
trained using speech samples from a large number of representative speakers.
This model is referred in the literature as world model or Universal Background
Model(UBM) [83]. This last approach is the most commonly used in speaker
verification systems. It has the advantage of using a single non-speaker model
for all the hypothesized speakers, or two gender-dependent world (background)
models.
The likelihood ratio in (4) is then rewritten as:

$$ \Lambda(X) = \frac{p(X \mid \lambda_S)}{p(X \mid \lambda_{\bar S})} \qquad (5) $$

Often the logarithm of this ratio is used. The final score is then:

$$ \Lambda_{\log}(X) = \log p(X \mid \lambda_S) - \log p(X \mid \lambda_{\bar S}) \qquad (6) $$
The values of the likelihoods p(X|λS ) and p(X|λS̄ ) are computed using one
of the techniques described in Sect. 3.
Once the model is trained, the speaker verification system should make a deci-
sion to accept or reject the claimed identity. This last step consists of comparing
the score of the test utterance with a decision threshold.
Setting the threshold appropriately for a specific speaker verification applica-
tion is still a challenging task. The threshold is usually chosen during the de-
velopment phase, and is speaker-independent. However these kind of thresholds
do not really reflect the speaker peculiarities and the intra-speaker variability.
Furthermore, if there is a mismatch between development and test data, the
optimal operating point could be different from the pre-set threshold.
There are two main approaches to deal with the problem of automatic threshold
estimation. The first one consists of setting an a priori speaker-dependent
threshold [55], in such a way that the threshold is adjusted for each speaker in order
to compensate for score variability effects. In the second approach, score normalization
techniques (see Sect. 6.1) make the speaker-independent threshold more
robust and easier to set.
The performance of any speaker recognition system is evaluated as a function
of its error rates. There are two types of errors that occur in the verification
task: false acceptance, when the system accepts an impostor, and false
rejection, when the system rejects a valid speaker. Both types of error depend on
the decision threshold. With a high threshold, the system will be highly secure:
it will make very few false acceptances but a lot of
false rejections. If the threshold is set to a low value, the system will be more
convenient for the users: it will make few false rejections and a lot of false
acceptances. The false acceptance rate, RF A , and the false rejection rate, RF R , define
the operating point of the system. They are calculated as follows:
$$ R_{FA} = \frac{\text{number of false acceptances}}{\text{number of impostor accesses}} \qquad (7) $$

$$ R_{FR} = \frac{\text{number of false rejections}}{\text{number of target accesses}} \qquad (8) $$

The Detection Cost Function (DCF) combines the two error rates:

$$ DCF = C_{FR}\, R_{FR}\, P_{tar} + C_{FA}\, R_{FA}\, P_{imp} \qquad (9) $$

where CF R is the cost of a false rejection, CF A is the cost of a false acceptance, and Ptar
and Pimp are the a priori probabilities of targets and impostors, respectively. The
DCF should be normalized by the value CF R Ptar + CF A Pimp .
The DCF is the most commonly used measure for evaluating the performance of
operational speaker verification systems. The smaller the value of the DCF, the better
the system is for the given application and conditions. Thus, the decision threshold
should be optimized in order to minimize the DCF. This optimization is often
done during the development of the system, on a limited set of data.
The measures presented above evaluate the performance of the system at
a single operating point. However, representing the performance of the speaker
verification system over the whole range of operating points is also useful. A
way for doing this is to use a performance curve. The Detection Error Trade-
off (DET) curve [59], a variant of the Receiver Operating Characteristic (ROC)
curve [26], has been widely used for this purpose. In the DET curve the RF A is
plotted as a function of the RF R and the axes follow a normal deviate scale. This
representation allows an easy comparison of the performance of systems at
different operating points. The Equal Error Rate (EER) appears directly on this
curve as the intersection of the DET curve with the first bisector (RF A = RF R ).
Experimental results described in Sect. 7 will be reported in terms of DET
curves and Equal Error Rates.
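These quantities are straightforward to compute from sets of target and impostor scores; a sketch is given below, where the cost values and the target prior are illustrative (they are not prescribed by this chapter).

# FA/FR rates over a sweep of thresholds, the EER, and the normalized DCF.
import numpy as np

def rates(target, impostor, thresholds):
    rfa = np.array([(impostor >= t).mean() for t in thresholds])
    rfr = np.array([(target < t).mean() for t in thresholds])
    return rfa, rfr

def eer(target, impostor):
    thr = np.sort(np.concatenate([target, impostor]))
    rfa, rfr = rates(target, impostor, thr)
    i = np.argmin(np.abs(rfa - rfr))          # point where the two curves cross
    return (rfa[i] + rfr[i]) / 2

def dcf(rfa, rfr, c_fa=1.0, c_fr=10.0, p_tar=0.01):
    d = c_fr * rfr * p_tar + c_fa * rfa * (1 - p_tar)
    return d / (c_fr * p_tar + c_fa * (1 - p_tar))    # normalized DCF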
There are different factors that affect the performance of a speaker verification
system. They can be evaluated using appropriately designed corpora that
reflect the reality of the speaker characteristics and the complexity of the foreseen
application. Measuring the real progress achieved with new research methods,
and pinpointing the unsolved problems, is only possible when relevant publicly
available evaluation databases and protocols are associated with a benchmarking
framework [18]. Such a benchmarking framework should be composed of open-source
state-of-the-art algorithms defined on publicly available benchmarking
databases and protocols.
5 Evaluation Paradigm
5.1 Performance Factors
Speaker verification performance depends upon many different factors, which
can be grouped into the following categories:
Intra-Speaker Variability. Usually the speaker model is obtained using limited
speech data, which characterizes the speaker at a given time and in a given situation.
However, the voice can change over time due to aging, illness and emotions. For
these reasons the speaker model may not be representative of the speaker and
may not cover all these variabilities, which negatively affects the performance
of speaker verification systems. To deal with this problem, incremental enrollment
techniques could be used, in order to include the short- and long-term
evolution of the voice [7].
Amount of Speech Data. The amount of training data available to build the
speaker models also has a large impact on the accuracy of the systems.
Mismatch Factors. The mismatch in recording conditions between training
and testing is one of the major challenges for automatic speaker recognition over
telephone communication channels. Differences in the telephone handset, in
the transmission channel, and in the recording devices can introduce discrepancies
and decrease the accuracy of the system. Besides this, most of the statistical models
do not capture only the speaker characteristics but also the environmental
ones. Hence, the system decision may be biased if the verification environment
is different from the enrollment one.
scores) presented in Sect. 6.1 are useful to make speaker modeling more robust to
mismatched train/test conditions. The high-level features introduced in Sect. 6.2
are also important because they are supposed to be more robust to mismatched
conditions.
noises due to transmission channel effects are reduced with this method. It could
also be performed over a sliding window in order to take into account the lin-
ear channel variation inside the same recording. CMS can compensate for linear
channel variations; however, under additive noise conditions it is less effective.
It is sometimes supplemented with variance normalization [94,51]. The coeffi-
cients are transformed to fit a zero mean and a unit variance distribution, over
all speech frames or using a sliding window. The mean and variance values are
usually estimated only on frames belonging to speech.
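A sketch of this normalization (cepstral mean and variance normalization computed on the selected speech frames) follows; the array shapes are assumptions.

# Cepstral mean and variance normalization of a (n_frames, n_coeffs) matrix.
import numpy as np

def cmvn(cep, speech_mask=None):
    ref = cep if speech_mask is None else cep[speech_mask]   # speech frames only
    mu = ref.mean(axis=0)
    sigma = ref.std(axis=0) + 1e-8                            # avoid division by zero
    return (cep - mu) / sigma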
RASTA (RelAtive SpecTrA), introduced by [44], is a generalization of
CMS. It addresses the problem of a slowly time-varying linear channel in contrast
to the time-invariant channel removed by CMS. Its essence is a cepstral lifter
that removes low and high modulation frequencies.
Feature Warping was proposed by [69] in order to construct a more robust
representation of each cepstral feature distribution. This is achieved by mapping
the individual cepstral feature distributions such that they follow a normal dis-
tribution over a window of speech frames. In [97] a variant of the feature warping
is presented, called “short-time gaussianization”. It applies a linear transforma-
tion to the features before mapping them to a normal distribution. The purpose
of this linear transformation is to make the resulting features independent. Re-
sults in [69,6] have shown that feature warping slightly outperforms mean and
variance normalization but it is computationally more expensive.
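A sketch of the warping step itself is given below; for brevity it is applied over a whole utterance instead of the sliding window used in practice.

# Feature warping: map each cepstral dimension to a standard normal distribution.
import numpy as np
from scipy.stats import norm

def feature_warp(cep):
    n = cep.shape[0]
    ranks = cep.argsort(axis=0).argsort(axis=0)       # per-dimension ranks
    return norm.ppf((ranks + 0.5) / n)                # inverse normal CDF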
The Feature Mapping approach, introduced in [79], is another normaliza-
tion method, based on the idea to minimize the channel variability in the fea-
ture domain. This method focuses on mapping features from different channels
into a common channel independent feature space. The feature mapping proce-
dure is achieved through the following steps. First, a channel independent GMM
is trained using a pool of data from many different channels. Then, channel
dependent GMM are trained by adapting the channel-independent GMM us-
ing channel dependent data. Finally, the feature-mapping functions are learned
by examining how the channel independent model parameters change after the
adaptation. During the normalization step, and for each utterance, the most
likely channel dependent model is first detected and then each feature vector
in the utterance is mapped to the channel independent space using the corre-
sponding feature-mapping functions. This method showed good performances in
channel compensation but its disadvantage is that a lot of data is needed to train
the different channel models.
Joint Factor Analysis is another successful approach for modeling channel
variability [47,46]. This method is quite similar to the feature mapping with the
exception that in joint factor analysis the channel effect is modeled by a normal
distribution of the GMM supervector space whereas in the feature mapping this
effect is modeled by a discrete distribution.
The basic assumption in the joint factor analysis is that a speaker and channel-
dependent supervector can be decomposed into a sum of two supervectors, one
of which depends on the speaker and the other on the channel, and that speaker
supervectors and channel supervectors are both normally distributed.
$$ M = s + c \qquad (10) $$

$$ \Lambda_X^{norm}(\lambda) = \frac{\Lambda_X(\lambda) - \mu_S}{\sigma_S} \qquad (11) $$
found in [63]. Contrary to Znorm, Tnorm has to be performed online
during testing.
Dnorm, introduced by Ben et al. [8], deals with the problem of pseudo-
impostor data availability by generating data using the world (background) model.
Target and pseudo-impostor data are generated using Monte Carlo methods.
Variants of Score Normalization were introduced in order to reduce microphone
and transmission channel effects. Among the variants of Znorm are
handset normalization (Hnorm) [84] and channel normalization (Cnorm). Handset-
or channel-dependent normalization parameters are estimated by testing
each speaker model against a handset- or channel-dependent set of impostors. During
testing, the type of handset or channel related to the test utterance is first
detected and then the corresponding set of parameters is used for score normalization.
HTnorm, a variant of Tnorm, uses basically the same idea as
the Hnorm. The handset-dependent normalization parameters are estimated by
testing each test utterance against handset-dependent impostor models. Finally,
and in order to improve speaker verification performance, different normaliza-
tion methods may be combined, such as ZTnorm (Znorm followed by Tnorm)
or TZnorm (Tnorm followed by Znorm).
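A sketch of the two basic normalizations is given below; the impostor score arrays are assumed to have been computed beforehand (offline against the target model for Znorm, online against a cohort of impostor models for Tnorm).

# Znorm and Tnorm score normalization.
import numpy as np

def znorm(score, impostor_scores_of_model):
    # statistics of impostor utterances scored against the target model
    mu, sigma = impostor_scores_of_model.mean(), impostor_scores_of_model.std()
    return (score - mu) / sigma

def tnorm(score, cohort_scores_of_test):
    # statistics of the test utterance scored against a cohort of impostor models
    mu, sigma = cohort_scores_of_test.mean(), cohort_scores_of_test.std()
    return (score - mu) / sigma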
Besides the score normalization techniques described above, another way to
improve text-independent speaker verification results is to use speech recognition
methods in order to extract high-level information.
Acoustic features reflect the spectral properties of speech and convey infor-
mation about the configuration of the vocal tract. They are extracted from
short time overlapping windows (typically 10-20 ms).
Prosodic information is derived from the acoustic characteristics of speech and
includes pitch, duration and intensity.
Segmental features are related to phonetic and data driven speech units, de-
noted here as “pseudo-phonetic” units. The phoneme is the smallest speech
unit that distinguishes two words. The phonetic characteristics are related
to the way the speaker pronounces different phonemes.
Idiolectal information is unique to each speaker and characterizes her/his
speaking style. It is manifested by patterns of word selection and grammar.
Every individual has an idiolect; however, it may depend on the context in
which the speaker is talking (such as the person to whom we are talking, the
subject, or the emotional situation).
Dialogic features are based on the conversational patterns and define the man-
ner in which the speaker communicates. Every speaker could be characterized
by the frequency and the duration of his turns in a conversation. These traits
are also dependent on the conversational context.
Semantic characteristics of the speech are related to the meaning of the
utterance. The kind of subject frequently discussed by the speaker could
also provide information on her/his identity.
7 Case Studies
In this section, some examples of factors that influence the results and illustrate
the difficulties of text-independent speaker verification systems are given. The
evaluation databases and reference evaluation protocols underlying these exam-
ples are presented in Sect. 7.1. A reference (benchmarking) system using well-
defined development data is also presented. Such a reference system is useful to
point out the importance of different parameters. Two tasks are used in our case
studies. The first task is denoted as 1conv4w-1conv4w, and uses one conversa-
tion of approximately five minutes total duration to build the speaker model and
another conversation for testing. For the second task, denoted here as long train-
ing/short test conditions, or 8conv4w-1conv4w task, approximately 40 minutes
total duration are used to build the speaker model and another conversation for
testing.
The experimental results presented in this section are grouped in two parts.
The first part, presented in Sect. 7.2, illustrates the difficulties of tuning a variety
of parameters in order to achieve well performing GMM based systems. The
short train/test task of the NIST–2005–SRE evaluation data is used in this
part. The second part, presented in Sect. 7.3, shows the improvements that can
be achieved when high-level information extracted with data-driven methods
is combined with well-known baseline GMM systems. For these experiments,
the best performing GMM from the first part are combined with data-driven
high-level information systems. The results are shown on NIST–2005 and 2006
evaluation data, but using the long training/short test task, which is more suited
for extracting high-level speaker-specific information.
4 http://www.ldc.upenn.edu
English trials only. The degradation of the results when multilingual data are
used is shown in Fig. 7.
The first series of experiments, reported in Sect. 7.2, are carried out on the
core task of the NIST–2005 speaker recognition evaluation with a total of 646
speaker models (372 females and 274 males). In this task denoted as 1conv4w-
1conv4w, one two-channel (4-wire) conversation, of approximately five minutes
total duration is used to build the speaker model and another conversation for
testing. There are 31 243 (2771 targets’ access + 28 472 impostors’ access) trials
in this task, including 20 907 (2148 targets’ access + 18 759 impostors’ access)
English trials.
For the second set of experiments, related to using data-driven speech seg-
mentation to extract high-level speaker information, NIST–2005 and NIST–2006
evaluation data are used. These experiments are illustrated on the 8conv4w-
1conv4w task, where eight two-channel (4-wire) conversations, of approximately
40 min. total duration are used to build the speaker model and another conver-
sation for testing. In this task, 700 (402 females and 298 males) speaker models
for the NIST–2006 and 497 (295 females and 202 males) for NIST–2005 are
present. The systems are evaluated on 16 053 trials (1 671 targets’ access +
14 382 impostors’ access) for NIST–2005 and on 17 387 trials (2 139 targets’
access and 15 248 impostors’ access) for NIST–2006.
In this part more details and illustrative experimental results with the reference
benchmarking system are given. For the evaluation data, the short train/testing
task of NIST–2005–SRE is used. The proposed reference system is composed
of state-of-the-art open-source ALIZE GMM based system, the LIA–SpkDet
tools [12], and well defined development and evaluation data (see Sect. 7.1).
In this section we will first summarize the reported figures and their main
characteristics, giving more details about each experimental configuration later.
In Fig. 1 the performance of the baseline reference system, using the LFCC+δ
features, the reference world model, and a GMM with 512 gaussian mixtures is
reported, with an Equal Error Rate (EER) of 10.3%.
The influence of different feature configurations is shown in Fig. 2, where bet-
ter results than the reference system are obtained when the δ–energy coefficients
are also used in addition to the LFCC+δ, improving the results from 10.3% to 9.9%.
The influence of using relevant speech data for building the world (background)
models is illustrated in Fig. 3, leading to an improved EER of 9% when additional
Fisher data are added for building the background (world) models, compared with the
reported 9.9% of the reference system.
The influence of different score normalization methods is illustrated in Fig. 5,
leading to improved EER of 8.6% (compared with the improved world model
using Fisher data). As these normalization techniques are computationally ex-
pensive, we will not use them for the remaining comparisons.
The influence of the number of Gaussians in the mixture is shown in Fig. 4.
Using 2048 Gaussians leads to an improved EER of 8.5%, to be compared with
the 9% of the reference system described in Fig. 3.
Using all the trials (and not only the pooled English trials) degrades the
performance, as shown in Fig. 7.
The influence of the duration of training data is shown in Fig. 6, where the
results using NIST–2005 evaluation tasks with short and long training data are
reported.
Figure 8 illustrates the influence of mismatched train/test conditions. NIST–
2006 evaluation data were used for these experiments, because the relevant labels
were available only for this data set.
More details about the above cited experiments that influence the performance
of baseline GMM reference system are given below.
N = w1 + (g ∗ α ∗ w2 ) (12)
Table 1. Percentage of frames belonging to speech after applying the frame removal
procedure. The corresponding speaker verification system performances are plotted in
Fig. 1.
Fig. 1. DET plot showing the effect of frame selection. The reference GMM system is
defined as LFCC+δ features, a frame removal parameter with α = 0, a reference world
model, and a GMM with 512 gaussians. The results are reported on NIST–2005 using
only the English trials (20 907 trials: 2 148 targets’ access and 18 759 impostors’ access).
6 http://www.nist.gov/speech/tests/spk/2005/sre-05 evalplan-v6.pdf
Fig. 2. DET plot showing the effect of the feature selection. The GMM system (512
gaussians) is evaluated on NIST–2005 using only the English trials.
Fig. 3. DET plot showing the effect of background model speech data. The baseline
reference system is defined with LFCC+δ features, a frame removal parameter with
α = 0, a reference world model, and a GMM with 512 gaussians. Results are reported
on NIST–2005 using only the English trials.
vectors clearly degrades the results. One possible explanation of this result is
that the log–energy is more sensitive to the channel effect and some kind of
normalization should be used to eliminate this effect.
Fig. 4. DET plot showing the comparison of performances using 512 and 2048 gaussians
in the GMM, on NIST–2005 evaluation data using only the English trials
Fig. 5. DET plot showing the effect of the different score normalization techniques. The
GMM system with 512 gaussians are evaluated on NIST–2005 using only the English
trials.
Fig. 6. DET plot showing the comparison of performances using one and eight conver-
sations to build the speaker model. The GMM system (2048 gaussians) is evaluated on
NIST–2005 using only the English trials.
Fig. 7. DET plot showing language effects. The GMM systems (2048 gaussians) are
evaluated on NIST–2005.
Fig. 8. DET plot showing the comparison of performances by transmission type. The
GMM systems (2048 gaussians) are evaluated on NIST–2006 using only English trials.
Fig. 9. DET plot showing the fusion results of low-level acoustic GMM with high-level
systems based on ALISP speech segmentation. The fusion results are shown on the
8conv4w-1conv4w task of NIST–2005 using English trials.
The systems described here use, in the first stage, data-driven Automatic Lan-
guage Independent Speech Processing (ALISP) tools [19] for the segmentation
step. This technique is based on units acquired during a data-driven segmenta-
tion, where no phonetic transcription of the corpus is needed. In this work we
use 65 ALISP classes. Each class is modeled by a left-to-right HMM having three
emitting states and containing up to 8 gaussians each. The number of gaussians
7 http://htk.eng.cam.ac.uk/
8 They can easily be used with the HTK toolkit.
Fig. 10. DET plot showing the fusion results of the same systems as in Fig. 9 with the
same development data, but on the 8conv4w-1conv4w task of NIST–2006 using English
trials
References
1. A. Adami, R. Mihaescu, D. A. Reynolds, and J. J. Godfrey. Modeling prosodic
dynamics for speaker recognition. In Proc. ICASSP, April 2003.
2. W. Andrews, M. Kohler, J. Campbell, and J. Godfrey. Phonetic, idiolectal, and
acoustic speaker recognition. Speaker Odyssey Workshop, 2001.
3. R. Auckenthaler, M. J. Carey, and H. Lloyd-Thomas. Score normalization for
text-independent speaker verification systems. Digital Signal Processing, 10, 2000.
4. R. Auckenthaler, E. S. Parris, and M. J. Carey. Improving a GMM speaker verifi-
cation system by phonetic weighting. Proc. ICASSP, 1999.
5. B. Baker, R. Vogt, and S. Sridharan. Gaussian mixture modelling of broad phonetic
and syllabic events for text-independent speaker verification. Proc. Eurospeech,
september 2005.
6. C. Barras and J. L. Gauvain. Feature and score normalization for speaker verifi-
cation of cellular data. In Proc. ICASSP, April 2003.
7. C. Barras, S. Meignier, and J. L. Gauvain. Unsupervised online adaptation for
speaker verification over the telephone. Proc. Odyssey, 2004.
8. M. Ben, R. Blouet, and F. Bimbot. A monte-carlo method for score normaliza-
tion in automatic speaker verification using kullback-leibler distances. 2002 IEEE
International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
2002.
9. F. Bimbot, M. Blomberg, L. Boves, D. Genoud, H.-P. Hutter, C. Jaboulet, J.W.
Koolwaaij, J. Lindberg, and J.-B. Pierrot. An overview of the cave project research
activities in speaker verification. Speech Communication, 31:158–180, 2000.
10. F. Bimbot, J.F. Bonastre, C.Fredouille, G. Gravier, I. Magrin-Chagnolleau,
S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A.
Reynolds. A tutorial on text-independent speaker verification. Eurasip Journal
On Applied Signal Processing, 4:430–451, 2004.
11. K. Boakye and B. Peskin. Text-constrained speaker recognition on a text-
independent task. Proc. Odyssey, pages 129–134, June 2004.
12. J.-F. Bonastre, F. Wils, and S. Meignier. Alize, a free toolkit for speaker recog-
nition. In Proceedings of the 2005 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP 2005), 1, March 2005.
13. R.N. Bracewell. The Fourier Transform and Its Applications. McGraw-Hill, New
York, NY USA, 1965.
14. J. Campbell and D. Reynolds. Corpora for the evaluation of speaker recognition
systems. Proc. ICASSP, 1999.
15. J. Campbell, D. Reynolds, and R. Dunn. Fusing high- and low level features for
speaker recognition. In Proc. Eurospeech, 2003.
16. W. Campbell, D. Sturim, and D. Reynolds. Support vector machines using gmm
supervectors for speaker verification. IEEE Signal Processing Letters, 13:5, 2006.
17. W. M. Campbell, J. P Campbell, D. Reynolds, D. A. Jones, and T. R. Leek.
Phonetic speaker recognition with support vector machines. In Proc. Neural In-
formation Processing Systems Conference in Vancouver, pages 361–388, 2003.
18. G. Chollet, G. Aversano, B. Dorizzi, and D. Petrovska-Delacrétaz. The first biose-
cure residential workshop. 4th International Symposium on Image and Signal
Processing and Analysis-ISPA2005, pages 198–212, September 2005.
19. G. Chollet, J. Černocký, A. Constantinescu, S. Deligne, and F. Bimbot. Towards
ALISP: a proposal for Automatic Language Independent Speech Processing. In
Keith Ponting, editor, NATO ASI: Computational Models of Speech Pattern Processing.
Springer Verlag, 1999.
20. S.B. Davis and P. Mermelstein. Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences. In Proc. ICASSP,
ASSP-28, no.4:357–366, August 1980.
21. N. Dehak and G. Chollet. Support vector gmms for speaker verification. Proc.
Odyssey, June 2006.
22. A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data
via the EM algorithm. J. Roy. Stat. Soc. B, 39(1):1–38, 1977.
23. G. Doddington. Speaker recognition based on idiolectal differences between speak-
ers. Eurospeech, 4:2517–2520, 2001.
24. X. Dong and W. Zhaohui. Speaker recognition using continuous density support
vector machines. ELECTRONICS LETTERS, 37(17), 2001.
25. J.P. Eatock and J.S. Mason. A quantitative assessment of the relative speaker
discriminant properties of phonemes. Proc. ICASSP, 1:133–136, 1994.
26. J. Egan. Signal detection theory and ROC analysis. Academic Press, 1975.
27. A. El Hannani and D. Petrovska-Delacrétaz. Segmental score fusion for alisp-based
gmm text-independent speaker verification. In the book, Advances in Nonlinear
Speech Processing and Applications, Edited by G. Chollet, A. Esposito, M. Faundez-
Zanuy, M. Marinaro, pages 385–394, 2004.
28. A. El Hannani and D. Petrovska-Delacrétaz. Exploiting high-level information
provided by alisp in speaker recognition. Non Linear Speech Processing Workshop
(NOLISP 05), 19-22 April 2005.
29. A. El Hannani and D. Petrovska-Delacrétaz. Improving speaker verification system
using alisp-based specific GMMs. In proc. of Conference on Audio- and Video-
Based Biometric Person Authentication (AVBPA), July 20 - 22 2005.
30. A. El Hannani, D. T. Toledano, D. Petrovska-Delacrétaz, A. Montero-Asenjo, and
Jean Hennebert. Using data-driven and phonetic units for speaker verification.
In proc. of ODYSSEY06, The Speaker and Language Recognition Workshop, 28-30
June 2006.
31. Gunar Fant. Acoustic Theory of Speech Production. Mouton , The Hague, The
Netherlands, 1970.
32. L. Ferrer, K. Sönmez, and S. Kajarekar. Class-dependent score combination for
speaker recognition. Proc. Interspeech, September 2005.
33. S. Furui. Cepstral analysis technique for automatic speaker verification. IEEE
Transactions on Acoustics, Speech and Signal Processing, 29(2):254–272, 1981.
34. S. Furui. Comparison of speaker recognition methods using static features and
dynamic features. IEEE Transactions on Acoustics, Speech and Signal Processing,
vol. 29(3):342–350, 1981.
76. A. Preti, N. Scheffer, and J.-F. Bonastre. Discriminant approaches for GMM based
speaker detection systems. In Workshop on Multimodal User Authentication,
2006.
77. Thomas F. Quatieri. Speech Signal Processing. Prentice Hall Signal Processing
Series, 2002.
78. L. Rabiner and B.H. Juang. Fundamentals of Speech Recognition. Prentice Hall,
1993.
79. D. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin,
D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, J. Jones, and B. Xiang. The
supersid project: Exploiting high-level information for high-accuracy speaker recog-
nition. In Proc. ICASSP, April 2003.
80. D. A. Reynolds. A gaussian mixture modeling approach to text-independent
speaker identification. Ph.D. Thesis, Georgia Institute of Technology, 1992.
81. D.A. Reynolds. Experimental evaluation of features for robust speaker identifica-
tion. IEEE Transactions on Speech and Audio Processing, 2(3):639–643, 1994.
82. D.A. Reynolds. Automatic speaker recognition using gaussian mixture speaker
models. Lincoln Lab. Journal 8, (2):173–191, 1995.
83. D.A. Reynolds. Comparison of background normalization methods for text-
independent speaker verification. Proc. Eurospeech, pages 963–966, 1997.
84. D.A Reynolds, T.F. Quatieri, and R.B. Dunn. Speaker verification using adapted
gaussian mixture models. DSP, Special Issue on the NIST’99 evaluations, vol.
10(1-3):19–41, January/April/July 2000.
85. A. E. Rosenberg, J. DeLong, C. H. Lee, B. H. Juang, and F. K. Soong. The use of
cohort normalized scores for speaker verification. In International Conference on
Speech and Language Processing, pages 599–602, November 1992.
86. M. Schmidt and H. Gish. Speaker identification via support vector machines. In
Proc. ICASSP, 1996.
87. B. Schölkopf and A.J. Smola. Learning with kernels: Support vector machines,
regularization, optimization and beyond. MIT Press, 2001.
88. Y. A. Solewicz and M. Koppel. Enhanced fusion methods for speaker verification.
Proc. SPECOM’2004: 9th Conference Speech and Computer, pages 388–392, 2004.
89. A. Solomonoff, C. Quillen, and W. Campbell. Channel compensation for svm
speaker recognition. Proc. Odyssey, 2004.
90. K. Sönmez, E. Shriberg, L. Heck, and M. Weintraub. Modeling dynamic prosodic
variation for speaker verification. Proc. ICSLP98, 1998.
91. D. Sturim, D. Reynolds, R. Dunn, and T. Quatieri. Speaker verification using
text-constrained gaussian mixture models. Proc. ICASSP, vol. 1:677–680, 2002.
92. N. Tishby. On the application of mixture AR hidden Markov models to text indepen-
dent speaker recognition. IEEE Transactions on Signal Processing, vol. 39(3):563–
570, March 1991.
93. V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York,
1995.
94. O. Viikki and K. Laurila. Cepstral domain segmental feature vector normalization
for noise robust speech recognition. Speech Communication, vol. 25:133–147, 1998.
95. V. Wan and S. Renals. SVMSVM: Support vector machine speaker verification
methodology. Proc. IEEE ICASSP, 2:221–224, 2003.
96. V. Wan and S. Renals. Speaker verification using sequence discriminant support
vector machines. IEEE Trans. on Speech and Audio Processing, 13:203–210, 2005.
97. B. Xiang, U. Chaudhari, J. Navratil, G. Ramaswamy, and R. Gopinath. Short-time
gaussianization for robust speaker verification. Proc. ICASSP, vol. 1:681–684, 2002.
Nonlinear Predictive Models: Overview and Possibilities
in Speaker Recognition
1 Introduction
Recent advances in speech technologies have produced new tools that can be used to
improve the performance and flexibility of speaker recognition. While there are few
degrees of freedom or alternative methods when using fingerprint or iris identification
techniques, speech offers much more flexibility and different levels at which to perform
recognition: the system can force the user to speak in a particular manner, different for
each attempt to enter. Also, with voice input, the system has other degrees of freedom,
such as the use of knowledge/codes that only the user knows, or dialectal/semantic
traits that are difficult to forge.
This paper offers an overview of the state of the art in speaker recognition, with
special emphasis on the pros and cons, and the current research lines based on non-
linear speech processing. We think that speaker recognition is far from being a
technology where all the possibilities have already been explored.
1.1 Biometrics
compared with other biometric traits, such as fingerprint, iris, hand geometry, face,
etc. This is because it can be seen as a mixture of physical and learned traits. We can
consider physical traits to be those which are inherent to people (iris, face, etc.), while
learned traits are those related to skills acquired throughout life and environment (signa-
ture, gait, etc.). For instance, your signature is different depending on whether you were
born in a Western or an Asian country, and your speech accent is different if you have
grown up in Edinburgh or in Seattle; although you might speak the same language,
prosody or vocabulary will probably differ (i.e. the relative frequency of use of
common words might vary depending on the geographical or educational back-
ground).
2 Speaker Recognition
Speaker recognition can be performed in two different ways:
Speaker identification: In this approach no identity is claimed from the speaker. The
automatic system must determine who is talking. If the speaker belongs to a prede-
fined set of known speakers, it is referred to as closed-set speaker identification.
In practice, however, the set of speakers known (learned) by the system is much smaller
than the potential number of users that can attempt to enter. The more general situa-
tion, where the system has to cope with speakers that perhaps are not modeled
inside the database, is referred to as open-set speaker identification. Adding a “none-
of-the-above” option to closed-set identification gives open-set identification. The
system performance can be evaluated using an identification rate. For open set identi-
fication, it is expected that some users do not belong to the database and an additional
decision is necessary: “Not a known person”. In this case, a threshold is needed in or-
der to detect this situation. However, most of the published systems refer to the
“closed set” identification (the input user is certainly in the database).
Speaker verification: In this approach the goal of the system is to determine whether
the person is who he/she claims to be. This implies that the user must provide an
identity and the system just accepts or rejects the users according to a successful or
unsuccessful verification. Sometimes this operation mode is named authentication or
detection. The system performance can be evaluated using the False Acceptance Rate
(FAR, those situations where an impostor is accepted) and the False Rejection Rate
(FRR, those situations where a speaker is incorrectly rejected), also known in detec-
tion theory as False Alarm and Miss, respectively. This framework gives us the possi-
bility of distinguishing between the discriminability of the system and the decision
bias. The discriminability is inherent to the classification system used, and the decision
bias is related to the preferences/necessities of the user in relation to the relative
importance of the two possible mistakes (misses vs. false alarms) that can be made in
speaker verification. This trade-off between both errors has to be
usually established by adjusting a decision threshold. The performance can be plotted
in a ROC (Receiver Operating Characteristic) or in a DET (Detection Error Trade-off)
plot [3]. The DET curve gives uniform treatment to both types of error and uses a scale
for both axes which spreads out the plot, better distinguishes different well-performing
systems, and usually produces plots that are close to linear. Note also that the ROC
curve is symmetric with respect to the DET, i.e. it plots the hit rate instead of the
miss probability; the DET uses a scale that expands the extreme parts of the curve,
which are the parts that give the most information about the system performance.
For this reason the speech community prefers DET over ROC plots. Figure 1 shows
an example of a DET plot, and figure 2 shows a classical ROC plot.
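To make the threshold trade-off concrete, the sketch below computes the false-acceptance rate, false-rejection rate and an approximate EER from synthetic target and impostor score distributions (all values are illustrative, not taken from any system in this chapter):

import numpy as np

def far_frr_eer(target_scores, impostor_scores):
    # Sweep a decision threshold over all observed scores and return the
    # false-acceptance rate, false-rejection rate and approximate equal error rate.
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # impostors accepted
    frr = np.array([(target_scores < t).mean() for t in thresholds])     # targets rejected
    i = np.argmin(np.abs(far - frr))
    return far, frr, (far[i] + frr[i]) / 2.0

rng = np.random.default_rng(1)
tgt = rng.normal(2.0, 1.0, 500)    # synthetic target trial scores
imp = rng.normal(0.0, 1.0, 2000)   # synthetic impostor trial scores
far, frr, eer = far_frr_eer(tgt, imp)
print(f"EER ~ {100 * eer:.1f}%")
# A DET plot displays frr versus far on axes that expand the low-error region;
# a ROC plot displays (1 - frr) versus far on linear axes.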
In both cases (identification and verification), speaker recognition techniques can
be split into two main modalities:
Text independent: This is the general case, where the system does not know the text
spoken by the person. This operation mode is mandatory for those applications where the
user does not know that he/she is being evaluated for recognition purposes, such as in
forensic applications, or to simplify the use of a service where the identity is inferred
in order to improve the human/machine dialog, as is done in certain banking services.
This allows more flexibility, but it also increases the difficulty of the problem. If nec-
essary, speech recognition can provide knowledge of spoken text. In this mode one
can use indirectly the typical word co-occurrence of the speaker, and therefore it also
characterizes the speaker by a probabilistic grammar. This co-occurrence model is
known as n-grams, and gives the probability that a given set of n words are uttered
consecutively by the speaker. This can distinguish between different cul-
tural/regional/gender backgrounds, and therefore complement the speech information,
even if the speaker speaks freely. This modality is also interesting in the case of
speaker segmentation, when there are several speakers present and there is an interest
in segmenting the signal depending on the active speaker.
Text dependent: This operation mode implies that the system knows the text spoken
by the person. It can be a predefined text or a prompted text. In general, the knowledge
of the spoken text allows the system performance to be improved with respect to the
previous category. This mode is used for applications with strong control over user
input, or in applications where a dialog unit can guide the user.
One of the critical issues for speaker recognition is the presence of channel variabil-
ity from training to testing, that is, different signal-to-noise ratios, kinds of microphone,
evolution with time, etc. For human beings this is not a serious problem, because of
the use of different levels of information. However, this affects automatic systems in a
[DET curve figure: Miss probability versus False Alarm probability (in %), with the EER line and annotated regions for high security, balance, and user comfort; arrows indicate the effect of decreasing the threshold and the direction of better performance.]
Fig. 1. Example of a DET plot for a speaker verification system (dotted line). The Equal Error
Rate (EER) line shows the situation where the False Alarm probability equals the Miss
probability (balanced performance). Of course, one of the two error rates can be more
important (high-security applications versus those where we do not want to annoy the user
with a high rejection/miss rate). If the system curve is moved towards the origin, smaller error
rates are achieved (better performance). If the decision threshold is reduced, we get higher
False Acceptance/Alarm rates.
[ROC curve figure: True positive rate versus False positive rate, with the EER line and annotated regions for high security, balance, and user comfort; arrows indicate the effect of decreasing the threshold and the direction of better performance.]
Fig. 2. Example of a ROC plot for a speaker verification system (dotted line). The Equal Error
Rate (EER) line shows the situation where the False Alarm probability equals the Miss
probability (balanced performance). Of course, one of the two error rates can be more
important (high-security applications versus those where we do not want to annoy the user
with a high rejection/miss rate). If the system curve is moved towards the upper left zone,
smaller error rates are achieved (better performance). If the decision threshold is reduced,
higher False Acceptance/Alarm rates are achieved. It is interesting to observe, comparing
figures 1 and 2, that True positive = (1 – Miss probability) and False positive = False Alarm.
[Figure: levels of speaker information, from spectral low-level cues (physical traits, easy to extract automatically) up to prosodic, phonetic, idiolectal, and dialogic cues.]
human ear, improve recognition rates. This is due to the fact that the speaker gener-
ates signals in order to be understood/recognized; therefore, an analysis tailored to the
way the human ear works yields better performance.
Prosodic: Prosodic features are stress, accent and intonation measures. The easiest
way to estimate them is by means of pitch, energy, and duration information. Energy
and pitch can be used in a similar way to the short-term characteristics of the previous
level, with a GMM model. Although these features on their own do not provide as
good results as spectral features, some improvement can be achieved by combining both
kinds of features. Obviously, different data-fusion levels can be used [6]. On the other
hand, there is more potential in using long-term characteristics. For instance, human
beings trying to imitate the voice of another person usually try to replicate energy and
pitch dynamics, rather than instantaneous values. Thus, it is clear that this approach
has potential. Figure 4 shows an example of a speech sentence and its intensity and
pitch contours. This information has been extracted using the Praat software, which
can be downloaded from [10] (a minimal extraction sketch is also given after this list
of levels). The use of prosodic information can improve the robustness of the system,
in the sense that it is less affected by the transmission channel than the spectral
characteristics, and therefore it is a potential candidate feature to be used as a
complement of the spectral information in applications where the microphone can
change or the transmission channel is different from the one used in the training
phase. The prosodic features can be used at two levels: at the lower one, one can use
the direct values of the pitch, energy or duration; at a higher level, the system might
compute co-occurrence probabilities of certain recurrent patterns and check them at
the recognition phase.
Phonetic: It is possible to characterize speaker-specific pronunciations and speaking
patterns using phone sequences. It is known that the same phonemes can be pronounced
in different ways without changing the semantics of an utterance. This variability in
the pronunciation of a given phoneme can be exploited by recognizing each variant of
each phoneme and afterwards comparing the frequency of co-occurrence of the pho-
nemes of an utterance (N-grams of phone sequences) with the N-grams of each
speaker. This might capture the dialectal characteristics of the speaker, which might
include geographical and cultural traits. The models can consist of N-grams of
[Figure panels: waveform, intensity (dB), and pitch (Hz) contours plotted against time (s).]
Fig. 4. Speech sentence “Canada was established only in 1869” and its intensity and pitch con-
tour. While uttering the same sentence, different speakers would produce different patterns, i.e.
syllable duration, profile of the pitch curve.
phone sequences. A disadvantage of this method is the need for an automatic speech
recognition system, and the need for modelling the confusion matrix (i.e. the probabil-
ity that a given phoneme is confused with another one). In any case, as dialectal
databases [11] are available for the main languages, the use of this kind of infor-
mation is nowadays feasible.
Idiolectal (syntactic): Recent work by G. Doddington [12] has found useful
speaker information in sequences of recognized words. These sequences are called
n-grams and, as explained above, they consist of the statistics of co-occurrence of n
consecutive words. They reflect the way a given speaker uses the language.
The idea is to recognize speakers by their word usage. It is well known that some
persons use, and overuse, certain words. Sometimes when we try to imitate them we
need to emulate neither their sound nor their intonation: just repeating their “favorite”
words is enough. The algorithm consists of working out n-grams from speaker train-
ing and testing data. For recognition, a score is derived from both n-grams (using, for
instance, the Viterbi algorithm). This kind of information is a step further than classi-
cal systems, because we add a new element to the classical security elements (some-
thing we have, we know or we are): something we do. A strong point of this method
is that it takes into account not only the use of vocabulary specific to the user, but
also the context and the short-time dependence between words, which is more difficult
to imitate.
Dialogic: When we have a dialog with two or more speakers, we would like to seg-
ment the parts that correspond to each speaker. Conversational patterns are useful
for determining when a speaker change has occurred in a speech signal (segmentation)
and for grouping together speech segments from the same speaker (clustering).
The integration of different levels of information, such as spectral, phonological,
prosodic or syntactic, is difficult due to the heterogeneity of the features. Different
techniques are available for combining the different sources of information with an
adequate weighting of the evidence, and if possible the integration has to be robust
with respect to the failure of one of the features. A common framework can be
Bayesian modeling [13], but there are also other techniques such as data fusion,
neural nets, etc.
In recent years, improvements in technology related to automatic speech recogni-
tion and the availability of a wide range of databases have made it possible to in-
troduce high-level features into speaker recognition systems. Thus, it is possible to
use phonological aspects specific to the speaker, or dialectal aspects which might
model the region/background of the speaker as well as his/her educational back-
ground. Also, the use of statistical grammar modelling can take into account the dif-
ferent word co-occurrence patterns of each speaker. An important aspect is the fact
that these new possibilities for improving speaker recognition systems have to be
integrated in order to take advantage of the higher levels of information that are
available nowadays.
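As an illustration of how word co-occurrence statistics can be turned into a speaker score, a minimal add-alpha smoothed bigram model follows (our own sketch with toy data; a real system would use ASR transcripts and a background model trained on many speakers):

from collections import Counter
import math

def bigram_model(transcripts, alpha=1.0):
    # Estimate add-alpha smoothed word-bigram probabilities from a list of
    # word sequences (e.g. ASR transcripts of one speaker's training data).
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for words in transcripts:
        vocab.update(words)
        contexts.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    V = max(len(vocab), 1)

    def logprob(words):
        return sum(math.log((bigrams[(a, b)] + alpha) / (contexts[a] + alpha * V))
                   for a, b in zip(words[:-1], words[1:]))
    return logprob

# Toy usage: score a test transcript against a speaker model and a background model
speaker_lp = bigram_model([["you", "know", "i", "mean", "you", "know"]])
background_lp = bigram_model([["i", "think", "that", "is", "right"]])
test = ["you", "know", "i", "mean"]
llr = speaker_lp(test) - background_lp(test)  # higher = more speaker-like word usage
print(llr)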
The next sections of this paper are devoted to feature extraction using non-linear
features. For this goal, we start with a short overview of nonlinear speech proc-
essing from a general point of view.
The applications of nonlinear predictive analysis have mainly focused on speech
coding, because it achieves greater prediction gains than LPC. The first proposed
systems were those of [20] and [21], which proposed a CELP scheme with different
nonlinear predictors that improve the SEGSNR of the decoded signal.
Three main approaches have been proposed for the nonlinear predictive analysis of
speech. They are:
a) Nonparametric prediction: it does not assume any model for the nonlinearity.
It is a quite simple method, but the improvement over linear predictive meth-
ods is lower than with nonlinear parametric models. An example of non-
parametric prediction is a codebook that tabulates several (input, output) pairs
(Eq. 1); the predicted value can then be computed using the nearest neighbour
inside the codebook (a small sketch is given after Eq. 1). Although this method
is simple, only low prediction orders can be used. Some examples of this
system can be found in [20], [22-24].
$(x[n-1],\ \hat{x}[n])$    (1)
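A minimal sketch of this nearest-neighbour, codebook-based prediction (order 1, toy sine-wave signal; all names are our own):

import numpy as np

def codebook_predict(history, codebook_inputs, codebook_outputs):
    # Nonparametric prediction: find the tabulated input pattern closest to the
    # current history of samples and return its stored output (cf. Eq. 1).
    dists = np.sum((codebook_inputs - history) ** 2, axis=1)
    return codebook_outputs[np.argmin(dists)]

# Toy usage with prediction order 1: (x[n-1], x[n]) pairs taken from a sine wave
x = np.sin(0.3 * np.arange(200))
cb_in, cb_out = x[:-1, None], x[1:]  # tabulated (input, output) pairs
print(codebook_predict(np.array([x[100]]), cb_in, cb_out), x[101])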
[Plot: prediction gain Gp versus initialization index (0–100) for MLP 10x2x1, MLP 10x4x1, LPC-10, and LPC-25 predictors.]
Fig. 5. Prediction gain for a given frame using 100 different random initializations for Multi-
Layer Perceptrons and linear analysis
[Histogram of prediction gain Gp (values between 12 and 30) for the 10x2x1 and 10x4x1 MLP architectures.]
Fig. 6. Prediction gain (Gp) histograms for 500 random initializations of the MLP 10x2x1 and
MLP 10x4x1 architectures
However, there is a strong limitation when trying to compare two different sets of
features. While the comparison of two linear sets such as (2) and (3) is straightfor-
ward (4), this is not the case with nonlinear sets.
$x[n] \cong \hat{x}[n] = \sum_{k=1}^{P} a_k\, x[n-k] = a_1 x[n-1] + a_2 x[n-2]$    (2)

$x[n] \cong \hat{x}'[n] = \sum_{k=1}^{P} a'_k\, x[n-k] = a'_1 x[n-1] + a'_2 x[n-2]$    (3)
$d\left[\{a_1, a_2\},\ \{a'_1, a'_2\}\right]$    (4)
For instance, figure 7 represents two MLPs which have been trained using the same
speech frame. It is clear that both networks implement exactly the same transfer func-
tion. However, a direct comparison of their weights will reveal that they differ. Obviously,
many more analogous examples can be constructed.
[Diagram: two MLPs with one hidden layer; the hidden neurons and their weights ω1–ω6 are interchanged between the two networks.]
Fig. 7. Example of two different MLP, which perform the same transfer function. It is interest-
ing to observe that the hidden layer neurons order has been interchanged. Thus a direct weight
comparison would reveal that they are different.
With a nonlinear prediction model based on neural nets it is not possible to compare
the weights of two different neural nets in the same way that we compare two differ-
ent LPC vectors obtained from different speech frames. This is due to the fact that
infinitely many sets of different weights representing the same model exist, so direct
comparison is not feasible. For this reason, one way to compare two different neural
networks is by means of a measure defined over the residual signal of the nonlinear
predictive model. To improve performance over classical methods, a combination
with a linear parameterization must be used.
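The permutation ambiguity of figure 7 is easy to verify numerically; the following small demonstration (our own illustration) builds a one-hidden-layer MLP, permutes its hidden units, and checks that the transfer function is unchanged while the weights differ:

import numpy as np

def mlp(x, w1, b1, w2, b2):
    # One-hidden-layer MLP with tanh hidden units and a linear output
    return np.tanh(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(2)
w1, b1 = rng.normal(size=(10, 4)), rng.normal(size=4)  # 10 inputs, 4 hidden units
w2, b2 = rng.normal(size=4), rng.normal()

perm = [2, 0, 3, 1]                                    # reorder the hidden neurons
w1p, b1p, w2p = w1[:, perm], b1[perm], w2[perm]

x = rng.normal(size=10)
print(np.allclose(mlp(x, w1, b1, w2, b2), mlp(x, w1p, b1p, w2p, b2)))  # True:
# identical transfer function, yet the weight matrices differ element-wise,
# so a direct weight comparison between the two networks fails.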
The main reason why it is difficult to apply nonlinear predictive models based on
neural nets in recognition applications is that the nonlinear predictive models cannot
be compared directly. The comparison between two predictive models can alternatively
be done in the following way: the same input is presented to several models, and the
decision is made based on the output of each system, instead of the structural
parameters of each system.
For speaker recognition purposes we propose to model each speaker with a code-
book of nonlinear predictors based on MLPs. This is done in the same way as clas-
sical speaker recognition based on vector quantization [30]. Figure 8 summarizes the
proposed scheme, where each person is modeled by a codebook. This VQ system has
recently been applied to online signature recognition, and is especially suitable for
situations with a small amount of training vectors [42].
[Diagram: a feature extractor produces the vectors x1 ... xL, which are quantized with P speaker codebooks (CB 1 ... CB P); the accumulated distances dist1 ... distP are compared and the minimum determines the decision.]
Fig. 8. Proposed scheme for identification based on VQ. There is one codebook per person.
We found in [28,29] that the residual signal is less efficient than the LPCC coeffi-
cients: both for linear and nonlinear predictive analysis the recognition errors are
around 20%, while the results obtained with LPCC are around 6%. On the other hand,
we found that the residual signal is uncorrelated with the vocal tract information
(LPCC coefficients) [28]. For this reason both measures can be combined in order
to improve the recognition rates.
We proposed:
a) The use of an error measure defined over the LPC residual signal, (instead of a
parameterization over this signal) combined with a classical measure defined over
LPCC coefficients. The following measures were studied:
measure 1 (1): Mean Square Error (MSE) of the LPCC.
measure 2 (2): Mean Absolute difference (MAD) of the LPCC.
b) The use of a nonlinear prediction model based on neural nets, which has been
successfully applied to a waveform speech coder [19]. It is well known that the LPC
model is unable to describe the nonlinearities present in the speech, so useful informa-
tion is lost with the LPC model alone. The following measures were studied, defined
over the residual signal:
measure 3 (3): MSE of the residue.
measure 4 (4): MAD of the residue.
measure 5 (5): Maximum absolute value (MAV) of the residue.
measure 6 (6): Variance (σ) of the residue.
Our recognition algorithm is a Vector Quantization approach. That is, each speaker
is modeled with a codebook in the training process. During the test, the input sentence
is quantized with all the codebooks, and the codebook which yields the minimal ac-
cumulated error indicates the recognized speaker.
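A minimal sketch of this VQ-based identification rule (toy random codebooks and features, hypothetical names; in practice the codebooks would be trained, e.g., with the splitting algorithm described next):

import numpy as np

def accumulated_vq_error(frames, codebook):
    # Quantize every feature frame with the codebook and accumulate the
    # distance to the nearest centroid.
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()

def identify(frames, codebooks):
    # Closed-set identification: the speaker whose codebook yields the minimal
    # accumulated quantization error is selected.
    return int(np.argmin([accumulated_vq_error(frames, cb) for cb in codebooks]))

# Toy usage: two random "speaker" codebooks and a test utterance closer to the second
rng = np.random.default_rng(3)
codebooks = [rng.normal(0.0, 1.0, size=(32, 12)), rng.normal(2.0, 1.0, size=(32, 12))]
test_frames = rng.normal(2.0, 1.0, size=(200, 12))
print(identify(test_frames, codebooks))  # expected: 1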
The codebooks are generated with the splitting algorithm. Two methods have been
tested for splitting the centroids:
a) The standard deviation of the vectors assigned to each cluster.
b) A hyperplane computed with the covariance matrix.
The recognition algorithm with the MLP codebooks is the following:
1. The test sentence is partitioned into frames.
For each speaker:
2. Each frame is filtered with all the MLPs of the codebook (centroids) and the
lowest Mean Absolute Error (MAE) of the residual signal is stored. This process is
repeated for all the frames, and the MAE of each frame is accumulated to obtain
the MAE of the whole sentence.
3. Step 2 is repeated for all the speakers, and the speaker that gives the lowest
accumulated MAE is selected as the recognized speaker.
This procedure is based on the assumption that if the model has been derived from
the same speaker as the test sentence, then the residual signal of the predictive analysis
will be lower than for a different speaker not modeled during the training process.
Unfortunately, the results obtained with this system were not good enough, even
when a generalization of the Lloyd iteration was computed to improve the codebook
of MLPs.
Efficient algorithm
In order to reduce the computational complexity and to improve the recognition rates,
a novel scheme that consists of the pre-selection of the K speakers nearest to the test
sentence was proposed in [28,29]. The error measure based on the nonlinear
predictive model is then computed only for these speakers (in this case a reduction of
3.68% in error rate over the classical LPC cepstrum parameterization was achieved).
The LPCC used for clustering the frames is used as a pre-selector of the recognized
speaker. That is, the input sentence is quantized with the LPCC codebooks and the K
codebooks that produce the lowest accumulated error are selected. Then, the input
sentence is quantized with the K nonlinear codebooks, and the accumulated distance
of the nonlinear codebook is combined with the LPCC distance.
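A minimal sketch of this two-stage scoring (the β-weighted fusion rule and all numbers are our own assumptions, not the exact combination used in [28,29]):

import numpy as np

def preselect_and_fuse(lpcc_errors, nonlinear_error, K=5, beta=0.5):
    # Stage 1: the LPCC codebook errors pre-select the K closest speakers.
    # Stage 2: the costlier nonlinear (MLP-residual) measure is computed only
    # for those K speakers and fused with the LPCC distance by a weighted sum.
    candidates = np.argsort(lpcc_errors)[:K]
    fused = {int(s): (1.0 - beta) * lpcc_errors[s] + beta * nonlinear_error(int(s))
             for s in candidates}
    return min(fused, key=fused.get)

# Toy usage with made-up accumulated errors for five speakers
lpcc = np.array([10.2, 9.1, 15.7, 9.0, 12.3])
nl = {0: 4.0, 1: 2.5, 3: 3.9}
print(preselect_and_fuse(lpcc, lambda s: nl.get(s, 9.9), K=3))  # recognized speaker index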
As mentioned previously, neural predictive models are not easily comparable in terms
of their parameters (i.e. the weights). However, it is well known that the neural weights
can be considered a representation of the input vector, even if it is not a unique one. To
overcome this limitation, we propose a model optimized with original constraints: the
Neural Predictive Coding (NPC) model [31, 32].
The Neural Predictive Coding (NPC) model is basically a non-linear extension of
the well-known LPC encoder. As in the LPC framework with the Auto-Regressive
(AR) model, the code vector is estimated by prediction error minimization. The
main difference lies in the fact that the model is non-linear (connectionist based):
$\hat{y}_k = F(y_k) = \sum_j a_j\, \sigma(w_j^T y_k)$    (5)
where $F$ is the prediction function realized by the neural model and $\hat{y}_k$ is the predicted
sample.

$f(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{2n+1} g_f\left(\sum_{i=1}^{n} \lambda_i\, Q_j(x_i)\right)$    (6)

where $\{\lambda_i\}_{i=1}^{n}$ are universal constants that do not depend on $f$, and $\{Q_j\}_{j=1}^{2n+1}$ are universal
functions.
Description
The NPC model is a Multi-Layer Perceptron (MLP) with one hidden layer. Only the
output layer weights are used as the coding vector instead of all the neural weights.
For that, we assume that the function F realized by the model, under convergence
The learning phase is realized in two stages (cf. figure 9). First, the parameteriza-
tion phase involves the learning of all the weights by the prediction error minimiza-
tion criterion:
$Q = \sum_{k=1}^{K} (y_k - \hat{y}_k)^2 = \sum_{k=1}^{K} (y_k - F(y_k))^2$    (8)

with $y_k$ the speech signal, $\hat{y}_k$ the predicted speech signal, $k$ the sample index, and
$K$ the total number of samples.
In this phase, only the first layer weights $w$ (which are the NPC encoder parame-
ters) are kept. Since the NPC encoder is set up with the parameters defined in the pre-
vious phase, the second phase, called the coding phase, involves the computation of the
output layer weights $a$, representing the phoneme coding vector. This is also done by
prediction error minimization, but only the output layer weights are updated. One can
notice that the output function is linear (cf. Eq. 5), so simple LMS-like algorithms
could be used; here, for consistency with the parameterization phase, it is done with
the backpropagation algorithm.
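A minimal numerical sketch of the coding phase (our own simplification of Eqs. (5) and (8): the hidden weights w stand in for the result of the parameterization phase, the prediction order and layer sizes are arbitrary, and plain gradient descent replaces the backpropagation machinery):

import numpy as np

def npc_code(y, w, order=10, n_steps=200, lr=0.01):
    # Coding phase: the hidden-layer weights w are frozen (they stand in for the
    # parameterization phase); only the output weights a are fitted by
    # prediction-error minimization on the signal y. a is the coding vector.
    X = np.stack([y[k - order:k] for k in range(order, len(y))])  # sample histories
    t = y[order:]                                                 # samples to predict
    H = np.tanh(X @ w)                      # hidden activations, fixed during coding
    a = np.zeros(w.shape[1])
    for _ in range(n_steps):                # LMS-like descent (the output is linear)
        err = H @ a - t
        a -= lr * H.T @ err / len(t)
    return a

rng = np.random.default_rng(4)
w = rng.normal(scale=0.3, size=(10, 6))     # hypothetical parameterization-phase weights
signal = np.sin(0.25 * np.arange(400)) + 0.05 * rng.normal(size=400)
print(npc_code(signal, w))                  # 6-dimensional coding vector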
Fig. 9. The Neural Predictive Coding (NPC) model. The learning is realized in two stages,
namely the parameterization and the coding phases.
The initial weights are determined by a simple LPC analysis of order λ. Once the
initialization is accomplished, the coding process (prediction error minimization)
proceeds as in the original NPC coding phase, using the backpropagation al-
gorithm.
Initialization of non-linear models by linear ones has already been investigated
[39], but with matrix decomposition methods (SVD, QR, ...). The main limitation
lies in the multiplicity of the solutions. Ref. [40] proposes a new method which guar-
antees a single solution.
Fig. 10. Linear initialization process. The LPC vector code is used as an initial feature vector
for the NPC coding phase.
Tab. 1 reports the speaker identification results obtained with the Covariance Ma-
trix method for different parameterization methods [32]. The results highlight the
benefit of both non-linear modeling (LPC vs. NPC) and of data-driven initialization
of non-linear models such as the NPC (random vs. linear initialization).
4 Conclusions
Acknowledgement
This work has been supported by FEDER and the Spanish grant MCYT TIC2003-
08382-C05-02. I want to acknowledge the European project COST-277 “Nonlinear
Speech Processing”, which has been acting as a catalyst for the development of nonlin-
ear speech processing since mid-2001. I also want to thank Prof. Enric
Monte-Moreno for his support and useful discussions during these years.
References
1. Faundez-Zanuy M. “On the vulnerability of biometric security systems” IEEE Aerospace
and Electronic Systems Magazine Vol.19 nº 6, pp.3-8, June 2004
2. Faundez-Zanuy M. “Biometric recognition: why not massively adopted yet?”. IEEE Aero-
space and Electronic Systems Magazine. Vol.20 nº 8, pp.25-28, August 2005
3. Martin A., Doddington G., Kamm T., Ordowski M., and Przybocki M., “The DET curve in
assessment of detection performance”, V. 4, pp.1895-1898, European speech Processing
Conference Eurospeech 1997
4. Furui S., Digital Speech Processing, synthesis, and recognition. Marcel Dekker, 1989.
5. Campbell J. P., Reynolds D. A. and Dunn R. B. “Fusing high- and low-level features for
speaker recognition”. Eurospeech 2003 Geneva.
6. Faundez-Zanuy M. “Data fusion in biometrics”. IEEE Aerospace and Electronic Systems
Magazine Vol. 20 nº 1, pp.34-38. January 2005.
7. Faundez-Zanuy M., Monte-Moreno E., IEEE Aerospace and Electronic Systems Maga-
zine. Vol. 20 nº 5, pp 7-12, May 2005.
8. Reynolds D. A., Rose R. C. “Robust text-independent speaker identification using Gaus-
sian mixture speaker models”. IEEE Trans. On Speech and Audio Processing, Vol. 3 No 1,
pp. 72-83 January 1995
9. Cristianini, N., Shawe-Taylor, J., An Introduction to Support Vector Machines, Cambridge
University Press, (2000).
10. http://www.praat.org
11. Ortega-García J., González-Rodríguez J. and Marrero-Aguiar V. "AHUMADA: A Large
Speech Corpus in Spanish for Speaker Characterization and Identification". Speech com-
munication Vol. 31 (2000), pp. 255-264, June 2000
12. Doddington G., “Speaker Recognition based on Idiolectal Differences between Speakers,”
Eurospeech, vol. 4, p. 2521-2524, Aalborg 2001
13. Manning C. D., Schütze H. Foundations of Statistical Natural Language Processing, MIT
Press; 1st edition (June 18, 1999).
14. Thyssen, J., Nielsen, H., Hansen S.D.: Non-linear short-term prediction in speech coding.
IEEE ICASSP 1994, pp.I-185 , I-188.
15. Townshend, B.: Nonlinear prediction of speech. IEEE ICASSP-1991, Vol. 1, pp.425-428.
16. Teager, H.M.: Some observations on oral air flow vocalization. IEEE trans. ASSP, vol.82
pp.559-601, October 1980
17. Kubin, G.: Nonlinear processing of speech. Chapter 16 on Speech coding and synthesis,
editors W.B. Kleijn & K.K. Paliwal, Ed. Elsevier 1995.
18. Thyssen, J., Nielsen, H., Hansen, S.D.: Non-linearities in speech. Proceedings IEEE work-
shop Nonlinear Signal & Image Processing, NSIP'95, June 1995
19. Faundez-Zanuy M., “Nonlinear speech processing: Overview and possibilities in speech
coding”, Lecture Notes in Computer Science LNCS Vol. 3445, pp.16-45. G. Chollet et al.
Ed. 2005
20. Kumar, A., Gersho, A.: LD-CELP speech coding with nonlinear prediction. IEEE Signal
Processing letters Vol. 4 Nº4, April 1997, pp.89-91
21. Wu, L., Niranjan, M., Fallside, F.: Fully vector quantized neural network-based code-
excited nonlinear predictive speech coding. IEEE transactions on speech and audio proc-
essing, Vol.2 nº 4, October 1994.
22. Wang, S., Paksoy E., Gersho, A.: Performance of nonlinear prediction of speech. Proceed-
ings ICSLP-1990, pp.29-32
23. Lee Y.K., Johnson, D.H.: Nonparametric prediction of non-gaussian time series. IEEE
ICASSP 1993, Vol. IV, pp.480-483
24. Ma, N., Wei, G.:Speech coding with nonlinear local prediction model. IEEE ICASSP 1998
vol. II, pp.1101-1104.
25. Pitas, I., Venetsanopoulos, A. N.: Non-linear digital filters: principles and applications.
Kluwer ed. 1990
26. Lippmann, R. P.,: An introduction to computing with neural nets. IEEE trans. ASSP, 1988,
Vol.3 Nº 4, pp.4-22
27. Jain, A.K., Mao, J.: Artificial neural networks: a tutorial. IEEE Computer, March 1996,
pp. 31-44.
28. Faundez-Zanuy M., Rodriguez D., “Speaker recognition using residual signal of linear and
nonlinear prediction models”. 5th International Conference on spoken language process-
ing. Vol.2 pp.121-124. ICSLP’98, Sydney 1998
29. Faundez-Zanuy M., “Speaker recognition by means of a combination of linear and nonlin-
ear predictive models “. Vol. 2 pp. 763-766. EUROSPEECH’99, Budapest 1999
30. Soong F. K., Rosenberg A. E., Rabiner L. R. and Juang B. H. " A vector quantization ap-
proach to speaker recognition". pp. 387-390. ICASSP 1985
31. B. Gas, J.L. Zarader, C. Chavy, M. Chetouani, “Discriminant neural predictive coding ap-
plied to phoneme recognition“, Neurocomputing, Vol. 56, pp. 141-166, 2004.
32. M. Chetouani, M. Faundez-Zanuy, B. Gas, J.L. Zarader, “Non-linear speech feature ex-
traction for phoneme classification and speaker recognition“, Lecture Notes in Computer
Science LNCS Vol. 3445, pp.340-350. G. Chollet et al. Ed. 2005.
33. W.B. Kleijn, “Signal processing representations of speech”, IEICE Trans. Inf. And Syst.,
E86-D, 3, 359-376, March (2003).
34. A.N. Kolmogorov, “On the representation of continuous functions of several variables by
superposition of continuous functions of one variable and addition,” Dokl, 679-681
(1957).
35. V. Kurkova,”Kolmogorov’s theorem is relevant,” Neural Computation, 3(4), pp. 617-622.
36. R. Hecht-Nielsen, “Kolmogorov’s mapping neural network existence theorem“, Proc. of
International Conference on Neural Networks, pp. 11-13 (1987).
37. C. Bishop, “Neural Networks for Pattern Recognition”, Oxford University Press (1995).
38. B. Gas, M. Chetouani, J.L. Zarader and F. Feiz, “The Predictive Self_Organizing Map:
application to speech features extraction,” WSOM’05 (2005).
39. T.L. Burrows, “Speech processing with linear and neural networks models”, PhD Cam-
bridge, 1996.
40. M. Chetouani, M. Faundez-Zanuy, B. Gas, J.L. Zarader, “A new nonlinear speaker param-
eterization algorithm for speaker identification“, Speaker Odyssey’04: Speaker Recogni-
tion Workshop, May 2004, Toledo, Spain.
41. Arun A. Ross, Karthik Nandakumar, and Anil K. Jain Handbook of multibiometrics.
Springer Verlag 2006
42. M. Faundez-Zanuy “On-line signature recognition based on VQ-DTW”. Pattern Recogni-
tion 40 (2007) pp.981-992. Elsevier. March 2007.
SVMs for Automatic Speech
Recognition: A Survey
1 Introduction
Hidden Markov Models (HMMs) are, undoubtedly, the most employed core
technique for Automatic Speech Recognition (ASR). During the last decades,
research in HMMs for ASR has brought about significant advances and, con-
sequently, the HMMs are currently very accurately tuned for this application.
Nevertheless, we are still far from achieving high-performance ASR systems. One
of the most relevant problems of the HMM-based ASR technology is the loss of
performance due to the mismatch between training and testing conditions, or,
in other words, the design of robust ASR systems.
A lot of research efforts have been dedicated to tackle the mismatch problem;
however, the most successful solution seems to be using larger databases, trying
to embed in the training set all the variability of speech and speakers. At the same
time, speech recognition community is aware of the HMM limitations, but the few
attempts to move toward other paradigms did not work out. In particular, some al-
ternative approaches, most of them based on Artificial Neural Networks (ANNs),
were proposed during the late eighties and early nineties ([1, 2, 3, 4] are some ex-
amples). Some of them dealt with the ASR problem using predictive ANNs, while
others proposed hybrid ANN/HMM approaches. Nowadays, however, the prepon-
derance of HMMs in practical ASR systems is a fact.
In this chapter we review some of the new alternative approaches to the ASR
problem; specifically, those based on Support Vector Machines (SVMs) [5, 6].
One of the fundamental reasons to use SVMs was already highlighted by the
ANN-based proposals: it is well known that HMMs are generative models, i.e.,
the acoustic-level decisions are taken based on the likelihood that the currently
evaluated pattern had been generated by each of the models that comprise the
ASR system. Nevertheless, conceptually, these decisions are essentially classifi-
cation problems that could be approached, perhaps more successfully, by means
of discriminative models. Certainly, algorithms for enhancing the discrimination
abilities of HMMs have also been devised. However, the underlying model keeps
being generative.
There are other reasons to propose the use of SVMs for ASR. Some of them
will be discussed later; now, we focus on their excellent capacity of general-
ization, since it might improve the robustness of ASR systems. SVMs rely on
maximizing the distance between the samples and the classification boundary.
Unlike others, such as neural networks or some modifications of the HMMs that
minimize the empirical risk on the training set, SVMs minimize also the struc-
tural risk [7], which results in a better generalization ability. In other words,
given a learning problem and a finite training database, SVMs properly weight
the learning potential of the database and the capacity of the machine.
The maximized distance, known as the margin, is responsible for the out-
standing generalization properties of the SVMs: the maximum margin solution
allows the SVMs to outperform most nonlinear classifiers in the presence of
noise, which is one of the longstanding problems in ASR. In a noise-free system,
this margin is related to the maximum distance a correctly classified sample
should travel to be considered as belonging to the wrong class. In other words,
it indicates how much noise can be added to the clean samples before they are misclassified.
Nevertheless, the use of SVMs for ASR is not straightforward. In our opinion,
three are the main difficulties to overcome, namely: 1) SVMs are originally static
Other approaches for speech recognition use Predictive Neural Networks, one per class, to predict a certain acoustic vector given a time window of observations centered on the current one [2], [3]. In this way, Predictive Neural Networks capture the temporal correlations between acoustic vectors.
Finally, in the hybrid ANN/HMM system proposed in [14], ANNs are trained to estimate phone posterior probabilities and these probabilities are used as feature vectors for a conventional GMM-HMM recognizer. This approach is called Tandem Acoustic Modeling and it achieves good results in context-independent systems.
Numerous studies show that hybrid systems achieve recognition results comparable to those of equivalent HMM-based systems (with a similar number of parameters), or even better in some tasks and conditions. They also behave better when only a small amount of training data is available. However, hybrid ANN/HMM systems have not yet been widely applied to speech recognition, very likely because some problems remain open, for example the design of optimal network architectures or the difficulty of devising a joint training scheme for both ANNs and HMMs.
3 SVM Fundamentals
3.1 SVM Formulation
An SVM is essentially a binary nonlinear classifier capable of deciding whether an input vector x belongs to class 1 (the desired output would then be y = +1) or to class 2 (y = −1). This algorithm was first proposed in [15] in 1992, and it is a nonlinear version of a much older linear algorithm, the optimal hyperplane decision rule (also known as the generalized portrait algorithm), which was introduced in the sixties.
Given a set of separable data, the goal is to find the optimal decision function. It can easily be seen that there is an infinite number of solutions to this problem, in the sense that they all separate the training samples with zero errors. However, since we look for a decision function able to generalize to unseen samples, we need an additional criterion to find the best solution among those with zero errors. If we knew the probability densities of the classes, we could apply the maximum a posteriori (MAP) criterion to find the optimal solution. Unfortunately, in most practical cases this information is not available, so we adopt another, simpler criterion: among those functions without training errors, we choose the one with the maximum margin, this margin being the distance between the closest sample and the decision boundary defined by that function. Of course, optimality in the sense of maximum margin does not necessarily imply optimality in the sense of minimizing the number of test errors, but it is a simple criterion that yields solutions which, in practice, turn out to be the best ones for many problems [16].
For the linearly separable case, the maximum-margin hyperplane is obtained by solving

$$\min_{\mathbf{w},b} \; \frac{1}{2}\|\mathbf{w}\|^2 .$$

For the non-separable case, slack variables $\xi_i$ are introduced, leading to the soft-margin primal problem

$$\min_{\mathbf{w},b,\xi_i} \; L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i ,$$
$$\text{subject to } y_i\,(\mathbf{w}^T\phi(\mathbf{x}_i)+b) \ge 1-\xi_i, \qquad \xi_i \ge 0, \; i=1,\dots,N, \tag{3}$$

whose dual formulation is

$$\max_{\alpha_i} \; L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j\, \phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j),$$
$$\text{subject to } \sum_{i=1}^{n}\alpha_i y_i = 0 \;\text{ and }\; 0 \le \alpha_i \le C. \tag{4}$$

Expressing the inner products in the feature space through a kernel function $K(\cdot,\cdot)$, the resulting decision function is

$$f(\mathbf{x}) = \sum_{i=1}^{N}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) + b. \tag{6}$$
It is worth mentioning that there are some conditions that a function must satisfy to be used as a kernel (Mercer's conditions) [17]; in practice, they can be reduced to checking that the kernel matrix is symmetric and positive semi-definite.
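As an aside, this check can be carried out numerically. The following minimal numpy sketch (ours, not taken from the surveyed works) builds an RBF Gram matrix from toy feature vectors and tests symmetry and positive semi-definiteness; the function names and the tolerance are illustrative assumptions.

```python
import numpy as np

def rbf_gram_matrix(X, gamma=0.1):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def is_valid_kernel_matrix(K, tol=1e-10):
    """A Gram matrix is admissible if it is symmetric and positive semi-definite."""
    symmetric = np.allclose(K, K.T, atol=tol)
    psd = np.all(np.linalg.eigvalsh(0.5 * (K + K.T)) >= -tol)
    return symmetric and psd

X = np.random.default_rng(0).normal(size=(20, 12))   # 20 toy acoustic vectors
print(is_valid_kernel_matrix(rbf_gram_matrix(X)))    # True for a proper RBF kernel
```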
As already discussed in the Introduction and in the previous section, SVMs are state-of-the-art tools for solving classification problems and seem very promising from the speech recognition perspective. They offer a discriminative solution to the pattern classification problem involved in ASR. Furthermore, the maximum-margin SVM solution exhibits an excellent generalization capability, which might notably improve the robustness of ASR systems.
In fact, the improved discrimination ability of SVMs has attracted the attention of many speech technologists. Though this paper focuses on speech recognition, it is worth noting that SVMs have already been employed in speaker identification [19] and verification [20], or to improve confidence measures that can help in dialogue systems [21], among other applications.
However, their application to ASR is by no means straightforward. What follows is a review of the most important problems, which has motivated the structure of the present section.
• The variable time duration of the speech utterances: Automatic speech recognition involves the solution of a pattern classification problem. However, the variable time duration of speech signals has prevented ASR from being approached as a simple static classification problem. In fact, this has been for many decades one of the fundamental problems faced by the speech processing community and the main reason for the success of HMMs. The main problem stems from the fact that conventional kernels can only deal with (sequences of) vectors of fixed length. Standard parameterization techniques, on the other hand, generate variable-length sequences of feature vectors depending on the time duration of each speech utterance.
Different approaches have been proposed to deal with the variable time duration of the acoustic speech units. Basically, solutions can be divided into three groups: 1) those that perform a previous dimensional (time) normalization to fit the SVM input; 2) those that explore string-related or normalizing kernels [5] to adapt the SVMs so that they can take variable-dimension vectors as inputs; and 3) those that avoid this problem by working in a framewise manner. As we will see later in section 4.2, the latter is especially well suited for continuous speech recognition, while the first two are more appropriate for lower complexity tasks and will be addressed in section 4.1.
• Multiclass SVMs: ASR is a multiclass problem, whereas in its original formulation an SVM is a binary classifier. Although some of the proposed approaches to multiclass SVMs reformulate the SVM equations to consider all classes at once, this option is computationally very expensive. A more usual approach to cope with this limitation involves combining a number of binary SVMs into a multiclass classifier by means of a subsequent voting scheme. Two different versions of this method are usually considered. The first consists of comparing each class against all the rest (1-vs-all), while in the second each class is confronted with every other class separately (1-vs-1). Although the number of SVMs is greater for the 1-vs-1 approximation (namely, k(k−1)/2 vs. k SVMs, with k denoting the number of classes), the smaller training set needed for each SVM in the 1-vs-1 solution leads to a smaller computational effort with comparable accuracy rates [22] (a minimal voting sketch is given after this list).
• The size of the databases: most SVM implementations cannot deal with the huge databases typically used in medium- and high-complexity ASR tasks.
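As a side illustration of the 1-vs-1 voting scheme mentioned in the list above, the sketch below trains one binary SVM per class pair and decides by majority voting. It assumes scikit-learn's SVC as the binary classifier; the class name and parameters are illustrative and not taken from the cited works.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

class OneVsOneSVM:
    """1-vs-1 multiclass SVM: k(k-1)/2 binary classifiers plus majority voting."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = {}
        for a, b in itertools.combinations(self.classes_, 2):
            mask = np.isin(y, (a, b))                    # only samples of the two classes
            self.models_[(a, b)] = SVC(kernel="rbf", C=1.0).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        votes = np.zeros((len(X), len(self.classes_)), dtype=int)
        index = {c: i for i, c in enumerate(self.classes_)}
        for clf in self.models_.values():
            for row, label in enumerate(clf.predict(X)):
                votes[row, index[label]] += 1            # each binary SVM casts one vote
        return self.classes_[np.argmax(votes, axis=1)]
```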
Having reviewed the fundamental challenges, we devote the next subsections to the main solutions described in the literature, from the simplest tasks, such as isolated phoneme, letter or word recognition (low-complexity ASR tasks), to approaches to connected digits and continuous speech recognition (medium-complexity ASR tasks).
high and rising. A fixed number of measures of the pitch evolution is chosen in this case. However, for the classification of Thai vowels they also divide each vowel into three regions.
• In [25], the authors evaluate the performance of SVMs, showing advantages over GMMs (Gaussian Mixture Models) in both vowel-only and phone classification tasks. It is worth noting that a significant difference is observed in the problem of length adaptation between these two tasks. In the vowel case, it is acknowledged that, regardless of the duration of each utterance, the acoustic representations are almost constant. Therefore, simple features such as the formant frequencies or LPC coefficients corresponding to any time window are representative of the whole sequence. However, representing the variations taking place in non-vowel utterances is essential for obtaining an adequate input to SVMs. Thus, the triphone model approach has again been applied in this case, segmenting the frames obtained for each phone into three regions in the ratio 3-4-3 and subsequently averaging the features corresponding to the resulting regions (a sketch of this kind of length normalization is given after this list).
• Similar distinctions have been observed in [26], where the performance of classical HMMs and SVMs as sub-word unit recognizers is compared for two different languages: 41 monophone units are classified in a Japanese corpus and 86 consonant-vowel units are considered for an Indian language. In this case, two different strategies have been devised to provide the SVMs with a fixed-length input. For the Japanese monophones, a technique similar to that proposed in [25] has been used: the frames comprising each monophone are divided into a fixed number of segments, an averaged feature vector is obtained for each segment, and the segment averages are concatenated to form the input vector for the SVM classifier. For the Indian consonant-vowel classification, however, a different approach has been designed to account for the variations of the acoustic characteristics of the signal during the consonant-vowel transition. In this case the fixed-length patterns are obtained by linearly elongating or compressing the feature sequence duration. For both the Indian and the previously mentioned Japanese tasks, the SVMs have shown better performance than HMMs with the standard MFCCs (Mel-Frequency Cepstral Coefficients) [27] plus energy, delta and acceleration coefficients.
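As referenced in the list above, here is a minimal numpy sketch of the segment-averaging length normalization; the 3-4-3 region ratio follows the description of [25], while the function name and the handling of very short phones are our own simplifications.

```python
import numpy as np

def fixed_length_input(frames, ratios=(3, 4, 3)):
    """Map a variable-length sequence of feature frames (T x D) to a fixed-length
    vector: split it into contiguous regions in the given ratio, average the
    frames of each region, and concatenate the region means."""
    T = frames.shape[0]
    assert T >= len(ratios), "need at least one frame per region"
    cuts = np.round(np.cumsum(ratios) / np.sum(ratios) * T).astype(int)
    cuts[-1] = T
    starts = np.concatenate(([0], cuts[:-1]))
    return np.concatenate([frames[s:e].mean(axis=0) for s, e in zip(starts, cuts)])

# A 37-frame phone with 13 MFCCs per frame becomes a 39-dimensional SVM input.
print(fixed_length_input(np.random.randn(37, 13)).shape)   # (39,)
```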
explained variable window size method using different window lengths based on
the duration of the phoneme being classified. Therefore they concatenate 5 win-
dows of the same size chosen from the set {32, 64, 128, 256, 400} covering the
whole phoneme.
Another possible solution is shown in [28, 30, 31], where the non-uniform distribution of analysis instants provided by the internal state transitions of an HMM with a fixed number of states and a Viterbi decoder is used for dimensional normalization. The rationale behind this proposal is that uniform resampling methods are applied without any consideration of the information (or lack of information) that the speech analysis segments provide. By selecting the utterance segments in which the signal is changing, it is hoped that a larger amount of information is preserved in the feature vector.
Related to the previous approach, the authors of [32] acknowledge that the classification error patterns of SVM and HMM classifiers can be different and thus their combination could result in a gain in performance. They assess this statement on a classification task of consonant-vowel units of speech in several Indian languages, obtaining a marginal gain by using a sum-rule combination scheme of the two classifiers' evidences. As for feature length normalization, they select segments of fixed duration around the vowel onset point, i.e., the instant at which the consonant ends and the vowel begins.
[Figure: dynamic time alignment grid between two sequences (e.g., the template "HELLO" and the observed sequence "HEELLOO"), with time on both axes; each cell value L(i,j) is computed from its neighbours L(i−1,j) and L(i,j−1) along the alignment path.]
of the sequences in the instant k, we must find the solution to the new inner product:

$$K_{DTA}(X, Y) = X \circ Y = \max_{\psi_I,\psi_J} \frac{1}{M_{\psi}} \sum_{k=1}^{L} m(k)\, \mathbf{x}^T_{\psi_I(k)} \cdot \mathbf{y}_{\psi_J(k)},$$
$$\text{subject to } 1 \le \psi_I(k) \le \psi_I(k+1) \le |X|, \qquad 1 \le \psi_J(k) \le \psi_J(k+1) \le |Y|, \tag{11}$$
the DTAK solution automatically finds the best reference templates using the max-margin criterion.
Effectively, if we look at equation (5) in section 3, we see that only those templates with an associated αi ≠ 0 are relevant and contribute to determining the separating boundary. Only a few templates will have a non-zero αi, and these will be the closest to the decision boundary. Now, the support vectors are support sequences or templates.
Furthermore, the algorithm not only selects the appropriate templates that define the decision boundary but also the number of them that minimizes the structural risk, and this is accomplished by giving an appropriate value to the parameter C in equation (3). Unfortunately, we do not have a method to calculate the best value for this parameter a priori, so we must resort to cross-validation.
With the previous formulation it is now easy to consider the generalization that allows us to find the separating border in a higher-dimensional space (the feature space) by means of a nonlinear kernel such as an RBF. We have said that, basically, DTAK consists of using DTW as the kernel of an SVM. However, a generalization consisting of performing the time warping in the feature space can be considered. In other words, in equation (11) we could use a kernel function (for example, an RBF) instead of a conventional dot product, and the DTAK kernel would then have the following form:
$$K_{sDTA}(X, Y) = \phi(X) \circ \phi(Y) = \max_{\psi_I,\psi_J} \frac{1}{M_{\psi}} \sum_{k=1}^{L} m(k)\, K_{RBF}\!\left(\mathbf{x}_{\psi_I(k)}, \mathbf{y}_{\psi_J(k)}\right). \tag{14}$$
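To make the idea concrete, the following sketch computes a DTAK-style kernel value by dynamic programming over frame-level RBF similarities. The step weights (2 for a diagonal move, 1 otherwise) and the normalization by the total sequence length are one common choice and should be taken as assumptions rather than the exact formulation of [33, 34].

```python
import numpy as np

def dtak_kernel(X, Y, gamma=0.5):
    """Dynamic Time-Alignment Kernel between two frame sequences X (n x D) and
    Y (m x D): dynamic programming maximizes the accumulated frame-level RBF
    similarity along a monotonic alignment path, with weight 2 for diagonal
    steps and 1 otherwise, normalized by the total path weight (n + m)."""
    n, m = len(X), len(Y)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared frame distances
    k = np.exp(-gamma * d2)                               # frame-level RBF similarities
    G = np.full((n + 1, m + 1), -np.inf)
    G[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i, j] = max(G[i - 1, j] + k[i - 1, j - 1],
                          G[i - 1, j - 1] + 2.0 * k[i - 1, j - 1],
                          G[i, j - 1] + k[i - 1, j - 1])
    return G[n, m] / (n + m)

# Two utterances of different length yield a single scalar kernel value.
print(dtak_kernel(np.random.randn(30, 13), np.random.randn(45, 13)))
```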
$$\mathbf{u}^T K_s \mathbf{u} = \mathbf{u}^T \left(K^{(1)} + \cdots + K^{(L)}\right) \mathbf{u} = \mathbf{u}^T K^{(1)} \mathbf{u} + \cdots + \mathbf{u}^T K^{(L)} \mathbf{u} \ge 0, \tag{17}$$
Experimental Results
Results for the well-known SpeechDat-4000 database are presented in [31]. The
whole database is not used: specifically, only the isolated-digit utterances are
used for the experiments. The reported results depend on the noise level: on
the one hand, the DTAK-based system achieves excellent performance, clearly
superior to that achieved by the HMM-based system, for low SNRs; on the other
hand, it incurs some performance losses for high SNRs. The improvements
due to DTAK (using either linear or RBF kernels) with respect to HMMs are
statistically significant for white noise at 3, 6 and 9 dB and for F16-plane noise
at 3 and 6 dB. On the contrary, the HMM-based system outperforms the DTAK-
based one in clean conditions and several noisy cases at 12 dB. In the remaining
conditions, which correspond to medium SNRs, the system performances do not
exhibit statistically significant differences. Please refer to [31] for more details
about the DTAK-based system and the experimental setup.
In summary, the reported results [31] show that SVMs exhibit a robust be-
haviour, as expected. In particular, the DTAK-based system turns out to be
effective in noisy scenarios. In fact, the advantage due to the DTAK algorithm is
higher as the noise conditions worsen. On the other hand, direct application of
DTAK-based systems to continuous speech recognition is by no means straight-
forward, as the length of the sequences in eq. (13) must be known. In our opinion,
some alternative segmentation techniques such as that proposed in [36], should
be revisited to deal with this limitation.
samples that an SVM can deal with. Nevertheless, the very valuable character-
istics of SVM classifiers have encouraged several authors to try to solve these
problems.
As briefly mentioned in a previous section, some authors [42] have tried to
overcome the problem of the variability of duration of speech utterances using
HMMs to perform a time segmentation prior to classification. Other works cope
with the variability of duration of speech utterances by embedding either an
HMM [38] or a Dynamic Time Warping algorithm [33] in the kernel of the SVM.
It is not easy, however, to apply these last two techniques to the problem of
continuous speech because a previous word (or phoneme) segmentation of the
utterance is still required. Another solution to overcome the mentioned difficul-
ties is proposed in [43]. This method consists of classifying each speech frame as belonging to a basic class (a phone) and then using the Token Passing algorithm [44] to go from the frame-level classifications to the recognized word chain.
This is a similar approach to that presented in [45] by Cosi. The main difference
is that Cosi uses Neural Networks (NNs) instead of SVMs.
In this section the approaches due to Ganapathiraju [42], who proposed a
hybrid HMM/SVM system, and Padrell [43], who presented a pure SVM-based
ASR system, are explained in detail.
Although it will not be described in this chapter, it is worth briefly mentioning a segmentation method for continuous speech presented in [46]. In particular, articulatory features are used to segment speech into broad manner classes, using the probability-like outputs of SVMs to perform the classification every 5 ms over a 10 ms frame. The authors found that, for this task, SVMs perform significantly better than HMMs.
[Figure: length normalization for SVM classification: the variable-length sequence of speech frames is divided into three regions (30%-40%-30%), the frames of each region are averaged, and the three mean vectors are concatenated into a fixed-length input for the SVM classifier.]
where fij (x) is the distance between the vector x and the hyperplane used to
classify between class i and class j. This estimation can be improved using
a softmax function as follows:
$$\hat{S}_i(\mathbf{x}) = \frac{\exp\left(S_i(\mathbf{x})/k\right)}{\sum_j \exp\left(S_j(\mathbf{x})/k\right)}, \tag{20}$$
[Figure: tokens propagate over the frames (time) of the MFCC stream; the accumulated probability of each token is updated with the state probability of the current class (e.g., 'o') plus the transition probability between words (e.g., 'uno'-'tres').]
Fig. 4. An illustration of the Token Passing Algorithm for a very simple grammar
modify this Link are represented by solid lines, while those that do not modify it are represented by dashed lines. Proceeding as usual in the Viterbi algorithm, only the path leading to the highest probability for every node is kept.
When the Viterbi algorithm has explored all the frames, the Token with the highest accumulated probability is chosen, and its Link to the (sequence of) word-ends provides us with the sequence of recognized words.
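A minimal sketch of the frame-level Viterbi search underlying this idea is given below; it works on generic per-frame class scores and a class transition matrix, and is only an illustration of the decoding principle, not the actual implementation of [43, 44].

```python
import numpy as np

def viterbi_decode(frame_logprobs, trans_logprobs):
    """Find the most likely class sequence given per-frame class log-scores
    (T x C) and a matrix of class transition log-probabilities (C x C)."""
    T, C = frame_logprobs.shape
    delta = np.empty((T, C))
    back = np.zeros((T, C), dtype=int)
    delta[0] = frame_logprobs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans_logprobs      # all previous -> current
        back[t] = np.argmax(scores, axis=0)                  # best predecessor per class
        delta[t] = scores[back[t], np.arange(C)] + frame_logprobs[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                            # backtrack the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Collapsing consecutive repetitions of the returned labels yields the
# recognized sequence of classes (phones or parts of phones).
```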
The number of training samples that the system is able to use becomes a practical problem for the SVM system, for both training and testing. In the training process, typically, all the kernel values (or a high percentage of them) must be allocated in the computer memory, which limits the number of training samples as a function of the available memory. A large training set also implies a high computational cost from the classification (test) point of view, since the number of Support Vectors (SV) increases linearly with the number of training samples.
The Multiclass problem. In order to solve it, the 1-vs-1 approach is used. This method allows the whole system to be trained using a maximum number of different samples for each class, while keeping the use of computer memory limited. For 18 classes, this method implies training and using 18·(18−1)/2 = 153 SVMs, where each SVM classifies each frame between two of the possible phones, the winning class being decided by voting.
The definition of classes. When each class is a phone, the time variation typically exhibited by actual phones is not taken into account. Some time variation can be embedded through the delta parameters, but better solutions should be considered: either extending the time window covered by the parameterization (for example, considering for each time instant the concatenation of two or three consecutive feature vectors) or changing the definition of the classes so that they deal with parts of phones.
The last alternative has been chosen because it helps to deal with another SVM-related problem: the practical limitation on the number of samples for training a single SVM. By increasing the number of classes while keeping constant the number of samples used to train each SVM, the total number of samples used to train the whole system is effectively increased. The natural choice consists of defining a class for the beginning of the phone, a class for the center of the phone and, finally, a class for the end of the phone. This new approach transforms the 18 initial classes into 18·3 = 54. Therefore, the number of SVM classifiers needed for the 1-vs-1 multiclass implementation moves from 153 to 1431.
To use these new classes, an allowed-transition matrix should be included to constrain the class transitions allowed during the Viterbi-based exploration. Furthermore, a transition-probability matrix can be used instead of the previously mentioned allowed-transition matrix. The transition probabilities, aij, can be estimated from the number of transitions from i to j observed in the samples of the available training set.
The results obtained with this approach show that SVMs can become a competitive alternative to HMMs in continuous speech recognition [43]. With a very small
Fig. 5. a) Identifying an SVM class per phone; b) Identifying an SVM class as a part of a phone: three SVM classes per phone
In this chapter we have reviewed the research work dealing with SVMs for ASR. We started by reviewing the reasons reported in the literature to support the use of SVMs for ASR. Thus, we explained the characteristics of SVMs that make them valuable from the ASR point of view: first, SVMs are discriminative models and thus more appropriate for classification problems; second, as opposed to ANNs, they have the advantage of being able to deal with samples of much higher dimensionality; and third, they exhibit an excellent generalization ability that makes them especially suitable for dealing with noisy speech.
After motivating the theme of research, we described the problems that arise when attempting to use SVMs in this context: first, SVMs were originally formulated to process fixed-dimension input vectors and consequently it is not straightforward to manage the time variability of speech; second, SVMs were originally devised as binary classifiers while the ASR problem is multiclass; and third, current SVM training algorithms are not able to manage the huge databases typically used in ASR.
The largest section of the chapter is dedicated to presenting an overview of the solutions that have been proposed to the previously mentioned problems. The exposition has been organized into two subsections, depending on the complexity of the tackled ASR tasks: low- and medium-complexity ASR tasks.
Within the low-complexity tasks subsection, the most relevant alternatives to overcome the problem of speech time variability have been described, highlighting the one based on DTAK-SVMs, which is a genuine SVM-based system able to manage the variable input dimension. The experimental results reveal that the
Acknowledgement
This work has been partially supported by the regional grant (Comunidad
Autónoma de Madrid-UC3M) UC3M-TEC-05-059.
References
[21] C. Ma, M.A. Randolph, and J. Drish. A support vector machines-based rejection
technique for speech recognition. In Proceedings of the International Conference
on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages 381–384,
Salt Lake City, Utah (USA), 2001.
[22] C.W. Hsu and C.J. Lin. A Comparison of Methods for Multi-class Support Vector
Machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[23] A. Ganapathiraju, J.E. Hamaker, and J. Picone. Applications of support vector
machines to speech recognition. IEEE Transactions on Signal Processing, 52:2348–
2355, 2004.
[24] N. Thubthong and B. Kijsirikul. Support vector machines for thai phoneme recog-
nition. International Journal of Uncertainty, Fuzziness and Knowledge-Based
Systems, 9:803–13, 2001.
[25] P. Clarkson and P.J. Moreno. On the use of support vector machines for phonetic
classification. In IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), volume 2, pages 585–588, Phoenix, Arizona (USA), 1999.
[26] C. Sekhar, W.F. Lee, K. Takeda, and F. Itakura. Acoustic modelling of subword
units using support vector machines. In Workshop on spoken language processing,
Mumbai, India, 2003.
[27] S. Young. HTK-Hidden Markov Model Toolkit (ver 2.1). Cambridge University,
1995.
[28] J.M. Garcı́a-Cabellos, C. Peláez-Moreno, A. Gallardo-Antolı́n, F. Pérez-Cruz, and
F. Dı́az-de-Marı́a. SVM Classifiers for ASR: A Discusion about Parameterization.
In Proceedings of EUSIPCO 2004, pages 2067–2070, Wien, Austria, 2004.
[29] A. Ech-Cherif, M. Kohili, A. Benyettou, and M. Benyettou. Lagrangian support
vector machines for phoneme classification. In Proceedings of the 9th International
Conference on Neural Information Processing (ICONIP ’02), volume 5, pages
2507–2511, Singapore, 2002.
[30] D. Martı́n-Iglesias, J. Bernal-Chaves, C. Peláez-Moreno, A. Gallardo-Antolı́n, and
F. Dı́az-de-Marı́a. Nonlinear Analyses and Algorithms for Speech Processing, vol-
ume LNAI 3817 of Lecture Notes in Computer Science, chapter A Speech Recog-
nizer based on Multiclass SVMs with HMM-Guided Segmentation, pages 256–266.
Springer, 2005.
[31] R. Solera-Ureña, D. Martı́n-Iglesias, A. Gallardo-Antolı́n, C. Peláez-Moreno, and
F. Dı́az-de-Marı́a. Robust ASR using Support Vector Machines. Speech Commu-
nication, Elsevier (submitted), 2006.
[32] S.V. Gangashetty, C. Sekhar, and B. Yegnanarayana. Combining evidence from
multiple classifiers for recognition of consonant-vowel units of speech in multiple
languages. In Proceedings of the International Conference on Intelligent Sensing
and Information Processing, pages 387–391, Chennai, India, 2005.
[33] H. Shimodaira, K.I. Noma, M. Nakai, and S. Sagayama. Support vector machine
with dynamic time-alignment kernel for speech recognition. In Proceedings of
Eurospeech, pages 1841–1844, Aalborg, Denmark, 2001.
[34] H. Shimodaira, K. Noma, and M. Nakai. Advances in Neural Information Process-
ing Systems 14, volume 2, chapter Dynamic Time-Alignment Kernel in Support
Vector Machine, pages 921–928. MIT Press, Cambridge, MA (USA), 2002.
[35] L. R. Rabiner, A.E. Rosenberg, and S.E. Levinson. Considerations in Dynamic
Time Warping Algorithms for Discrete Word Recognition. IEEE Transactions on
Acoustics, Speech and Signal Processing, 26(6):575–582, 1978.
[36] J. R. Glass. A probabilistic framework for segment-based speech recognition.
Computer Speech and Language, 17:137–152, 2003.
Abstract. This paper deals with the problem of enhancing the quality of speech
signals, which has received growing attention in the last few decades. Many dif-
ferent approaches have been proposed in the literature under various configura-
tions and operating hypotheses. The aim of this paper is to give an overview of
the main classes of noise reduction algorithms proposed to-date, focusing on the
case of additive independent noise. In this context, we first distinguish between
single and multi channel solutions, with the former generally shown to be based
on statistical estimation of the involved signals whereas the latter usually em-
ploy adaptive procedures (as in the classical adaptive noise cancellation
scheme). Within these two general classes, we distinguish between certain sub-
families of algorithms. Subsequently, the impact of nonlinearity on the speech
enhancement problem is highlighted: the lack of perfect linearity in related
processes and the non-Gaussian nature of the involved signals are shown to
have motivated several researchers to propose a range of efficient nonlinear
techniques for speech enhancement. Finally, for comparative purposes, the paper summarizes in tabular form the general features, the operating assumptions, the relative advantages and drawbacks, and the various types of nonlinear techniques for each class of speech enhancement strategy.
Keywords: Single-channel/Multi-channel Speech Enhancement, Noise Reduc-
tion, Noise Cancellation, Microphone array, Non-linear techniques.
1 Introduction
The goal of speech enhancement systems is either to improve the perceived quality of
the speech, or to increase its intelligibility [1-3]. There is a large variety of real world
applications for speech enhancement in audio signal processing – for example, we
experience the presence of degraded speech both in military and commercial commu-
nications, induced by different transmission channels (telephony) or produced in vari-
ous noisy environments (vehicles, home-office etc.). Due to the growing interest in
this subject, numerous efforts have been made over the past 20 years or so by the sci-
entific community in order to find an effective solution for the speech enhancement
problem. The different nature of interfering sounds, the admissible assumptions on the
generating process of speech degradation and on the available observables, and the
The degradation of the speech signal can be modeled, in a quite general manner, as
follows:
$$y[k] = h[k] * s[k] + n[k] = s_h[k] + n[k] \tag{1}$$
Spectral Subtraction (SS) [6] is probably the earliest and most well-known technique
for single channel speech enhancement: it is often still used due to its efficacy and
simplicity. In its most basic form (involving subtraction/filtering of power spectral
density/amplitudes), the noise power spectral density is estimated, but the method in-
troduces musical noise and other distortions in the recovered signal [1],[3]. Some in-
teresting solutions involving nonlinear techniques have been proposed in the literature
to overcome such drawbacks [1], [7]. However, as widely agreed, the best algorithm from this perspective is the one proposed by Ephraim and Malah [8-10], which is closely related to the pioneering work of McAulay and Malpass [11]. It is based on the minimum mean square error (MMSE) estimation of the speech spectrum in the logarithmic domain, and it is a natural extension of the estimator in the linear domain [8]. Fur-
ther improvements have been achieved through the employment of better performing
MMSE estimators, as in Xie and Compernolle [12], or by making the spectral subtrac-
tion procedure dependent on the properties of the human auditory system [13].
Another interesting derivative of SS is the signal subspace approach [14], [15], based on an estimation of the clean speech, as also done in the case of the Bayesian approach for speech enhancement using Hidden Markov Models (HMMs) [16], [17]. Since this approach in its basic form is essentially linear and thus outside the scope of this work, it is not described in the following. HMMs have also been successfully used in nonlinear estimation frameworks [18], [19], where some speech data is assumed to be available for training. Other methods relying on the availability of a suitable training set have also been developed. Among these, we can cite the time-domain and transform-domain nonlinear filtering methods employing neural networks [20-26].
On the other hand, unsupervised single-channel speech enhancement techniques have received significant attention recently. Examples include the Extended Kalman Filtering [15], [27-28], Monte-Carlo simulation [4], Particle filtering [4], [21], [30] and the Noise-Regularized Adaptive Filtering [15], [31] approaches, which can
enable significant noise reduction even in difficult situations (involving noise non-Gaussianity or system nonlinearity).
where m is the sensor index. When m = 2 we refer to the so-called binaural case if the spacing between the microphones is comparable to that between human ears.
In recent years the scientific community has particularly focused its attention on multichannel techniques, as they generally provide remarkable improvements over single-channel ones. As highlighted in some recent works [3], using a single channel it is not possible to improve both intelligibility and quality of the recovered signal at the same time: quality can be improved at the expense of sacrificing intelligibility. A way to overcome this limitation is to add some spatial information to the time/frequency information available in the single-channel case. This additional information can be obtained using two or more channels of noisy speech.
Adaptive noise cancellation [32-33] can be viewed as a particular case of the multi-
channel speech enhancement problem. Indeed, we have two observables, the noisy
speech and the reference noise, and the goal is to get an enhanced output speech adap-
tively according to the scheme in Fig.1. Classical methods based on full-band multi-
microphone noise cancellation implementations can produce excellent results in
anechoic environments with localized sound radiators, however performance deterio-
rates in reverberant environments. Adaptive sub-band processing has been found to
overcome these limitations [34]. The idea of involving sub-band diverse processing to
take account of the coherence between noise signals from multiple sensors has been
implemented as part of the so-called Multi-Microphone Sub-Band Adaptive
(MMSBA) speech enhancement system [35-38].
The main limitation of these linear approaches is that they are not able to deal effectively with the non-Gaussianity of the involved signals or the nonlinear distortions arising from the electro-acoustic transmission systems. As a result, several nonlinear approaches have been proposed to date, mainly employing Neural Networks (NN) and Volterra Filtering (VF); see, for example, [39-42], [25]. Such non-linear processing ap-
proaches have also been successfully implemented within the MMSBA architecture,
as will be highlighted later on.
If available, more than two microphones (resulting in a microphone array) can be
used in order to achieve better performance for noise reduction. The most common
approaches here are represented by the delay-and-sum array and the adaptive beam-
former [43]. Among the large variety of linear approaches that have appeared in the literature so far for speech enhancement, some nonlinear microphone arrays have also been proposed [44-46], which seem to exhibit significant performance improvements over their linear counterparts. Another interesting nonlinear approach in the
microphone array area is represented by the idea of estimating the log spectra of in-
volved signals (as in the single channel case), taking advantage of the availability of
more sensors [47-51].
[Block diagram: the noise channel is passed through a Fourier transform and a digital filter to obtain an amplitude estimate, which is subtracted from the amplitude of the Fourier transform of the noisy speech; the noisy-speech phase is kept and the inverse Fourier transform yields the output speech.]
Fig. 1. Spectral Subtraction when two microphones are available (subtraction of amplitudes)
$$|\hat{S}(\omega)|^2 = \begin{cases} |Y(\omega)|^2 - |\hat{N}(\omega)|^2 & \text{if } |Y(\omega)|^2 > |\hat{N}(\omega)|^2 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $|\hat{N}(\omega)|^2$ is the noise power spectral estimate. The phase of the noisy speech is left unchanged, so the enhanced signal in the time domain is obtained as:

$$\hat{s}(k) = \mathrm{IFFT}\left[ |\hat{S}(\omega)|\, e^{j \arg(Y(\omega))} \right]. \tag{4}$$
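A minimal numpy sketch of this basic scheme is shown below; it assumes a noise power estimate per frequency bin is already available (e.g., averaged over speech-free frames), and the frame length, hop size, window and the omitted overlap-add normalization are illustrative simplifications.

```python
import numpy as np

def spectral_subtraction(noisy, noise_psd, frame_len=256, hop=128):
    """Power spectral subtraction (Eqs. (3)-(4)): subtract the noise power estimate
    from each short-time power spectrum, keep the noisy phase, overlap-add."""
    window = np.hanning(frame_len)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len, hop):
        Y = np.fft.rfft(noisy[start:start + frame_len] * window)
        clean_power = np.maximum(np.abs(Y) ** 2 - noise_psd, 0.0)   # Eq. (3)
        S = np.sqrt(clean_power) * np.exp(1j * np.angle(Y))         # keep noisy phase
        out[start:start + frame_len] += np.fft.irfft(S, n=frame_len) * window
    return out   # (window normalization of the overlap-add omitted for brevity)
```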
Subtractive-type algorithms can be studied using a second approach termed filter-
ing of noisy speech, involving the use of a time-varying linear filter dependent on the
characteristics of the noisy signal spectrum and on the estimated noise spectrum. The
noise suppression process becomes a product of the short-time spectral magnitude of
the noisy speech Y (ω ) with a gain function G ( ω ) as follows:
$$G(\omega) = 1 - \frac{|\hat{N}(\omega)|^2}{|Y(\omega)|^2} = \frac{R_{post}(\omega)}{1 + R_{post}(\omega)} \tag{6}$$

which is constrained to be null if the estimated noise power level exceeds that of the noisy speech. $R_{post}(\omega) = \left(|Y(\omega)|^2 / |\hat{N}(\omega)|^2\right) - 1$ is the a posteriori SNR. In other
words, such a subtractive scheme results in emphasizing the spectral components pro-
portionally to the amount by which they exceed noise. As can be seen in (6), G ( ω )
can be written as a function of the a-posteriori SNR, and many different rules, namely
suppression curves, have been proposed so far. Their aim is to make the application of
G ( ω ) more flexible in order to reduce the effect of musical noise that is characteris-
tic of the classical SS approach [1-2], [6]. From this perspective, an interesting
solution has been proposed, the so-called nonlinear SS [7], according to which a nonlinear estimate of the noise power spectral density, $|\hat{N}_{nl}(\omega)|^2$, is used in (3) as follows:

$$|\hat{N}_{nl}(\omega)|^2 = \Phi\!\left( \max_{\text{over } M \text{ frames}}\left(|\hat{N}(\omega)|^2\right),\; R_{post}(\omega),\; |\hat{N}(\omega)|^2 \right) \tag{7}$$
where $\Phi(\cdot)$ is the nonlinearity involved in the estimation process. A possible formulation for this is:

$$\Phi\!\left( \max_{\text{over } M \text{ frames}}\left(|\hat{N}(\omega)|^2\right),\; R_{post}(\omega) \right) = \frac{\max_{\text{over } M \text{ frames}}\left(|\hat{N}(\omega)|^2\right)}{1 + \gamma\, R_{post}(\omega)} \tag{8}$$
with γ being a design parameter. Equation (8) says that as the SNR decreases the output of the nonlinear estimator approaches the maximum value of the noise spectrum over M frames, and as the SNR increases it approaches zero. One can consider a more complicated $\Phi(\cdot)$, depending also on $|\hat{N}(\omega)|^2$, which can be useful if one is interested in
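A compact sketch of the nonlinear noise estimate of Eq. (8), given only as an illustration under the assumptions stated there (per-bin quantities, with gamma as the design parameter):

```python
import numpy as np

def nonlinear_noise_estimate(noise_psd_history, r_post, gamma=0.5):
    """Eq. (8): the maximum of the noise power estimates over the last M frames,
    scaled down as the a posteriori SNR grows."""
    n_max = np.max(noise_psd_history, axis=0)     # max over the M stored frames, per bin
    return n_max / (1.0 + gamma * r_post)
```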
The Ephraim-Malah algorithm [8-10] has received much attention from the scientific community. This is mainly due to its ability to achieve a highly satisfying overall quality of the enhanced speech, which is appreciably artifact-free; these characteristics make it suitable for practical implementation in digital hearing aids. Such an approach has been shown to outperform conventional SS schemes, as it is based on an estimation of the short-time spectral amplitude (STSA) of the speech signal. The same is also the case with the Soft-Decision Noise Suppression filter of McAulay
and Malpass [11] where the STSA estimator is derived from an optimal (in the
Maximum-Likelihood sense) variance estimator. In Ephraim and Malah (1985), an MMSE (minimum mean square error) STSA estimator is derived and applied in an SS scheme. The basic assumptions are the statistical independence of speech and noise, with the spectral components of each of these two processes modeled as zero-mean, statistically independent Gaussian random variables.
As pointed out in several papers, the main difference between the two STSA based
approaches, i.e. [8] and [11], is that the former is able to yield colourless residual
noise, whereas musical noise is still present after processing the observable through
the latter procedure. In the following only the main formulae constituting the Eph-
raim-Malah noise suppressor are reported. Omitting the time and frequency indexes
( l , ω ) in order to shorten the notation, the suppression curve G ( l , ω ) to be applied to
the short-time spectrum value Y ( l , ω ) can be expressed as:
$$G(l,\omega) = \frac{\sqrt{\pi}}{2} \sqrt{\left(\frac{1}{1+R_{post}}\right)\left(\frac{R_{prio}}{1+R_{prio}}\right)} \; M\!\left( (1+R_{post})\,\frac{R_{prio}}{1+R_{prio}} \right) \tag{9}$$

where $M(\cdot)$ is the nonlinearity based on 0th and 1st order Bessel functions:

$$M(\theta) = \exp\!\left(-\frac{\theta}{2}\right)\left[ (1+\theta)\, I_0\!\left(\frac{\theta}{2}\right) + \theta\, I_1\!\left(\frac{\theta}{2}\right) \right]. \tag{10}$$
The formulations of the a-priori SNR and a-posteriori SNR, respectively (for each value of the time and frequency indexes), are given below:

$$R_{post}(l,\omega) = \frac{|Y(l,\omega)|^2}{|\hat{N}(\omega)|^2} - 1,$$
$$R_{prio}(l,\omega) = (1-\alpha)\, P\!\left[ R_{post}(l,\omega) \right] + \alpha\, \frac{|G(l-1,\omega)\, Y(l-1,\omega)|^2}{|\hat{N}(\omega)|^2} \tag{11}$$
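For illustration only, the following sketch evaluates the gain of Eqs. (9)-(11) under the reconstruction given above; M(·) is computed with scipy's exponentially scaled Bessel functions to avoid overflow, and the half-wave rectification used for P[·] as well as the value of α are standard assumptions of the decision-directed rule.

```python
import numpy as np
from scipy.special import i0e, i1e

def em_gain(r_prio, r_post):
    """Ephraim-Malah STSA gain (Eq. (9)); exp(-v/2)*I0, I1 are folded into i0e, i1e."""
    v = (1.0 + r_post) * r_prio / (1.0 + r_prio)
    m = (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)           # Eq. (10), overflow-safe
    return 0.5 * np.sqrt(np.pi) * np.sqrt(r_prio / ((1.0 + r_prio) * (1.0 + r_post))) * m

def decision_directed(prev_gain, prev_noisy_mag, noisy_power, noise_psd, alpha=0.98):
    """A posteriori SNR and decision-directed a priori SNR (Eq. (11))."""
    r_post = noisy_power / noise_psd - 1.0
    r_prio = (1.0 - alpha) * np.maximum(r_post, 0.0) \
             + alpha * (prev_gain * prev_noisy_mag) ** 2 / noise_psd
    return r_prio, r_post
```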
An interesting generalization of the rule described by (9) has also been derived to address this problem. However, it is not reported here, as it has been shown to behave similarly to the log-spectral estimator developed in [9]. Such an approach comprises a nonlinear spectral estimator performing the MMSE estimation of the log-spectra. The underlying motivation is that a distortion measure based on the MSE of the log spectra is subjectively more meaningful than its counterpart based on the MSE of the ordinary spectra. The spectral gain $G_{log}(l,\omega)$ of the MMSE log-spectral amplitude estimator is
$$G_{log}(l,\omega) = \frac{R_{prio}}{1+R_{prio}} \cdot \exp\!\left[ \frac{1}{2} \int_{\kappa(\omega)}^{+\infty} \frac{e^{-t}}{t}\, dt \right] \tag{12}$$

where $R_{prio}$ and $R_{post}$ are defined as above and the following holds:

$$\kappa(\omega) = \frac{R_{prio}}{1+R_{prio}}\,(1+R_{post}). \tag{13}$$
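Under the same assumptions, the log-spectral gain of Eqs. (12)-(13) can be sketched directly with scipy's exponential integral:

```python
import numpy as np
from scipy.special import exp1

def log_mmse_gain(r_prio, r_post):
    """MMSE log-spectral amplitude gain (Eq. (12)); the integral of exp(-t)/t
    from kappa to infinity is the exponential integral E1(kappa) of Eq. (13)."""
    kappa = r_prio / (1.0 + r_prio) * (1.0 + r_post)
    return r_prio / (1.0 + r_prio) * np.exp(0.5 * exp1(kappa))
```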
As observed by the authors, the rule (12) allows higher noise suppression, leaving
unchanged the quality of the output speech with respect to the gain function in (9).
Further improvement in the performance achievable through this approach has
been demonstrated in [12], whose authors employ an empirical approach to yield a numerical solution to the MMSE estimate in the log-spectral domain. Assuming that the speech and noise log spectra have normal distributions, it can be shown that the MMSE estimate of the speech log spectrum at a certain time instant and frequency bin $(l,\omega)$ is a function of the noisy observations and of the probabilistic model parameters (means and variances $\{\mu_s, \sigma_s, \mu_n, \sigma_n\}$). Such a function must be approximated, and the authors of [12]
propose the novel use of a multi-layer perceptron (MLP) neural network. Monte Carlo
simulations are used to get an adequate input/output training set for the network under
the assumed statistics; and the approximation problem then turns out to be a curve fit-
ting one by considering the MMSE estimation as a gain function. Assuming the presence of a VAD to ensure the calculation of noise statistics during silence periods (even in slowly time-varying environments), assuming the variance of the speech log spectra to be fixed and known, and reformulating the parameter model after proper normalization, the approximation scheme for the MMSE estimation can be formulated as shown in Fig. 2.
Other important nonlinear methods for single-channel speech enhancement are presented and analyzed in this section. These generally provide a suitable estimate of the clean speech signal by means of nonlinear models, in order to take into account the nonlinearities within the dynamic process determining speech production. We shall consider here some techniques assuming the availability of clean speech training data for the underlying nonlinear model. The classical techniques using Neural Networks as nonlinear filters mapping the noisy speech to clean speech, in the time domain or in other domains [20], allow good estimates to be obtained only
[Block diagram: a VAD and the noise statistics $(\mu_n, \sigma_n)$ feed a normalization process applied to the noisy observation $Y(p,\omega)$ and its mean $\mu_y(p,\omega)$; the normalized quantities are the inputs of a neural network that outputs the gain $G(p,\omega)$.]
Fig. 2. Approximation process of the MMSE estimation in the log spectral domain
assuming speech and noise stationarity. A time variant model can be achieved by cre-
ating different fixed models for corresponding dynamical regimes of the signals and
switching between these models during the speech enhancement process.
We therefore start from a straightforward neural extension of the work by Ephraim [16-17], represented by the principled switching method proposed by Lee [18], which incorporates the extended Kalman filtering approach (to be discussed later). HMMs have been shown to be an effective tool in the presence of signal uncertainty [16], due to their capability of automatically dividing the received speech signal into various classes. With reference to [18], each HMM state provides a maximum-likelihood estimate $\hat{s}(k)$ under the assumption that the windowed observation vector $\mathbf{y}(k)$ belongs to class $i$. The overall estimate is given by
$$\hat{s}(k) = \sum_i p\big(class_i \mid \mathbf{y}(k)\big) \cdot \big[\hat{s}(k) \mid \mathbf{y}(k), class_i\big] \tag{14}$$

where $p(class_i \mid \mathbf{y}(k))$ is the probability of being in class $i$ given the window of noisy observations $\mathbf{y}(k)$, and the second term in the sum represents the maximum-likelihood estimate of the speech given class $i$ and the data. The posterior class probability $p(class_i \mid \mathbf{y}(k))$ is easily calculated using standard forward-backward recursive formulas for HMMs. Alternatively, the estimate $\hat{s}(k)$ may simply be taken as the estimate of the single filter whose posterior class probability is maximum:

$$\hat{s}(k) = \big[\hat{s}(k) \mid class_m\big] \quad \text{with} \quad p\big(class_m \mid \mathbf{y}(k)\big) \ge p\big(class_i \mid \mathbf{y}(k)\big) \;\; \forall i. \tag{15}$$
autoregressive neural models, whereas the speech innovations variance $\sigma_n^2$ can be estimated from the clean speech for each class.
A recent variant of the above approach of Lee et al. [18] has been proposed in [19], wherein the nonlinear prediction model is based on a Recurrent Neural Network (RNN). The enhanced speech is the output of an architecture, namely the RNPHMM (Recurrent Neural Predictive Hidden Markov Model), resulting from the combination of an RNN and an HMM. As in the previous approach [18], the unknown parameters are estimated by a learning algorithm derived from the Baum-Welch and RNN back-propagation algorithms.
As previously outlined, Neural Networks can also be used as nonlinear time-domain filters, fed with the noisy speech signal to yield the estimate of the clean speech. The training is performed using clean speech (from a known database) artificially corrupted to create noisy input data, presented to the network by sliding the observation window over the available signal. The Tamura approach [22-23] is one of the oldest and most representative of this category: a four-layered neural network is used and trained for hetero-association, employing noisy speech signal patterns at the input and the corresponding noise-free signal patterns at the output. The results have been compared with those obtained with spectral subtraction through subjective listening tests, which concluded that most listeners preferred the neural-network-filtered speech.
Another classical scheme is the one used in [24], where the noisy signal is filtered through a feedforward network with an M-unit hidden layer and a single output unit, whose notation is used in the following. For each time instant $k$, each hidden unit computes the weighted sum of its inputs and subsequently applies a compressor function $f: \mathbb{R} \to \mathbb{R}$ to produce its output activation. It can be shown that, for every desired input-output mapping in the form of a real-valued continuous function $f_d: \mathbf{x} \in \mathbb{R}^K \to \mathbb{R}$ and for a non-constant, bounded and monotonically increasing activation function $f(\cdot)$ at all hidden elements, an integer $M$, an $M \times K$ matrix $\mathbf{U} = [u_{ij}]$ and $M$-dimensional vectors $\mathbf{v} = [v_j]$ and $\mathbf{b} = [b_j]$ exist such that
noise filtering applications, the mapping of the noisy signal to the corresponding clean signal is not usually a mathematical function $f_d(\cdot)$, and this violates one of the existence conditions of the above-stated theorem. For the filter adaptation, a back-propagation approach is usually used.
Neural network structures can also be successfully used in the transformed domain
[28] to carry out the enhancement process, following a suitable training phase. The
approach followed is generally based on a multistage architecture, comprising:
The problem of finding the maximum likelihood estimates of the speech and the
model parameters, given the noisy data, has been successfully addressed by Wan and
Nelson [16] [28] using neural autoregressive models and the Extended Kalman Filter-
ing (EKF) method. The speech model in the time domain is the following non-linear
autoregressive model:
$$s(k) = f\big(s(k-1), \dots, s(k-K), \mathbf{w}\big) + v(k)$$
$$y(k) = s(k) + n(k) \tag{17}$$
where $v(k)$ is the process noise in the state equation, usually assumed to be white, and $K$
is the model time length. A different model is used for each frame into which the
noisy signal is segmented. The EKF method is able to yield the ML optimal estimate
if the model is known. However, if no suitable data set for training is provided, the
model parameters have to be learnt from the available observable sequence. Kalman
Filter theory can be directly applied to the autoregressive model above, if we rewrite
it in the state-space form and f (.) is assumed to be linear:
$$\mathbf{s}(k) = \mathbf{F}\big[\mathbf{s}(k-1)\big] + \mathbf{B}\, v(k)$$
$$y(k) = \mathbf{C}\, \mathbf{s}(k) + n(k) \tag{18}$$

$$\mathbf{s}(k) = \big[ s(k), \dots, s(k-K+1) \big]^T, \qquad \mathbf{C} = [1 \;\; 0 \;\; \cdots \;\; 0], \qquad \mathbf{B} = \mathbf{C}^T$$
$$\hat{\mathbf{s}}^-(k) = \mathbf{F}\big[\hat{\mathbf{s}}(k-1), \hat{\mathbf{w}}(k-1)\big]$$
$$\mathbf{P}^-_{\hat{s}}(k) = \mathbf{A}\, \mathbf{P}_{\hat{s}}(k-1)\, \mathbf{A}^T + \mathbf{B}\, \sigma_v^2(k)\, \mathbf{B}^T, \qquad \mathbf{A} = \frac{\partial \mathbf{F}\big[\hat{\mathbf{s}}(k-1), \hat{\mathbf{w}}\big]}{\partial \hat{\mathbf{s}}(k-1)}$$
$$\mathbf{G}(k) = \mathbf{P}^-_{\hat{s}}(k)\, \mathbf{C}^T \big( \mathbf{C}\, \mathbf{P}^-_{\hat{s}}(k)\, \mathbf{C}^T + \sigma_n^2(k) \big)^{-1} \tag{20}$$
$$\mathbf{P}_{\hat{s}}(k) = \big( \mathbf{I} - \mathbf{G}(k)\, \mathbf{C} \big) \mathbf{P}^-_{\hat{s}}(k)$$
$$\hat{\mathbf{s}}(k) = \hat{\mathbf{s}}^-(k) + \mathbf{G}(k) \big( y(k) - \mathbf{C}\, \hat{\mathbf{s}}^-(k) \big)$$
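The following numpy sketch implements one recursion of Eq. (20); the predictor F and its Jacobian are passed in as functions (standing for the neural autoregressive model and its linearization), and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def ekf_step(s_prev, P_prev, y, F, jacobian_F, sigma_v2, sigma_n2):
    """One EKF recursion (Eq. (20)) for the state-space speech model: a time
    update through the predictor F, then a measurement update with the scalar
    noisy observation y."""
    K = len(s_prev)
    C = np.zeros(K); C[0] = 1.0                  # observation selects the newest sample
    B = C.copy()                                 # process noise enters the same component
    s_pred = F(s_prev)                           # a priori state estimate
    A = jacobian_F(s_prev)                       # linearization of F at the last estimate
    P_pred = A @ P_prev @ A.T + sigma_v2 * np.outer(B, B)
    S = C @ P_pred @ C + sigma_n2                # innovation variance (scalar)
    G = P_pred @ C / S                           # Kalman gain
    s_new = s_pred + G * (y - C @ s_pred)
    P_new = (np.eye(K) - np.outer(G, C)) @ P_pred
    return s_new, P_new
```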
However, note that one cannot rely exclusively on such a procedure to get what is required, i.e., a simultaneous estimation of the speech model and of the speech signal. As a result, a new set of state-space equations for the neural network weights $\mathbf{w}$ (used to parameterize the nonlinearity) is formulated as follows:

$$\mathbf{w}(k) = \mathbf{w}(k-1) + \boldsymbol{\alpha}(k)$$
$$y(k) = f\big(\mathbf{s}(k-1), \mathbf{w}(k)\big) + v(k) + n(k). \tag{21}$$
[Block diagram: two coupled filters run in parallel; the state filter EKFx performs the time and measurement updates that map $\hat{\mathbf{s}}(k-1)$ to $\hat{\mathbf{s}}^-(k)$ and then to $\hat{\mathbf{s}}(k)$ using the measurement $y(k)$, while the weight filter EKFw updates the model weights.]
Fig. 3. The Dual Extended Kalman Filter method for speech enhancement
present state estimate $\hat{\mathbf{x}}(k)$ is used by the weight filter EKFw and the present weight estimate $\hat{\mathbf{w}}(k)$ feeds the state filter EKFx.
This approach, as is the case with SS, does not depend on the type of signal that we
are dealing with and requires a suitable estimation of the noise statistics. The main drawback is the high computational cost of training neural networks on-line, and some partial solutions have been proposed in order to reduce the computational complexity and obtain faster convergence.
The Noise-Regularized Adaptive Filtering (NRAF) approach for speech enhancement [28], [31] involves a window-based, iterative process that is similar to the dual EKF method, but does not use an AR model for the speech. It can be considered as a direct time-domain mapping filter (in the sense developed in [31]) that avoids the need for a clean dataset to train the network.
The objective of direct filtering approaches is to map the noisy vector y(k) to an es-
timate of the speech signal sˆ ( k ) = f ( y ( k ) ) . The neural network performing the
mapping is trained by minimizing the mean-square error (MSE) cost function:
$$\min_f \; E\left\{ \big[ s(k) - f(\mathbf{y}(k)) \big]^2 \right\}. \tag{22}$$
We will now show how to minimize such a quantity without assuming that the
clean signal s(k) is known. Consider the expansion:
$$E\left\{ \big[ s(k) - f(\mathbf{y}(k)) \big]^2 \right\} = E\left\{ \big[ y(k) - f(\mathbf{y}(k)) \big]^2 \right\} + 2E\left\{ n(k)\, f(\mathbf{y}(k)) \right\} - 2E\left\{ y(k)\, n(k) \right\} + E\left\{ n^2(k) \right\} \tag{23}$$
Since the last two terms are independent of f(.), it suffices to minimize the follow-
ing alternative cost function to get the optimal solution:
$$\min_f \left\{ E\left\{ \big[ y(k) - f(\mathbf{y}(k)) \big]^2 \right\} + 2E\left\{ n(k)\, f(\mathbf{y}(k)) \right\} \right\}. \tag{24}$$
A relevant advantage arises: the clean speech is not needed. Indeed, the first term (corresponding to the cost associated with filtering the noisy signal itself) depends only on the observables, whereas the second term (namely, the regularization term) depends only on the noise statistics. An approximate solution is typically used for the latter. It is ob-
tained by using the Unscented Transformation (UT), a method for calculating the sta-
tistics of a random variable going through a nonlinear transformation. Then, at each
time instant, the network input is a suitable set of K vectors carrying the information
related to the first and second order signal statistics, whereas the corresponding output
is a weighted sample mean. A standard gradient based algorithm like back-
propagation can be used to accomplish the minimization. The effectiveness of the
method relies on the assumption that the accuracy of the second-order UT based
approximation to the regularization term is good enough to achieve the network
convergence to the true minimum MSE. Moreover, as in the Dual EKF approach, in NRAF the speech non-stationarity can also be dealt with by windowing the noisy data into short overlapping frames, with a new filter for each frame.
In addition, Monte-Carlo simulation based approaches for audio signal enhancement have recently been proposed in some scientific works [4], [29-30]. Their basic principles are discussed here. Considering the clean and noisy speech as sequences of scalar random variables, we can assume they satisfy some kind of time-varying state-space equations, as previously done in (17). With a greater degree of generality, we can characterize our system by three deterministic nonlinear transition functions, here denoted $f$, $g$, $h$. Function $f$ also depends on the discrete time $k$ (therefore it will be represented as $f_k$), whereas the other two generally depend on the system parameter vector $\mathbf{w}(k)$ (and we will denote them as $g_{\mathbf{w}_k}$, $h_{\mathbf{w}_k}$, respectively). It follows that the state-space equations are:
$$\mathbf{w}(k) = f_k\big(\mathbf{w}(k-1), \mathbf{u}(k)\big)$$
$$s(k) = g_{\mathbf{w}_k}\big(s(k-1), v(k)\big) \tag{25}$$
$$y(k) = h_{\mathbf{w}_k}\big(s(k), n(k)\big)$$
$$p\big(\mathbf{w}(k) \mid y_{1:k}\big) \propto \int p\big(y(k) \mid \mathbf{w}_{1:k}, y_{1:k-1}\big)\, p\big(\mathbf{w}(k) \mid \mathbf{w}(k-1)\big)\, p\big(\mathbf{w}_{1:k-1} \mid y_{1:k-1}\big)\, d\mathbf{w}_{1:k-1} \tag{26}$$

where $y_{1:k}$ stands for $\{y(1), \dots, y(k)\}$, and accordingly for other variables with the same notation. The quantity $p(\mathbf{w}(k) \mid y_{1:k})$ is the filtering distribution, namely the objective of our estimation problem. Now, let us suppose we have an estimate of $p(\mathbf{w}_{1:k-1} \mid y_{1:k-1})$ at time instant $k-1$. This probability density function (pdf) can be
where $\delta(\cdot)$ denotes the Dirac function. For each $i$, we can draw the $N$ samples $\{\mathbf{w}^i(k),\, i = 1,\dots,N\}$ from a proposal distribution $\pi\big(\mathbf{w}^i(k) \mid \mathbf{w}^i_{1:k-1}, y_{1:k}\big)$, namely the importance distribution. Then we can use the latter samples to augment the former and generate the new sample paths at time instant $k$: $\{\mathbf{w}^i_{1:k},\, i = 1,\dots,N\}$. A typical assumption is to set:

$$\pi\big(\mathbf{w}^i(k) \mid \mathbf{w}^i_{1:k-1}, y_{1:k}\big) = p\big(\mathbf{w}(k) \mid \mathbf{w}(k-1)\big) \tag{28}$$

It must be said that $p(\mathbf{w}(k) \mid \mathbf{w}(k-1))$ is fixed once we have chosen to apply a constrained Gaussian random walk in the TVAR coefficient domain.
Equation (27) can be substituted into (26), resulting in:

$$p\big(\mathbf{w}(k) \mid y_{1:k}\big) \approx \sum_{i=1}^{N} \theta_i(k)\, \delta\big(\mathbf{w}_{1:k} - \mathbf{w}^i_{1:k}\big) \tag{29}$$

$$\hat{I}_N\big(f_{k|k}\big) = \sum_{i=1}^{N} \theta^i_{0:k}\; E_{p(s(k) \mid \mathbf{w}^i_{1:k}, y_{1:k})}\left[ f_{k|k}\big(s(k), \mathbf{w}^i_k\big) \right] \tag{30}$$
where $p(s(k) \mid \mathbf{w}_{1:k}, y_{1:k})$ is a Gaussian distribution whose parameters may be computed using the Kalman filter and $E$ is the expectation operator. According to the principle of Sequential Importance Sampling (SIS), satisfied by our choice of the proposal distribution, a recursive evaluation of the importance weights is allowed, which
The classical scheme for Adaptive Noise Cancellation (ANC) was originally proposed by Widrow et al. [32], and has been the subject of numerous studies involving a wide range of applications. In contrast to other enhancement techniques, no a priori knowledge of the signal or noise is required for the method to be applied, but this advantage is paid for by the need for a secondary or reference input. This reference input should contain little or no signal, but it should contain a noise measurement which is correlated, in some unknown way, with the noise component of the primary input. An important step in ANC is obtaining a reference signal which satisfies the above-mentioned requirements. Referring to Fig. 4, given a noisy speech (primary) signal $y[k]$, and assuming that $s[k]$ is uncorrelated with $n_1[k]$ and $n_2[k]$, and that $n_2[k]$ is processed by a linear filter $h[k]$ (generally non-causal), it is easy to show that $E\{e^2[k]\}$ is minimized when $v[k] = n_1[k]$, so that the output speech $e[k] = s[k]$ is the desired clean signal. Hence, the adaptive filter in classical linear methods is designed to minimize $E\{(n_1[k] - v[k])^2\}$, using standard algorithms such as the least mean squares (LMS) technique.
[Fig. 4: Adaptive noise cancellation scheme. The primary input $y[k] = s[k] + n_1[k]$; the reference input $x[k] = n_2[k]$, related to $n_1[k]$ through the unknown operator $H$, is processed by the adaptive filter $W$ to produce $v[k]$, which is subtracted from $y[k]$ to give the output $e[k] = \hat{s}[k]$.]
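A minimal LMS-based sketch of this scheme is given below for illustration (the filter order and step size are arbitrary and the usual normalization/stability safeguards are omitted):

```python
import numpy as np

def lms_anc(primary, reference, order=32, mu=0.01):
    """Adaptive noise cancellation with LMS: the adaptive filter W maps the
    reference noise x[k] = n2[k] to an estimate v[k] of the interference n1[k];
    the error e[k] = y[k] - v[k] is the enhanced speech output."""
    w = np.zeros(order)
    e = np.zeros(len(primary))
    for k in range(order, len(primary)):
        x = reference[k - order:k][::-1]       # most recent reference samples
        v = w @ x                              # interference estimate
        e[k] = primary[k] - v                  # enhanced speech sample
        w += mu * e[k] * x                     # LMS weight update
    return e
```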
Linear adaptive filtering, previously described, with the mean squared error (MSE)
criterion is a standard signal processing method, and the reason for its success is the
relative simplicity of design and ease of implementation. Nevertheless, it often cannot realize the Bayes conditional mean, which is the optimal filter for the MSE criterion and is generally a nonlinear function of the observed data. An important exception is when the observed data and the data to be estimated are jointly Gaussian: in this case the Bayes filter is a linear function. Since many real-world signal processing applications have to deal with non-Gaussian signals, the use of a linear finite impulse response (FIR) or infinite impulse response (IIR) filter does not permit an acceptable level of noise or interference cancellation to be obtained, because it cannot efficiently approximate the nonlinear mapping between the known reference and the unknown interference
signal. With reference to Fig.4, we say that the reference noise is related to the inter-
ference signal by an unknown nonlinear operator H , approximated by a nonlinear
feed-forward network. The objective is to determine the unknown nonlinear operator
H by a nonlinear filter W , so that we can optimally estimate the noise n1 [ k ] and
subtract it from the signal y [ k ] . In this way the primary source signal can be esti-
mated. In the literature, a number of different techniques to design the filter W can
be found which can be conveniently grouped in three principal classes: higher order-
statistic filters, polynomial filters (in particular Volterra filters) and different kinds of
neural networks. Higher order statistics (HOS) filters are based on ordering properties
of input signals. A well-known member of this family is the Median filter, that is use-
ful in removing impulsive noise, but poor in case of Gaussian noise. In [41] third-
order statistics are used to derive novel design techniques which are more insensitive
to corruption of the primary signal by additive Gaussian noise, compared to the sec-
ond-order statistics ones. Referring to Fig.4, under the hypothesis that all signals are
zero mean and stationary, and that s [ k ] is independent of both n1 [ k ] and n2 [ k ] and
that n1 [ k ] and n2 [ k ] are someway correlated, the optimal filter W can be deter-
mined using the third-order moment by solving the following
$\sum_{i=0}^{q} w_3[i]\, R^{(3)}_{n_2}(m+i,\, l+i) = R^{(3)}_{y n_2}(m, l)$  (31)
where
$R^{(3)}_{n_2}(m, l) \triangleq E\{ n_2[k]\, n_2[k+m]\, n_2[k+l] \}, \qquad R^{(3)}_{y n_2}(m, l) \triangleq E\{ y[k]\, n_2[k+m]\, n_2[k+l] \}$  (32)
and different estimates can be obtained for different values of $(m, l)$. Furthermore, if $n_1$ is linearly related to $n_2$ (i.e., if $H$ in Fig. 4 can be modeled by a linear time-invariant (LTI) filter), then theoretically the $w_3[k]$ obtained from (31) is equivalent to that obtained by classical MSE methods, and it leads to complete cancellation of the interference by identifying the true filter $H$. In practice, the theoretical auto- and cross-correlations are substituted by consistent sample estimators computed from the available data. The Volterra Filter (VF) has the important property of being linear in its parameters, so the identification of the kernel vector $H$ in the MMSE sense can be obtained by solving a linear system of equations. To find the optimal filters, we can operate
both in the time and in the frequency domain [53]. An adaptive solution of this equation is provided by the RLS algorithm, which is based on the recursive calculation of the covariance matrix of the filter input signal. For the application of the VF to the problem of noise cancellation, we refer to Fig. 4 [42]. If $n_2[k]$ and $s[k]$ are independent and zero mean, then the previously mentioned algorithms can be used, taking $x[k] = n_2[k]$ as the filter input and $y[k] = s[k] + n_1[k]$ as the desired signal.
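Since the Volterra filter is linear in its kernels, its MMSE identification amounts to solving a linear system. The following sketch builds the linear and quadratic regressors from the reference noise and solves for the kernels by batch least squares, a simple stand-in for the RLS adaptation mentioned above; the memory length is an illustrative assumption.

import numpy as np

def volterra2_design(x, d, memory=8):
    """Fit a truncated 2nd-order Volterra filter mapping the reference x[k]
    to the desired signal d[k] (in the ANC setting, d = y = s + n1, so the
    filter output approximates n1 when s is independent of the reference).

    Because the filter is linear in its kernels, the MMSE solution is the
    least-squares solution of a linear system."""
    x, d = np.asarray(x, float), np.asarray(d, float)
    rows, targets = [], []
    for k in range(memory, len(x)):
        xv = x[k - memory:k][::-1]
        lin = xv                                            # 1st-order terms
        quad = np.outer(xv, xv)[np.triu_indices(memory)]    # unique 2nd-order terms
        rows.append(np.concatenate(([1.0], lin, quad)))
        targets.append(d[k])
    H = np.vstack(rows)
    theta, *_ = np.linalg.lstsq(H, np.array(targets), rcond=None)
    return theta, H  # stacked kernels and the regressor matrix

# Usage sketch: theta, H = volterra2_design(n2, y); n1_hat = H @ theta;
# s_hat = y[memory:] - n1_hat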
Next, a number of selected approaches to the interference cancellation problem us-
ing neural network filters are briefly analyzed, though other techniques can of course
be found in the literature too.
As previously noted, the problem of noise filtering can be viewed as the problem of finding the mapping of noisy signal patterns $y[k]$ to the corresponding noise-free signal patterns $s[k]$. According to this perspective, different kinds and topologies of neural networks can be used, depending on the relation between $n_1[k]$ and $n_2[k]$. Since a two-layer feed-forward network has been proven capable of approximating any continuous non-linear mapping, provided there is a sufficient number of hidden units, various implementations of this structure (with different contrast functions and numbers of hidden units) can be found in the literature. In [42], for example, a perceptron with one hidden layer and one output unit is used for the filter $W$ (referring to Fig. 4 for the notation).
Denoting $\mathbf{n}_2[k] = \left[ n_2[k],\, n_2[k-1],\, \ldots,\, n_2[k-K] \right]$, the mapping is described by
$v[k] = \sum_{m=1}^{M} c_m \tanh\!\left( \mathbf{w}_m^T \mathbf{n}_2[k] - b_m \right)$  (33)
where $M$ is the number of hidden units, $c_m$ and the vectors $\mathbf{w}_m$ are the weight coefficients, and $b_m$ are the biases. The training is performed using the classical back-propagation technique. No method currently exists to precisely determine the optimal solution: performance depends on the initial weights, the learning rate and the amount of training, but for small $K$ the perceptron of (33) seems to provide a good approximation of the optimum Bayes filter.
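For reference, the mapping of Eq. (33) can be evaluated with a few lines of Python; the argument names and shapes below are illustrative assumptions of this sketch.

import numpy as np

def mlp_canceller(n2_window, c, W, b):
    """Evaluate Eq. (33): v[k] = sum_m c_m * tanh(w_m^T n2[k] - b_m).

    n2_window : (K+1,) most recent reference samples [n2[k], ..., n2[k-K]]
    c         : (M,) output weights
    W         : (M, K+1) hidden-layer weight vectors w_m (one per row)
    b         : (M,) hidden-unit biases
    """
    return c @ np.tanh(W @ n2_window - b)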
The last kind of neural network analyzed in this work for the problem of noise can-
cellation is the Hyper Radial Basis Function (HRBF) neural network, following the
approach described in [40]. The main idea is to consider the mapping W in Fig.4 to
be approximated as the sum of various radial basis functions, each one with its own
prior. Defining $f_m$, $m = 1, \ldots, M$, as these functions, the functional to minimize is:
$L(n_1) = \sum_{k=1}^{K} \left( \sum_{m=1}^{M} f_m\!\left(n_2[k]\right) - n_1[k] \right)^{2} + \sum_{m=1}^{M} \gamma_m \left\| P_m f_m \right\|^{2}$  (34)
The function minimizing (34) can be expressed as
$n_1 = \sum_{m=1}^{M} \sum_{j=1}^{K} w_{mj}\, G_{mj}\!\left(\mathbf{n}_2, \mathbf{q}_{mj}\right)$  (35)
where $w_{mj}$ are weight parameters and $G_{mj}$ are Green's functions. Choosing a set of stabilizers whose Green's functions are Gaussian, the HRBF neural network becomes formally equivalent to a two-layer neural network whose hidden layer realizes an adaptive nonlinear transformation (with adjustable weight and center parameters).
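A minimal sketch of such a Gaussian hyper-RBF mapping, with per-unit centres and per-dimension widths as adjustable parameters, could look as follows; the names and shapes are assumptions of this sketch, not the exact parameterization of [40].

import numpy as np

def hrbf_forward(n2_window, weights, centers, inv_widths):
    """Evaluate a Gaussian (hyper-)RBF mapping of the reference window,
    cf. Eq. (35): each hidden unit has its own centre and its own diagonal
    scaling, which is what distinguishes the HRBF from a plain RBF network.

    weights    : (J,) output weights (flattened over m, j)
    centers    : (J, D) unit centres
    inv_widths : (J, D) per-unit, per-dimension inverse widths
    """
    diff = (n2_window - centers) * inv_widths          # (J, D) scaled distances
    g = np.exp(-0.5 * np.sum(diff ** 2, axis=1))       # Gaussian Green's functions
    return weights @ g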
Some researchers have looked to the human hearing system as a source of engineering
models to approach the enhancement problem, with some modelling the cochlea and
others utilizing a model of the lateral inhibition effect. Two or more relatively closely
spaced microphones have been used in an adaptive noise cancellation scheme [35], to
identify a differential acoustic path transfer function during a noise only period in in-
termittent speech. The extension of this work, termed the Multi-Microphone Sub-
band Adaptive (MMSBA) speech enhancement system, applies the method within a
set of sub-bands provided by a filter bank. The filter bank can be implemented using
various orthogonal transforms or by a parallel filter bank approach. The idea of em-
ploying multi-band processing for speech enhancement has also been considered in
other contributions focusing on the spectral subtraction technique [54-55]. In the
MMSBA approach [36-38], the sub-bands are distributed non-linearly according to a
cochlear distribution, as in humans, following the Greenwood model [56]. The con-
ventional MMSBA approach considerably improves the mean squared error (MSE)
convergence rate of an adaptive multi-band LMS filter compared to both the conven-
tional wideband time-domain and frequency domain LMS filters, as shown in
[36-38]. It is assumed that the speaker is close enough to the microphones so that en-
vironmental acoustic effects on the speech are insignificant, that the noise signal at
the microphones may be modelled as a point source modified by two different acous-
tic path transfer functions, and that an effective voice activity detector (VAD) is avail-
able. In practice, the MMSBA based speech-enhancement systems have been shown
to give the important benefit of supporting adaptive diverse parallel processing in the
sub-bands, namely Sub-band Processing (SBP), allowing signal features within the
sub-bands, such as the noise power, the coherence between the in-band signals from
multiple sensors and the convergence behaviour of an adaptive algorithm, to influence
the subsequent processing within the respective frequency band. The SBP can be accomplished with no processing, an intermittent coherent noise canceller, or an incoherent noise canceller. In the conventional MMSBA approach, linear FIR filtering is per-
formed within the SBP unit and the LMS algorithm is used to perform the adaptation.
In the non-linear MMSBA, Volterra Filtering based SBP has been applied (together
with the RLS algorithm), leading to a significant improvement of results, especially in
real noisy environments. The Magnitude Squared Coherence (MSC) has been applied
by [58] to noisy speech signals for noise reduction and also successfully employed as
a VAD for the case of spatially uncorrelated noises. A modified MSC has been used
for selecting an appropriate SBP option within the MMSBA system [36].
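As an illustration of the cochlear sub-band layout, the sketch below distributes sub-band edge frequencies according to the Greenwood frequency-position function; the numerical constants are the commonly quoted human values attributed to the Greenwood function [56] and are assumptions of this sketch rather than parameters given in the chapter.

import numpy as np

def greenwood_band_edges(num_bands, A=165.4, a=2.1, k=0.88):
    """Cochlea-like sub-band edge frequencies (Hz) from the Greenwood
    frequency-position function f(x) = A * (10**(a*x) - k), with x the
    normalized position along the basilar membrane (assumed human values)."""
    x = np.linspace(0.0, 1.0, num_bands + 1)   # equal spacing along the membrane
    return A * (10.0 ** (a * x) - k)

# e.g. greenwood_band_edges(8) yields edges from roughly 20 Hz up to about
# 20 kHz, packed densely at low frequencies as in the MMSBA sub-band layout.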
In the newly proposed modified MMSBA architecture [38], the Wiener filtering (WF) operation has been applied in two different ways: at the output of each sub-band adaptive noise canceller, and at the global output of the original MMSBA scheme. The employment of such post-processing (WF) within the MMSBA allows residual incoherent noise components that may result from the application of conventional MMSBA schemes to be dealt with, similar to the approach adopted in [57]. In both of the proposed architectures, the role of the WF is to further mitigate the residual noise effects on the original signal to be recovered, following application of the MMSBA noise-cancellation processing.
Finally, the MMSBA framework also allows incorporation of cross-band effects to mimic human lateral inhibition effects. One possibility is to extend the recently reported promising work of Bahoura and Rouat [59], who have shown that non-linear masking of a time-space representation of speech can be used to achieve simulated noise suppression for the monaural case, by discarding or masking the undesired (noise) signals and retaining the desired (speech) signals. They have demonstrated that this non-linear masking can enhance single-sensor or monaurally recorded speech by performing non-linear filtering with adaptive thresholding (based on the Teager Energy operator [60]) on a time-frequency (multi-band) representation of the noisy signal. In [61] the MMSBA system with linear filtering has been compared with two different adaptive sub-band binaural structures on the noise reduction problem.
In the nonlinear microphone array based on complementary beamforming [44] (depicted in Fig. 5), the primary and reference outputs of the array can be written in the STFT domain as:
$Y^{(p)}(l,\omega) = 2\,S_0(l,\omega) + \sum_{d\in\Omega} \left( \mathbf{g}\mathbf{a}_d(l,\omega) - \mathbf{h}\mathbf{a}_d(l,\omega) \right) N_d(l,\omega)$
$Y^{(r)}(l,\omega) = \sum_{d\in\Omega} \left( \mathbf{g}\mathbf{a}_d(l,\omega) - \mathbf{h}\mathbf{a}_d(l,\omega) \right) N_d(l,\omega)$  (36)
where $S_0$ is the speech signal coming from the look direction (thus coinciding with $S$ if we consider the model (2)), $\mathbf{g}$, $\mathbf{h}$ are the $M$-element complementary weight vectors, $\Omega$ is the set of directions of the different interfering signals approaching the beamformer, and $N_d$ is the noise signal corresponding to the $d$-th direction. The quantities $\mathbf{g}\mathbf{a}_d$, $\mathbf{h}\mathbf{a}_d$ describe the directivity patterns obtained with the corresponding steering vector $\mathbf{a}_d$. The speech spectrum is then estimated by subtracting the (averaged) power of the reference output from that of the primary output:
$\hat{S}(l,\omega) = \frac{1}{2}\left[\, \left|Y^{(p)}(l,\omega)\right|^{2} - E\!\left[\left|Y^{(r)}(l,\omega)\right|^{2}\right] \right]^{1/2} \exp\!\left(j\phi(\omega)\right)$  (38)
Fig. 5. Block diagram of the nonlinear mic array based on complementary beamforming: the array input is analyzed by STFT, weighted by the complementary weight sets $g_1, \ldots, g_M$ and $h_1, \ldots, h_M$ to form the primary and reference outputs, and recombined with the phase $\phi(\omega)$ of a conventional beamformer to produce the array output
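A direct transcription of the spectral estimate of Eq. (38) is sketched below; the argument names are assumptions of this sketch, and the flooring of the power difference at zero is a practical safeguard added here, not part of the original formulation.

import numpy as np

def complementary_bf_estimate(Yp, Yr, phase):
    """Spectral estimate of Eq. (38): the averaged power of the reference
    output Y^(r) is subtracted from the power of the primary output Y^(p),
    and half of the square root is recombined with the phase of a
    conventional beamformer.

    Yp, Yr : (L, F) complex STFTs of the primary and reference outputs
    phase  : (L, F) or (F,) phase term phi(omega) of the conventional beamformer
    """
    noise_power = np.mean(np.abs(Yr) ** 2, axis=0, keepdims=True)  # block-averaged E[|Y^(r)|^2]
    mag = 0.5 * np.sqrt(np.maximum(np.abs(Yp) ** 2 - noise_power, 0.0))
    return mag * np.exp(1j * phase)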
The directivity patterns are designed under the constraint of keeping the terms $\mathbf{g}\mathbf{a}_d(\omega) \cdot \mathbf{h}\mathbf{a}_d(\omega)$ as small as possible, for all $\omega, d$, in order to have a low noise contribution to the primary signal. This results in a nonlinear constrained least-squares minimization problem, tackled by a suitable iterative procedure. The approach is supervised, and the common choice for the target directivity pattern is to set the value 1 for the look direction and 0 otherwise. In such a way, it is possible to get lower sidelobes with respect to the DS array, resulting in a significant improvement of the speech enhancement capability. However, the optimization procedure employed is not specifically oriented to minimize the average gain in each direction, causing a certain difficulty in reducing directional noise. That is why another optimization scheme, within the complementary beamforming framework described above and depicted in Fig. 5, has been proposed in [45]. According to this, the power spectrum of the estimated speech, $|\hat{S}(l,\omega)|^2$, is calculated through a block averaging procedure.
For multichannel short-time spectral amplitude estimation, the gain applied to the $m$-th noisy channel takes the form:
$G_m(p,\omega) = \Gamma(1.5)\cdot\dfrac{R_{prio,m}}{\left(1+R_{post,m}\right)\left(1+\sum_{m=1}^{M} R_{prio,m}\right)}\cdot F_1\!\left(-0.5,\,1,\,-\dfrac{\left|\sum_{m=1}^{M}\sqrt{\left(1+R_{post,m}\right)R_{prio,m}}\;e^{\,i\vartheta_m}\right|^{2}}{1+\sum_{m=1}^{M} R_{prio,m}}\right)$  (40)
where F1 is the confluent hypergeometric series, Γ the Gamma function and ϑm the
m-th noisy channel phase. Eq. (40) turns to (9) when M = 1 , since (9) can be shown
to be equal to:
$G(p,\omega) = \Gamma(1.5)\cdot\dfrac{R_{prio}}{\left(1+R_{post}\right)\left(1+R_{prio}\right)}\cdot F_1\!\left(-0.5,\,1,\,-\dfrac{\left(1+R_{post}\right)R_{prio}}{1+R_{prio}}\right)$  (41)
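Eq. (41) can be transcribed directly using SciPy's confluent hypergeometric function; the sketch below assumes the chapter's earlier definitions of the a priori and a posteriori SNR terms $R_{prio}$ and $R_{post}$, which are not repeated here.

import numpy as np
from scipy.special import gamma, hyp1f1

def stsa_gain(R_prio, R_post):
    """Single-channel short-time spectral amplitude gain of Eq. (41),
    evaluated element-wise for array inputs."""
    R_prio, R_post = np.asarray(R_prio, float), np.asarray(R_post, float)
    v = (1.0 + R_post) * R_prio / (1.0 + R_prio)           # argument of F1
    return gamma(1.5) * R_prio / ((1.0 + R_post) * (1.0 + R_prio)) * hyp1f1(-0.5, 1.0, -v)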
It must be observed that (40) is obtained if perfect DOA (Direction of Arrival) cor-
rection is assumed within the microphone-array when the short-term spectral ampli-
tude estimation $\hat{S}(p,\omega)$ is performed. As pointed out in [51], for DOA-independent speech enhancement, the amplitude estimation has to be calculated by conditioning the expectation on the joint observation of the noisy amplitudes, i.e. (39) turns to:
$\hat{S}_m = E\left\{ S_m \mid Y_1, Y_2, \ldots, Y_M \right\}.$  (42)
In order to do the above in a simple and effective way, the authors in [47] sug-
gested to employ the MAP estimator proposed originally for the single-channel case
in [62]. It follows that, denoting p ( ⋅) as the probability density function (pdf) of a
generic random variable, the following has to be maximized
$\log(L) = \log\!\left( p\!\left(Y_1, Y_2, \ldots, Y_M \mid S_m\right) \cdot p\!\left(S_m\right) \right)$  (43)
Maximization of (43) leads to the multichannel MAP gain
$G_m(p,\omega) = \dfrac{\sqrt{R_{prio,m}\left(1+R_{post,m}\right)}}{2\left(1+\sum_{m=1}^{M}R_{prio,m}\right)}\cdot\mathrm{Re}\!\left\{\sum_{m=1}^{M}\sqrt{\left(1+R_{post,m}\right)R_{prio,m}}+\sqrt{\left(\sum_{m=1}^{M}\sqrt{\left(1+R_{post,m}\right)R_{prio,m}}\right)^{2}+\left(2-M\right)\left(1+\sum_{m=1}^{M}R_{prio,m}\right)}\right\}$  (44)
Along this direction we must cite the approach recently proposed by Cohen and Berdugo [49], which focused on the minimization of the Log-Spectral Amplitude (LSA) distortion in environments where time-varying noise is present. The overall scheme (Fig. 6) comprises an adaptive beamforming system (made of a fixed beamformer, a blocking matrix and a multi-channel adaptive noise canceller) and a suitable LSA estimation chain acting on the beamformer outputs, written as (in the STFT domain):
$V(l,\omega) = S_1(l,\omega) + N_{1,st}(l,\omega) + N_{1,ns}(l,\omega)$
$U_m(l,\omega) = S_m(l,\omega) + N_{m,st}(l,\omega) + N_{m,ns}(l,\omega), \qquad m = 1, \ldots, M$  (46)
where st and ns stand for stationary and non-stationary respectively. The objective is
to find a suitable estimator of S1 ( l , ω ) minimizing the LSA distortion.
The noise cancellation system is responsible for reducing the stationary contribu-
tion and yielding the signal V ( l , ω ) on which the optimally-modified log-spectral
amplitude (OM-LSA) gain function will be applied to achieve the goal. The evalua-
tion of the nature of transient occurrences is performed through a suitable estimation
of speech presence probability, which is based on a Gaussian statistical model and in
particular on the transient beam-to-reference ratio (TBRR) defined as:
$\Omega(l,\omega) = \dfrac{S\!\left[\left|V(l,\omega)\right|^{2}\right]-M\!\left[\left|V(l,\omega)\right|^{2}\right]}{\max\limits_{2\le m\le M}\left\{S\!\left[\left|U_m(l,\omega)\right|^{2}\right]-M\!\left[\left|U_m(l,\omega)\right|^{2}\right]\right\}}$  (47)
where $S[\cdot]$ and $M[\cdot]$ are the smoothing operator and the noise spectrum estimator obtained by recursively averaging past spectral power values [48]. Assuming that the beamformer steering error is low and that the interfering noise is uncorrelated with speech, a high TBRR indicates speech presence. When this is not the case, the noise estimate can be quickly updated and then given to the OM-LSA [5] estimator for final speech enhancement. As confirmed by experimental results, such an approach seems to provide an adequate estimation of the time-varying noise spectral components and thus a significant reduction of the noise impact without degrading the speech signal.
Fig. 6. Block diagram of the adaptive beamforming scheme of [49]: a fixed beamformer W(l) and a blocking matrix B(l) act on the M-dimensional input spectrum Y(l,ω); a multichannel adaptive noise canceller H(l,ω) yields the output V(l,ω), which is processed by the speech presence probability estimator, the noise spectrum estimator and the spectral enhancement (OM-LSA) block to produce Ŝ1(l,ω)
The Multiple Regression of Log-Spectra (MRLS) approach [50] relates the log-spectrum of the clean (close-talking) speech to the log-spectra of the signals acquired by $M$ distributed microphones through the linear regression
$\log\!\left(S^{(d)}\right) \approx \sum_{m=1}^{M} \lambda_m \log\!\left(Y_m^{(d)}\right),$  (48)
where each microphone signal is modeled in the time domain as
$y_m[k] = h_m[k] * s[k] + g_m[k] * n[k].$  (49)
Moving to the STFT domain, we can approximate the log-power spectrum of the $m$-th microphone signal by a two-dimensional Taylor-series expansion around the reference $Y_m^0$, so that:
$\log\left(Y_m\right) - \log\left(Y_m^0\right) \approx a_m\!\left( \log(S) - \log\left(S^0\right) \right) + b_m\!\left( \log(N) - \log\left(N^0\right) \right)$  (50)
where it can be shown that the coefficients $a_m$, $b_m$ depend on the SNR at the $m$-th location. Now, denoting by $(\cdot)^{(d)}$ the deviation from $(\cdot)^{(0)}$, (50) turns to:
$\log\!\left(Y_m^{(d)}\right) \approx a_m \log\!\left(S^{(d)}\right) + b_m \log\!\left(N^{(d)}\right).$  (51)
The regression error is then given by the difference between the two terms in (48).
The optimal weights λm can be obtained by minimizing such an error over a suitable
number T of training samples, i.e.:
$\varepsilon = \frac{1}{T}\sum_{t=1}^{T}\left\{ \left[\log\!\left(S^{(d)}\right)\right]_t - \left[\sum_{m=1}^{M} \lambda_m \log\!\left(Y_m^{(d)}\right)\right]_t \right\}^{2}.$  (52)
The orthogonality of the Discrete Cosine Transform (DCT) ensures that minimization of (52) is equivalent to minimization in the cepstral domain. Several experimental results have shown that the MRLS approach provides a good approximation of the close-talking microphone signal and outperforms the adaptive beamformer from the perspective of speech recognition performance, while also ensuring a low computational cost. Further improvements have been obtained when nonlinear regression (through Multi-Layer Perceptrons and Support Vector Machines) is employed [51]. A likely drawback is the supervised optimization procedure, which can be adopted within a speech recognition scheme but turns out to be limiting in a more general speech enhancement framework. An alternative approach to multi-channel non-linear speech enhancement has been described in [63], which applies neural network based sub-band processing (within the MMSBA processing framework) with promising initial results using real automobile reverberant data. This interesting approach warrants further investigation.
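As a minimal illustration, the weights of Eq. (52) can be obtained with an ordinary least-squares fit; the array shapes below are assumptions of this sketch, and the regression would typically be repeated per frequency bin (or per cepstral coefficient, which is equivalent by the DCT orthogonality argument above).

import numpy as np

def mrls_weights(log_S, log_Y):
    """Least-squares solution of Eq. (52): find the regression weights
    lambda_m mapping the distant-microphone log-spectra onto the
    close-talking log-spectrum over T training frames.

    log_S : (T,)   target values [log S^(d)]_t
    log_Y : (T, M) log-spectra of the M distributed microphones
    """
    lam, *_ = np.linalg.lstsq(np.asarray(log_Y, float), np.asarray(log_S, float), rcond=None)
    return lam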
5 Concluding Summary
In this section, we summarize in tabular form, for comparative purposes, the general features, operating assumptions, relative advantages and drawbacks, and the various types of non-linear techniques for each class of speech enhancement strategy reviewed in this paper. Some references related to methods not specifically described in the paper are not included in the table.
BASIC METHODS
General Features:
- NN used to obtain the nonlinear mapping Y -> x
- different topologies of NN can be used
- iterative algorithm
- joint estimation of signal and noise parameters
- classical methods make use of the EM technique
Assumptions:
- noise less correlated than speech
- short-term stationarity of the involved signals
Advantages:
- no need for clean speech data
Drawbacks:
- computationally demanding
OTHER EXTENSIONS
Dual EKF [20], [26]:
- ML estimates for both enhanced speech and parameters
- can be used also with colored noise
- usually used within the EM algorithm
- high computational cost
- frame-by-frame iteration
References
1. Vaseghi, S.V.: Advanced Signal Processing and Digital Noise Reduction (2nd ed.). John
Wiley & Sons, 2000
2. O’Shaughnessy, D.: Speech Communications – Human and Machine. IEEE Press, 2nd ed,
Piscataway, NJ, 2000
3. Benesty, J, Makino, S., and Chen, J.,: Speech Enhancement. Signal and Communication
Technology Series, Springer Verlag, 2005
4. Ephraim, Y., Cohen, I.: Recent Advancements in Speech Enhancement. The Electrical En-
gineering Handbook, CRC Press, 2005
5. Cohen, I. and Berdugo, B.H.: Speech enhancement for non-stationary noise environments.
Signal Processing, vol. 81, pp. 2403-2418, 2001.
6. Boll S.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans.
Acoust. Speech, Signal Process., ASSP-27:113-120, April 1979.
7. Lockwood, P., Boudy, J.: Experiment with a Nonlinear Spectral Subtractor (NSS). Hidden
Markov Models and the Projection, for Robust Speech Recognition in Cars. Speech Com-
munications, 11, 215-228, 1992.
8. Ephraim, Y., Malah, D.: Speech Enhancement Using a Minimum Mean Square Error Short
Time Spectral Amplitude Estimator. IEEE Trans. Acoust., Speech, Signal Processing, vol.
ASSP-32, pp. 1109-1121, 1984
9. Ephraim, Y., Malah, D.: Speech enhancement using a minimum mean square log spectral
amplitude estimator. IEEE Trans. Acoust., Speech, Sig.Proc., vol 33, no 2, pp 443-445,
1985
10. Cappè, O.: Elimination of the musical noise phenomenon with the Ephraim and Malah
noise suppressor. IEEE Trans. Speech and Audio Proc., vol. 2, pp. 345 -349, April 1994
11. McAulay, R.J. and Malpass, M.L.: Speech Enhancement Using a Soft-Decision Noise
Suppression Filter. IEEE Trans.on Acoust., Speech and Sig.Proc., vol. ASSP-28, no. 2,
1980
12. Xie, F. and Compernolle, D. V.: Speech enhancement by nonlinear spectral estimation - a
unifying approach. EUROSPEECH'93, 617-620, 1993
13. Virag, N.: Single channel speech enhancement based on masking properties of the human
auditory system. IEEE Trans. Speech Audio Processing, vol. 7, pp. 126–137, March 1999
14. Ephraim, Y. and Van Trees, H.L.: A signal subspace approach for speech enhancement.
IEEE Trans. Speech and Audio Proc., vol. 3, pp. 251-266, July 1995
15. Lev-Ari, H. and Ephraim, Y.: Extension of the signal subspace speech enhancement ap-
proach to colored noise. IEEE Sig. Proc. Let., vol. 10, pp. 104-106, April 2003
16. Y. Ephraim: Statistical-model-based speech enhancement systems. Proc. IEEE, 80(10),
October 1992
17. Ephraim, Y.: A Bayesian Estimation Approach for Speech Enhancement Using Hidden
Markov Models. IEEE Trans. Signal Processing, vol. 40, pp. 725-735, Apr. 1992
18. Lee, K.Y., McLaughlin, S., and Shirai, K.: Speech enhancement based on extended Kal-
man filter and neural predictive hidden Markov model. IEEE Neural Networks for Signal
Processing Workshop, pages 302-10, September 1996
19. Lee, J., Seo, C., Lee, K.Y.: A new nonlinear prediction model based on the recurrent
neural predictive hidden Markov model for speech enhancement. ICASSP '02. vol. 1,
pp.:1037-1040, May 2002
20. Wan, E.A., Nelson, A.T.: Networks for Speech Enhancement. Handbook of Neural Net-
works for Speech Processing, Edited by Shigeru Katagiri, Boston, USA. 1999
21. Dawson, M.I. and Sridharan, S.: Speech enhancement using time delay neural networks,
Proceedings of the Fourth Australian International Conf. on Speech Science and Technol-
ogy, pages 152-5, December 1992
22. Tamura, S.: An analysis of a noise reduction neural network. ICASSP ‘87, pp. 2001-4,
1987
23. Tamura, S.: Improvements to the noise reduction neural network, ICASSP ‘90, vol. 2, pp.
825-8, 1990
24. Knecht, W.G.: Nonlinear Noise Filtering and Beamforming Using the Perceptron and Its
Volterra Approximation. IEEE Trans. On Speech and Audio Proc., vol.2, no.1, part 1,
1994
25. Knecht, W., Schenkel, M., Moschytz, G.S.: Neural Network Filters for Speech Enhance-
ment. IEEE Trans. Speech & Audio Proc., 3(6),433-438, 1995
26. X-M. Gao, S.J. Ovaska, and I.O. Hartimo. Speech signal restoration using an optimal neu-
ral network structure, IJCNN 96, pages 1841-6, 1996
27. Gannot, S., Burshtein, D. and Weinstein, E.: Iterative and Sequential Kalman Filter-Based
Speech Enhancement Algorithms. IEEE Trans. Speech and Audio Proc., vol. 6, pp. 373-
385, 1998
28. Wan, E.A., Nelson, A.T.: Neural dual extended Kalman filtering: applications in speech
enhancement and monaural blind signal separation. Proceedings Neural Networks for Sig-
nal Processing Workshop, 1997
29. Vermaak, J., Andrieu, C., Doucet, A., Godsill, S.J.: Particle Methods for Bayesian Model-
ing and Enhancement of Speech Signals. IEEE Trans. Speech and Audio Processing, vol.
10, pp. 173 -185, Mar. 2002
30. Fong, W., Godsill, S.J., Doucet, A. and West, M.: Monte Carlo smoothing with application
to audio signal enhancement. IEEE Trans. Signal Processing, vol. 50, pp. 438-449, 2002
31. Wan, E. and Van der Merwe, R.: Noise-Regularized Adaptive Filtering for Speech En-
hancement. Proceedings of EUROSPEECH’99, Sep 1999
32. Widrow, B., Glover jr., J. R., McCool, J. M., Kaunitz, J., Williams, C. S., Hearn, R. H.,
Zeidler, J. R., Dong jr., E. and Goodlin, R. C.: Adaptive Noise Cancelling: Principles and
Applications. Proceedings of the IEEE, 63 (12): 1692–1716,1975
33. Clarkson, P.M.: Optimal and Adaptive Signal Processing. CRC Press, Boca Raton, 1993.
34. Toner, E.: Speech Enhancement Using Digital Signal Processing, PhD thesis, University
of Paisley, UK, 1993
35. Darlington, D.J., Campbell, D.R.: Sub-band Adaptive Filtering Applied to Hearing Aids.
Proc.ICSLP’96, pp. 921-924, Philadelphia, USA, 1996
36. Hussain, A., Campbell, D.R.: Intelligibility improvements using binaural diverse sub-
band processing applied to speech corrupted with automobile noise. IEE Proceedings: Vi-
sion, Image and Signal Processing, Vol. 148, no.2, pp.127-132, 2001
37. Hussain, A., Campbell, D.R.: A Multi-Microphone Sub-Band Adaptive Speech Enhance-
ment System Employing Diverse Sub-Band Processing. International Journal of Robotics
& Automation, vol. 15, no. 2, pp. 78-84, 2000
38. Hussain, A., Squartini, S., Piazza, F.: Novel Subband Adaptive Systems Incorporating Wie-
ner Filtering for Binaural Speech Enhancement. NOLISP05, ITRW on Non-Linear Speech
Processing - LNAI 3817, Springer-Verlag, 2005.
39. Cha, I., Kassam, S.A.: Interference Cancellation Using Radial Basis Function Networks.
Signal Processing, vol.47, pp.247-268, 1995
40. Vorobyov, S.A., Cichocki, A.: Hyper Radial Basis Function Neural Networks for Interfer-
ence Cancellation with Nonlinear Processing of Reference Signal. Digital Signal Process-
ing, Academic Press, July 2001, vol. 11, no. 3, pp. 204-221(18)
41. Giannakis, G.B., Dandawate, A.V.: Linear and Non-Linear Adaptive Noise Cancellers.
Proc ICASSP 1990. pp 1373-1376, Albuquerque, 1990
42. Amblard, P., Baudois, D.: Non-linear Noise Cancellation Using Volterra Filters, a Real
Case Study. Nonlinear Digital Signal Processing, IEEE Winter Workshop on, Jan. 17-20,
1993
43. Brandstein, M.S. and Ward, D.B.: Microphone Arrays: Signal Processing Techniques and
Applications. Springer-Verlag, Berlin, 2001
44. Saruwatari, H., Kajita, S., Takeda, K., Itakura, F.: Speech Enhancement Using Nonlinear
Microphone Array Based on Complementary Beamforming, IEICE Trans. Fundamentals,
vol.E82-A, no.8, pp.1501-1510, 1999.
45. Saruwatari, H., Kajita, S., Takeda, K., Itakura, F.: Speech Enhancement Based on Noise
Adaptive Nonlinear Microphone Array, EUSIPCO 2000, X European Signal Processing
Conference, Tampere Finland, 2000
46. Dahl, M. and Claesson, I.: A neural network trained microphone array system for noise re-
duction. IEEE Neural Networks for Signal Processing VI, pages 311-319, 1996
47. Lotter, T., Benien, C., Vary, P.: Multichannel Direction-Independent Speech Enhancement
using Spectral Amplitude Estimation. Eurasip Journal on Applied Signal Processing, 11,
pp. 1147-1156, 2003
48. I. Cohen, Berdugo, B.: Noise Estimation by Minima Controlled Recursive Averaging for
Robust Speech Enhancement, IEEE Signal Processing Letters, vol.9, no.1 pp. 12-15, 2002
49. Cohen, I. and Berdugo, B.: Speech enhancement based on a microphone array and log-
spectral amplitude estimation. Electrical and Electronics Engineers in Israel, the 22nd
Convention, pp. 4.:6, Dec. 2002
50. Shinde, T., Takeda, K., Itakura, F.: Multiple regression of log-spectra for in-car speech
recognition. ICSLP-2002, pp. 797-800, 2002
51. Li, W., Miyajima, C., Nishino, T., Itou, K., Takeda, K., Itakura, F.: Adaptive Nonlinear
Regression using Multiple Distributed Microphones for In-Car Speech Recognition. IEICE
Trans. Fundamentals, vol. E88-A, no. 7, pp. 1716-1723, 2005
52. Parveen, S. and Green, P.D.: Speech enhancement with missing data techniques using re-
current neural networks, Proc. IEEE ICASSP 2004, Montreal, 2004
53. Haykin, S. 2002 Adaptive Filter Theory (4th ed) Prentice Hall Information and System
Science Series, Thomas Kailath Series Editor
54. Kamath, S. and Loizou, P.: A multi-band spectral subtraction method for enhancing
speech corrupted by colored noise, ICASSP 2002
55. Gülzow, T., Ludwig, L. and Heute, U.: Spectral-Substraction Speech Enhancement in
Multirate Systems with and without Non-uniform and Adaptive Bandwidths. Signal Proc-
essing, vol. 83, pp. 1613-1631, 2003
56. Greenwood, D.D.: A Cochlear Frequency-Position Function for Several Species - 29 Years
Later. J. Acoustic Soc. Amer., vol. 86, no. 6, pp. 2592-2605, 1990
57. Abutalebi, H. R., Sheikhzadeh, H., Brennan, R. L., Freeman, G.H.: A Hybrid Sub-Band
System for Speech Enhancement in Diffuse Noise Fields, IEEE Sig. Process. Letters, 2003
58. Le Bouquin, R., Faucon, G.: Study of a Voice Activity Detector and its Influence on a
Noise Reduction System. Speech Communication, vol. 16, pp. 245-254, 1995
59. Bahoura M. and Rouat J., “A new approach for wavelet speech enhancement”, Proc.
EUROSPEECH, pp. 1937-2001, 2001
60. Bahoura M. and Rouat J., “Wavelet speech enhancement based on the Teager Energy Op-
erator,” IEEE Signal Proc. Lett., 8(1), pp. 10-12, 2001
61. Cecchi, S, Bastari, A., Squartini, S. and Piazza, F.: Comparing Performances of Different
Multiband Adaptive Architectures for Noise Reduction. Communications, Circuits and
Systems (ICCCAS), 2006 International Conference of Guilin-China 2006
62. Wolfe, P.J., Godsill, S.J.: "Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement," EURASIP Journal on Applied Signal Processing,
no. 10, pp. 1043–1051, 2003, special issue: Digital Audio for Multimedia Communica-
tions
63. Hussain, A., Campbell, D.R.: “Binaural sub-band adaptive speech enhancement using arti-
ficial neural networks,” Speech Communication, vol.25, pp.177-186, 1998, Special Issue:
Robust Speech Recognition for Unknown Communication Channels
The Amount of Information on Emotional States
Conveyed by the Verbal and Nonverbal Channels: Some
Perceptual Data
Anna Esposito1,2
1 Dipartimento di Psicologia, Seconda Università di Napoli, Via Vivaldi 43, 81100 Caserta, Italy
2 IIASS, Via Pellegrino 19, 84019, Vietri sul Mare, SA, Italy
[email protected],[email protected]
Abstract. In a face-to-face interaction, the addressee exploits both the verbal and
nonverbal communication modes to infer the speaker's emotional state. Is such
informational content redundant? Is the amount of information conveyed by each
communication mode the same, or is it different? How much information about the
speaker's emotional state is conveyed by each mode, and is there a preferential
communication mode for a given emotional state? This work attempts to give an
answer to the above questions by evaluating the subjective perception of emotional
states in the single channels (either visual or auditory) and in the combined channels
(visual and auditory). Results show that vocal expressions convey the same
amount of information as the combined channels and that the video alone conveys
poorer emotional information than the audio and the audio and video together. Interpretations of these results (which seem not to support the data reported in the literature on the dominance of the visual channel in the perception of emotions) are
given in terms of cognitive load, language expertise and dynamicity. Also, a
mathematical model inspired by information processing theory is hypothesized
to support the suggested interpretations.
1 Introduction
It has been and remains difficult to define emotions, even though there have been many attempts to characterize emotional states. The proposed approaches to understanding emotions are centered on finding insights into what emotions are, but the question still remains unanswered. James [44] was the first to posit the question and to indicate the bodily approach as a first coherent approach to explain emotions. According to James, emotions are the feeling of <<…bodily changes that follow directly the perception of the exciting fact…>>, whereas in Darwin's opinion [16] they are patterns of actions deriving from our evolutionary or individual past and may be <<…of the least use…>> in our modern context. Over the past centuries, the social role of emotions has been interpreted in different ways: they have been considered as conflicting aspects of the soul (Plato, 450 B.C. [60]), as essential in the context of Ethics (Aristotle, 384 B.C. [5]), as disturbing passions (Descartes, 1649 [38]), or as playing a central role in human behaviour
[64]. Theories of emotion tend to privilege inner and psychological motivations, but modern anthropologists conceive emotions as socially embedded responses that <<…take on behavioural significance within a field of culturally interpreted person-person and person-situation relations... [80]>>. During the last century, the study of emotions involved different scientific fields, and today the effort among psychologists, neurologists, anthropologists, and moral philosophers is to converge towards a holistic theory of emotion. However, there is still the need to define emotions, and a working definition is proposed below.
A working definition is not "the definition". It is just a way to deal with a phenomenon that is not yet entirely understood, in order to provide an orientation. Such a definition can be changed as soon as new relevant explanations are discovered. On this premise, the central condition of emotions is considered to be the evaluation of an event. A person consciously or unconsciously evaluates an external or internal event that is relevant to him or to some of his concerns or goals. This is what psychologists call the appraisal phase. At the core of this evaluation (or concurrently with it) there is a "readiness to act" and a "prompting of plans" [32-33] that allow the person to handle the evaluated event, producing appropriate mental states accompanied by physiological changes, feelings, and expressions. This process can be schematized in the diagram reported in Figure 1, which is a personal interpretation of the feedback loop proposed by Plutchik [61-62].
The above interpretation assumes that after the evaluation of an event there are several processes that take place concurrently with the readiness of action. They all affect each other, and it is possible that none of them has a primary role. They are all the result of mental processes that take over the current cognitive state to handle the subject's emotional state. This interpretation is in accord with the widely accepted definitions of emotions proposed, among others, by Plutchik [61-62], Frijda [32-33], and Scherer [66, 70], even though it may differ by not assigning a priority to any of them. Our idea is that, since emotions are experienced as a comprehensive manifestation of feelings, expressions, physiological changes, action readiness, and prompting of plans, assigning a subsidiary role to any of these signals would produce a restricted interpretation of their essence as a whole.
According to many authors (see also Oatley and Jenkins [58]), emotions have several functional roles, among which the most important in driving social interaction are:
1. They solicit appropriate reactions to events that happen unexpectedly and
demand immediate action;
2. They act as signs (exhibited through a variety of comportments) made for
the purpose of giving (consciously or unconsciously) notice of a plan of fu-
ture intents;
3. They play a communicative function.
However, the above functions do not completely define the role of emotions and why they should exist. To date, a unique scientific theory of emotions is not yet available. Scientists are debating divergent opinions ranging from evolutionary to socio-cultural theories [12-13, 22, 24, 29-30, 32-33, 40-41, 44, 48-51, 59, 61-62, 70]. It is not among the aims of this work to discuss them, and an interested reader can find very fascinating highlights in the book of Oatley and Jenkins [58]. Our primary aim is on the expressions of emotions, on the perceptual features exploited to recognize them from facial expressions, voice, and body movements, and on trying to establish whether there is a preferential channel for perceiving an emotional state. These are the topics discussed in the following sections.
Facial expressions as markers of emotions were first suggested by Darwin [16]. Darwin approached the study of facial expressions of emotions from a biological point of view, considering them as innate "ancient habits" that can partially be modified by learning and imitation processes. This idea was debated by several authors, among whom the most representative were Klineberg [48], La Barre [50], and Birdwhistell [7], and more recently by White and Fridlund [29-30, 80], according to whom any behaviour, and therefore also facial expressions of emotions, is learned.
Darwin's approach was recovered and reinforced about a century later by Tomkins [77-79], who laid the basis for the research on the instrumental role of facial movements and facial feedback in experiencing emotions. According to Tomkins, emotions are the driving forces of our primary motivational system, acting as amplifiers of what we receive from the environment, and our face is the most important expression of them. This is because facial expressions of emotions are universally shared. People from different cultures can easily recognize a happy or a sad face because of an "innate affective program" for each of what is considered "a primary emotion" that acts on the facial musculature to produce emotional facial expressions. Along this research line, Ekman [21-23] and Izard [39-41] identified a small set of emotions (the primary ones) associated with a unique configuration of facial muscle movements, and both developed anatomically based coding systems (the Facial Action Coding System (FACS) [20, 25-26], the Maximally Discriminative Facial Movement Coding System (MAX) and AFFEX [42-43]) for measuring facial behaviours and identifying emotional expressions. The three systems differ both in the primary emotions they include (only six for Ekman - happiness, sadness, anger, fear, surprise, and disgust; nine for Izard) and in the expressions they assign to some of them. Sophisticated instruments, such as facial EMG (electromyography), were used [12-13, 72-74] to assess the above facial coding systems when emotional information was not visually perceptible in facial actions. Nevertheless, such studies proved that the distinction among primary emotions, or more generally between negative and positive emotions, was not possible using either visible or EMG information on facial expressions.
To date, the role of facial expressions in identifying and recognizing emotions is debated between two theories:
1. The "micro-expression hypothesis", which assumes that emotion-specific facial expressions do exist even when they are not visually perceptible and that they emerge as visible expressions as the emotional intensity increases [72-74]. This means that an innate program that controls our facial musculature does exist and changes our face when we are emotionally involved;
2. The "ethological perspective", which assumes that facial expressions are not directly related to emotional states, being signals associated with the communicative context [29-30]. The essence of this theory is that "we make faces". They are contextual to the moment and the particular situation. Facial expressions may have nothing to do with our emotional states.
some derived measures on F0 for pitch; and speaking rate, utterance and syllable lengths, and empty and filled pauses for timing. The search for acoustic features that describe emotional states in speech is performed through long-term or supra-segmental measurements, since emotions are expected to last longer than single speech segments, and therefore sophisticated acoustic measurements have been proposed to detect vocal expressions of emotions. Apart from the F0 frequency [45, 56], these measurements include the spectral tilt [45], the vocal envelope [71] - defined as the time interval for a signal to reach the maximum amplitude and decay to zero amplitude - the Long Term Average Spectrum (LTAS), the energy in different frequency bands, some inverse filtering measurements [47, 57], the F0 contour, the F0 maximum and minimum excursion, and the F0 jitter - random fluctuations in F0 values [2, 28, 63]. Exploiting this research line, the cited authors identified some acoustic correlates of vocal expressions for some primary emotions (see [63, 67]). However, these studies have shown that the perceptual and acoustic cues to infer emotional states from speech may be as complex as those based on physiological variables and facial expressions, and that the acoustic and temporal features that can be used to unequivocally describe emotions have not yet been identified, because of the many sources of variability associated with the speech signal.
Gesturing is what people unconsciously do when approaching each other, and understanding gestures is crucial to understanding social interactions. The Oxford Dictionary of English defines a gesture as "a movement of the body, or any part of it, that is considered as expressive of thought or feeling". This definition incorporates a very large set of body expressions and movements, among these facial expressions but also face directions, hand movements, eye contact, posture, proxemics (the study of varying patterns of physical proximity), touch, dressing, and so on. Therefore, excluding facial expressions, much has been neglected about emotional gestures. Three fundamental questions remain open in this field:
- What can gestures tell about human emotions?
- Which gestures are exhibited for a particular emotion?
- Do humans make use of such non verbal cues to infer a person's emotional state, and how?
At present there is limited knowledge about the extent to which body movements and gestures provide reliable cues to emotional states, even though several authors have suggested that the use of body motion in interaction is part of the social system and therefore body movements not only have potential meaning in communicative contexts but can also affect the interlocutor's behavior [3, 7, 9-10, 14, 36, 46, 52-55]. However, these studies were often concerned with gestures (in particular hand gestures) used jointly with verbal communication, and therefore show intimate relationships between the speaker's speech and actions and the listener's nonverbal behaviors [36, 46, 52]. These gestures may have nothing to do with emotions since, according to Ekman and Friesen [20, 25-27], affective displays are independent of language. However, there is subjective evidence that variations in body movements (e.g. gait, body posture) and gestures (e.g. hand position) convey important information about people's emotions. To this aim, the author performed an informal perceptual experiment showing to about 50 people participating at a
3.1 Materials
The collected data are based on extracts from Italian movies whose protagonists were carefully chosen among actors and actresses who are widely acknowledged by critics and considered capable of giving very realistic and careful interpretations. The final database consists of audio and video stimuli representing 6 basic emotional states: happiness, sarcasm/irony, fear, anger, surprise, and sadness. We introduced sarcasm/irony to substitute for the emotional state of disgust, since after one year of movie analysis only 1 video clip was identified for this emotion.
For each of the above listed emotional states, 10 stimuli were identified, 5 expressed by an actor and 5 expressed by an actress, for a total of 60 audio and video stimuli. The actors and actresses were different for each of the 5 stimuli, to avoid bias due to their individual ability to portray emotional states. The stimuli were selected to be short in duration (the average stimulus length was 3.5 s, SD = ±1 s). This was due to two reasons: 1) longer stimuli may produce overlapping of emotional states and confuse the subject's perception; 2) emotional states by definition cannot last more than a few seconds, after which other emotional states or moods take place in the interaction [58]. Consequently, longer stimuli do not increase the recognition reliability, and in some cases they can create confusion, making the identification of emotions difficult, since in a 20-second-long video clip the protagonist may be able to express more than one, and sometimes very complex, emotions.
Care was taken in choosing video clips where the protagonist's face and the upper part of the body were clearly visible. Care was also taken in choosing the stimuli such that the semantic meaning of the sentences expressed by the protagonists did not clearly express the portrayed emotional state and that its intensity level was moderate. For example, we avoided including sadness stimuli where the actress/actor was clearly crying, or happiness stimuli where the protagonist was laughing out loud. This was because we wanted the subjects to exploit emotional signs that could be less obvious but that are generally employed in natural, non-extreme emotional interactions. From each complete stimulus - audio and video - we extracted the audio and the video alone, coming up with a total of 180 stimuli (60 stimuli audio only, 60 video only, and 60 audio and video).
The emotional labels were assigned to the stimuli first by two expert judges and then by three naïve judges, independently. The expert judges made their decisions on the stimuli carefully exploiting emotional information on facial and vocal expressions described in the literature [4, 20-21, 25-26, 42-43, 66-68] and also knowing the contextual situation the protagonist was interpreting. The naïve judges made their decisions after watching the stimuli several times. There were no opinion exchanges between the expert and naïve judges; nevertheless, the final agreement on the labelling between the two groups was 100%. The stimuli in each set were then randomized and proposed to the subjects participating in the experiments.
3.2 Participants
3.3 Results
Table 1 reports the confusion matrix for the audio and video condition. The numbers are percentages computed over the number of the subjects' correct answers to each stimulus and averaged over the number of the expected correct answers for each emotional state.
The data displayed in Table 1 show that, in the audio and video condition, the highest percentage of correct answers - 75% - was for sadness (in 14% of the cases confused with irony), followed by 64% of correct answers for irony (in 12% of the cases confused with surprise). Anger reached 60% (confusion was made for 11%
Table 1. Confusion matrix for the audio and video condition. The numbers are percentages
computed considering the number of subject’s correct answers over the total number of ex-
pected correct answers (300) for each emotional state.
Audio and Video   Sad   Ironic   Happy   Fear   Anger   Surprise   Others   Neutral
Sad                75     14       0       2      2        0         4        3
Ironic              2     64       5       3      7       12         3        4
Happy               1     29      50       3      0       11         0        6
Fear               19      2       2      48      4       15         7        3
Anger               5     10       1      11     60        0         8        5
Surprise            0      8       5      14      2       59         0       12
Table 2. Confusion matrix for the video condition. The numbers are percentages computed
considering the number of subject’s correct answers over the total number of expected correct
answers (300) for each emotional state.
of the cases with fear, and 10% with irony), and surprise reached 59% of correct answers (confusion was made in 14% of the cases with fear, and in 12% with a neutral expression). In all cases the percentage of correct recognition was largely above chance. Surprisingly, happiness and fear were not very well recognized. Happiness had 50% of correct answers but was significantly confused with irony in almost 30% of the cases and partially confused with surprise in 11% of the cases. The percentage of correct answers for fear was 48%, and it was confused with sadness and surprise in almost 20% and 15% of the cases respectively.
Table 2 reports the confusion matrix for the video condition. The numbers are percentages computed over the number of the subjects' correct answers to each stimulus and averaged over the number of the expected correct answers for each emotional state.
The data displayed in Table 2 show that, in the video condition, the highest percentage of correct answers - 68% - was for anger (the highest confusion, 9%, was with fear), followed by 61% of correct answers for happiness (significantly confused - 24% - with irony). Fear reached 59% of correct answers (confusion was made in 14% of the cases with anger). Sadness and irony both obtained 49% of correct answers; the highest confusion was with neutral expressions, at 13% and 14% for sadness and irony respectively. Surprise was not well recognized, getting only 37% of correct answers; the confusion was spread out among anger (13%), fear (11%), and neutral (18%).
Table 3. Confusion matrix for the audio condition. The numbers are percentages computed
considering the number of subject’s correct answers over the total number of expected correct
answers (300) for each emotional state.
Table 3 reports the confusion matrix for the audio condition. Again, the numbers are percentages computed over the number of the subjects' correct answers to each stimulus and averaged over the number of the expected correct answers for each emotional state.
The data displayed in Table 3 show that, in the audio condition, the highest percentage of correct answers - 77% - was for anger (the highest confusion, 10%, was with fear), followed by 75% of correct answers for irony (confused in 9% of the cases with sadness). Sadness obtained 67% of correct answers and in 13% of the cases was confused with a neutral expression. Fear got 63% (confusion was made in 12% of the cases with sadness), and happiness reached 48% of correct answers. Happiness was significantly confused with irony in 25% of the cases. Surprise was also not well recognized in the audio condition, getting only 37% of correct answers; it was significantly confused with irony and fear in 21% and 19% of the cases respectively.
The emotional state best identified in all three conditions was anger (60% in the audio and video, 68% in the video, and 77% in the audio). Sadness and irony were easily recognized in the audio and video (75% and 64% respectively) and in the audio (67% and 75% respectively) conditions but hardly recognized in the video alone (49% and 49% respectively).
Happiness was better identified in the video alone (61%) than in the audio (48%) and in the audio and video (50%) conditions. Fear was better perceived in the audio (63%) and video (59%) than in the audio and video condition (49%), whereas surprise was more easily identified in the audio and video (59%) than in the audio (37%) and video (37%) alone.
A comparative display of the data discussed in Tables 1, 2, and 3 is reported in Figure 2. Figure 2 shows the number of correct answers for each emotional state (sadness in Figure 2a, irony in Figure 2b, happiness in Figure 2c, fear in Figure 2d, anger in Figure 2e, and surprise in Figure 2f) and for the audio, video, and audio and video conditions. The x-axis reports the emotional labels and the y-axis the number of correct answers obtained for each emotional state.
An ANOVA was performed on the above data considering the condition (audio alone, video alone, and audio and video) as a between-subject variable and the emotional state as a six-level (a level for each emotional state) within-subject variable. The analysis showed that condition plays a significant role in the perception of emotional states (F(2, 12) = 7.701, p = .007) and that this did not depend on the different
Fig. 2. Histograms of the number of correct answers given by the subjects participating in the perceptual experiments for the six emotional states under examination and the three different perceptual conditions (gray for audio, white for video, and black for audio and video)
emotional states (F(5, 60) = .938, p = .46), since no interaction was found between the two variables (F(10, 60) = 1.761, p = .08). Moreover, female stimuli were more affected by condition (F(2, 12) = 4.951, p = .02) than male stimuli (F(2, 12) = 1.141, p = .35), even though gender was not significant (F(1, 12) = .001, p = .98) and no interaction was found between condition and gender (F(2, 12) = .302, p = .74). In detail, we found that there was not a significant difference in the perception of the emotional states between the audio alone and the audio and video condition (F(1, 8) = .004, p = .95). Significant differences were found between the video alone and the audio and video condition (F(1, 8) = 10.414, p = .01), in particular for sadness (F(1, 8) = 5.548, p = .04), and between the video and the audio alone (F(1, 8) = 13.858, p = .005). In this latter case, significant differences were found for sadness (F(1, 8) = 8.941, p = .01), fear (F(1, 8) = 7.342, p = .02), and anger (F(1, 8) = 9.737, p = .01).
A methodological concern that could be raised at this point is whether the ability of the actors and/or actresses in expressing emotions through vocal expressions may have affected the results, making the audio stimuli more emotionally expressive than the video. However, if this was the case, it would be necessary to explain why, in the combined audio and video condition, subjects did not exploit the same vocal cues they perceived in the audio alone to infer the protagonist's emotional state. We will return to this argument in the final discussion. Nonetheless, further research is necessary to evaluate the above possibility, and to determine whether the present results generalize to other geographically and socially homogeneous groups of subjects.
4 Discussion
The statistical data discussed above and displayed in Tables 1, 2, and 3, and in Figure 2, show a really unexpected trend and lay the basis for new interpretations of the amount of perceptual information that is captured in the three different conditions. According to common sense, it should be expected that subjects would provide the highest percentage of correct answers for all the emotional states in the audio and video condition, since it can be assumed that the combined channels contain more, or otherwise redundant, information than the audio and video alone. However, the experimental data did not support this trend. The subjects perform similarly, or at least there is not a significant difference in perceiving emotional states, when exploiting only the audio or the audio and video together, whereas they are less effective in recognizing emotional states through the video alone. In particular, in the video alone, the percentage of correct answers for sadness is significantly lower than those obtained in the audio alone and in the audio and video together, suggesting that the visual channel alone conveys less information on the perceptual cues of sadness. Moreover, in the video alone, the percentage of correct answers for fear and anger is significantly lower than that obtained in the audio alone, attributing to the visual channel also a lower effectiveness in communicating perceptual cues related to these two emotional states. Furthermore, there is not
a significant difference between the audio alone and the audio and video together
on the subjects' perception of the emotional states under consideration. Therefore, it appears that the vocal and the visual communication modes convey a different amount of information about the speaker's emotional state. The vocal or audio channel can be considered preferential, since it conveys the same information as the combined audio and video channels. In some cases it also seems to be able to resolve ambiguities produced by combining information derived from both channels. For example, for anger, we had a stimulus that was correctly identified by 18 subjects in the audio alone, whereas it was confused with fear by almost all the subjects in the audio and video condition. Even though in a non-statistically-significant manner, audio beats audio and video for anger, irony and fear, whereas happiness appears to be more effectively transmitted by the video alone than by the audio and by the audio and video together. Surprise seems to be the only emotional state that requires information from both the audio and video channels to be clearly perceived. We can conclude that the audio channel is preferential for most of the emotional states under examination and that the combined audio and visual stimuli do not help in improving the perception of emotional states, as would have been expected.
These results first of all suggest a nonlinear processing of emotional cues. In
perceiving emotions, expressive information does not simply add up with the number
of emotional cues provided (this is true even in the case of surprise, since the
percentage of correct answers in the combined audio and video channels is not the
sum of those obtained in the audio and video alone). Instead, as the number of
emotional cues increases (assuming that it must increase when the video and the
audio are combined), the subject's cognitive load increases, since she/he must
concurrently process gestures, facial and vocal expressions, dynamic information
(because the video and audio information evolves over time), and the semantics of
the message. These concurrent elaborations divert the subject's attention from the
perceived emotion, leading to a perception that may go beyond the required task:
subjects are more inclined to identify the contextual situation of the protagonist (and
identifying the contextual situation may be harder than identifying the emotional
state, given the shortness of the stimuli) than the expressed emotional state. This can
be seen from the number of alternative labels given in each task: the fewer contextual
cues subjects have, the fewer alternative labels they give. In fact, in the audio alone
the total number of alternative labels was 13, against 64 for the video alone and 49
for the video and audio together.
The explanation provided above is not in disagreement with reported data [12-13, 21,
25-26, 37, 39, 42-43, 72-74] demonstrating that facial expressions are more powerful
than vocal expressions in eliciting emotional states, since those data are based on still
photographs. As we already said in a previous section, still images usually capture the
apex of the expression, i.e. the instant at which the indicators of emotion are most
marked. In this case, the subject's cognitive load is reduced, since only static and
highly marked emotional information must be processed. These highly marked
emotional indicators reproduced in the photos act like universally shared emblems,
owing to the possible existence of an innate motor program [12-13, 72-74] that
controls our facial musculature and changes our faces when we are emotionally
involved. Their meaning is also universally shared, which explains why we make
faces even when we are not emotionally but only contextually involved in emotional
situations.
When dynamics are involved, audio is better. This is because audio is intrinsically
dynamic and we are used to processing it dynamically. Moreover, listeners are
accustomed to dynamically changing vocal emotional cues. Therefore, the cognitive
load is reduced and they perform better. We must add here a further speculation. Our
idea is that audio is better for native speakers of the language, since natives rely on
paralinguistic and suprasegmental information that is strictly related and unique to
their own language. They learned this information by themselves when they were
emotionally involved and observed the changes in their vocal expressions. They may
have learned how a given word is produced in a particular emotional state, even
though the word is not semantically related to the emotional state. Consequently,
they may be able to capture very small suprasegmental and paralinguistic cues that
are not noticeable to a non-native speaker. This is not to say that vocal expressions of
emotions are learned.
We are not making this point, since we do not have data on that. The uncontrolled
changes in the vocal motor program when we are emotionally involved may be as
innate as those in our facial musculature. However, changes in intonation and other
paralinguistic information that derive from changes in the vocal motor control may
have been learned together with the language. Hence, native speakers may be able to
pick up more emotional information from the audio than from the video when the
emotions are expressed in their own language. Nonetheless, they may rely more on
visual than on vocal information when they are asked to identify emotional states in a
foreign language. This is because visual cues may share more commonalities, being
more universal than vocal cues, since the visual processing of inputs from the outside
world exploits neural processes common to all human beings, whereas languages
require specific expertise or training to be processed.
In light of the above considerations, we may expect that, using a foreign language,
the percentage of correct answers will not vary significantly between the video alone
and the combined audio and video channels, and that significant differences will be
reported for the audio alone. We are currently testing this hypothesis (and we already
have some preliminary results) through a set of perceptual experiments similar to
those reported above, with the difference that in this case the protagonists of our clips
speak a foreign language, and therefore in the audio alone condition the participants
cannot rely on their language expertise. To this aim, we checked that the subjects
participating in the experiments did not know the language.
Dynamicity and language expertise can also explain why, for surprise, subjects
performed worse in the audio alone and in the video alone than in the combined
channels. Surprise is an almost instantaneous state, much like a still image. The short
duration of this emotional state allows the listener to exploit both the visual and the
auditory information without increasing the cognitive load and without diverting
her/his attention from the perceived emotion.
The discussion on language expertise, however, raises a more fundamental question
about how much of the expression of emotions is universally shared. Data reported in
the literature support a universal interpretation of emotional facial expressions
[12-13, 21-23, 39, 72-74], but these data were based, as we already pointed out, on
still images. In our case the question is: how many of the dynamic features of
emotional faces are universal? We have already speculated that the perception of
emotional vocal expressions can strictly depend on language expertise and that, in
such a case, visual cues are more universally recognized than vocal cues. However,
we have not considered that dynamic emotional visual stimuli can also be culture
specific. This means that we may expect subjects to perform better on culturally close
visual emotional cues than on culturally distant ones. The experiments proposed
above could also answer this question and quantify how much dynamic emotional
information we are able to capture in a foreign face-to-face interaction. They should
also be able to identify how much is universal and how much is culture specific in the
expression of emotions.
The above results are also interesting from an information processing point of view.
Information processing theory is a branch of communication theory that attempts to
quantify how well signals encode meaningful information and how well systems
process the received information. In this context, entropy is defined as a measure of
the "self-information" or "uncertainty" transmitted by a given signal source, and in
our case both the audio and the video channels can be regarded as such sources.
Mathematically, the entropy is described by the equation:
H(X) = -\sum_{n=1}^{N} p(X_n) \log p(X_n)                                    (1)
where X_n, n = 1, ..., N, are the symbols that can be emitted by the source and p(X_n)
is the probability of a given symbol being emitted. In the Shannon view, the entropy is
a measure of the complexity of, or the information contained in, the set of symbols,
and its value is highly affected by the symbols' probability distribution. The entropy
model can be extended to more than one source by defining the so-called joint
entropy, which, for two sources, is mathematically described by the equation:
H(X, Y) = -\sum_{n=1}^{N} \sum_{m=1}^{M} p(X_n, Y_m) \log p(X_n, Y_m)        (2)
where X_n, n = 1, ..., N, and Y_m, m = 1, ..., M, are the symbols that can be emitted
by the sources X and Y, and p(X_n, Y_m) is their joint probability distribution.
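As a purely illustrative aid (not part of the experiments reported in this chapter), the
following Python sketch evaluates Eqs. (1) and (2) on small, invented probability
tables; the values are hypothetical and serve only to show that the joint entropy of
two sources is at least as large as that of either source alone.

```python
import numpy as np

def entropy(p):
    """Entropy H(X) of Eq. (1), in bits, for a probability vector."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -np.sum(p * np.log2(p))

def joint_entropy(p_xy):
    """Joint entropy H(X, Y) of Eq. (2) from a joint probability table p(X_n, Y_m)."""
    return entropy(np.asarray(p_xy, dtype=float).ravel())

# Hypothetical two-symbol sources (numbers invented for illustration only).
p_x = [0.7, 0.3]                      # marginal distribution of source X
p_xy = [[0.5, 0.2],                   # joint distribution of sources X and Y
        [0.1, 0.2]]

print(f"H(X)   = {entropy(p_x):.3f} bits")         # ~0.881 bits
print(f"H(X,Y) = {joint_entropy(p_xy):.3f} bits")  # ~1.761 bits, >= H(X)
```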
Theoretically, it can be proved that the joint entropy of two different sources is at
least as large as that of each single source, and therefore it can be expected that, by
increasing the number of sources, the amount of information transmitted over a
communication channel will consequently increase [75]. The theoretical definition of
entropy, however, does not take into account the informative value of the source, i.e.
the meaning the signal can convey and what the receiver assumes to be informative
about the received signal. Consequently, the entropy measure does not consider cases
where the source may be only noise, and/or cases where extraneous information can
be exploited by the receiver. In our context, entropy cannot therefore be considered
an objective measure of the information transmitted by the audio and video channels.
A more adequate measure could be the mutual information, which takes into account
the effects of the transmission channel on the received signal. Mutual information is a
measure of the dependence between the source and the receiver, and it quantifies the
reduction of the "uncertainty" about the information transmitted by the source.
Mutual information is mathematically expressed by the equation:
I(X, Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} p(X_n, Z_k) \log \frac{p(X_n, Z_k)}{p(X_n)\, p(Z_k)}        (3)
Mutual information could be an appropriate mathematical model to describe our
results since, being always less than or equal to the entropy of the source, it can
explain why the information about the emotional states does not increase when the
audio and video channels are combined. However, mutual information requires
complete knowledge of how the receiver filters (or transforms) the information
emitted by the source. In our case this knowledge is entirely lacking, since to date we
do not have a clear idea of how our brain processes the information received through
the visual and auditory channels. Moreover, mutual information cannot explain our
hypothesized differences in perception between native and non-native speakers and
between close and distant cultural backgrounds.
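To make the quantity in Eq. (3) concrete, the sketch below estimates the mutual
information between a hypothetical emotional source X and a receiver's responses Z
from an invented joint probability table; it only illustrates that I(X, Z) never exceeds
the entropy of the source.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability vector (Eq. (1))."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xz):
    """Mutual information I(X, Z) of Eq. (3), in bits, from a joint table p(X_n, Z_k)."""
    p_xz = np.asarray(p_xz, dtype=float)
    p_x = p_xz.sum(axis=1, keepdims=True)     # marginal p(X_n)
    p_z = p_xz.sum(axis=0, keepdims=True)     # marginal p(Z_k)
    indep = p_x @ p_z                         # product distribution p(X_n) p(Z_k)
    mask = p_xz > 0
    return np.sum(p_xz[mask] * np.log2(p_xz[mask] / indep[mask]))

# Hypothetical joint distribution between an emotional source X and the
# receiver's responses Z (numbers invented for illustration only).
p_xz = np.array([[0.30, 0.10],
                 [0.05, 0.55]])

print(f"I(X;Z) = {mutual_information(p_xz):.3f} bits")    # ~0.361 bits
print(f"H(X)   = {entropy(p_xz.sum(axis=1)):.3f} bits")   # ~0.971 bits, I(X;Z) <= H(X)
```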
A more adequate model could be the information transfer ratio proposed by
Sinanović and Johnson [76], which quantifies how the receiver affects the received
information by measuring the distance between (the probability distributions of) the
actions performed by the receiver for given received inputs and (the probability
distributions of) the received inputs. Assuming that the emotional information
encoded through the audio and the video represents the input to a processing system,
the information transfer ratio can be quantified as the ratio of the distance between
the outputs of the processing system (for those sources) to the distance between the
information transmitted by the sources. The information transfer ratio is described by
the following equation:
\gamma_{XY,Z}(\alpha_0, \alpha_1) \equiv \frac{d_Z(\alpha_0, \alpha_1)}{d_X(\alpha_0, \alpha_1)} + \frac{d_Z(\alpha_0, \alpha_1)}{d_Y(\alpha_0, \alpha_1)}        (4)
where α_1 indicates the changes in the emotional content of the sources with respect
to a reference point α_0, and the ratio depends not on the signals themselves but on
their probability functions p_X(x; α_0), p_X(x; α_1), p_Y(y; α_0), p_Y(y; α_1), and
p_Z(z; α_0), p_Z(z; α_1). X and Y are the sources (audio and video), Z is the
processing system, and d(·) indicates an information-theoretic distance (for example,
the Kullback-Leibler distance [1]) that obeys the Data Processing Theorem [35],
which roughly states that the output of any processing system cannot contain more
information than the processed signal.
The information transfer ratio quantifies the ability of the receiver to process the
received information and, in the case of multiple sources, predicts that the overall
information transfer ratio may be smaller than the individual ratios for each input
signal. This is due to the arbitrary filtering action performed by the receiver (the
human being), whose transfer function is in our case unknown, even though its
ability to capture emotional information can be inferred from the number of correct
answers given by the subjects. In simple words, in the complex world of emotion
perception the whole, in some cases, seems not to be more than the sum of its parts.
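As a rough numerical sketch of how Eq. (4) could be evaluated (all probability
distributions below are hypothetical and chosen only for illustration), one can
compute the Kullback-Leibler distances between two emotional conditions for each
source and for the receiver's responses, and then form the single-source and
combined transfer ratios:

```python
import numpy as np

def kl_distance(p, q):
    """Kullback-Leibler distance D(p || q), in bits, used as the d(.) of Eq. (4)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Hypothetical distributions of the audio source (X), the video source (Y), and the
# receiver's responses (Z) under a reference condition alpha_0 and a changed
# condition alpha_1 (all numbers invented for illustration only).
p_x0, p_x1 = [0.70, 0.20, 0.10], [0.20, 0.50, 0.30]
p_y0, p_y1 = [0.50, 0.30, 0.20], [0.30, 0.40, 0.30]
p_z0, p_z1 = [0.55, 0.30, 0.15], [0.45, 0.35, 0.20]

d_x = kl_distance(p_x0, p_x1)         # distance measured at the audio source
d_y = kl_distance(p_y0, p_y1)         # distance measured at the video source
d_z = kl_distance(p_z0, p_z1)         # distance measured at the receiver's output

gamma_xz = d_z / d_x                  # single-source transfer ratio for audio
gamma_yz = d_z / d_y                  # single-source transfer ratio for video
gamma_xy_z = d_z / d_x + d_z / d_y    # combined quantity of Eq. (4)

print(f"gamma(X,Z)  = {gamma_xz:.3f}")
print(f"gamma(Y,Z)  = {gamma_yz:.3f}")
print(f"gamma(XY,Z) = {gamma_xy_z:.3f}")
```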
5 Conclusions
The present paper reports on perceptual data showing that native speakers rely more
on the auditory than on the visual channel to infer information about emotional states,
and that the amount of emotional information conveyed does not increase when
auditory and visual information are combined. An attempt is made to explain this
phenomenon in terms of the subject's cognitive load and of information processing
theory.
Acknowledgements
This research has been partially supported by the project i-LEARN: IT’s Enabled
Intelligent and Ubiquitous Access to Educational Opportunities for Blind Student, at
Wright State University, Dayton, Ohio, USA. Acknowledgment goes to Tina Marcella
Nappi for her editorial help. Acknowledgment also goes to Professors Maria Marinaro
and Olimpia Matarazzo for their useful comments and suggestions. In particular,
Olimpia has greatly helped in the selection of stimuli. The students Gianluca Baldi,
Eliana La Francesca, and Daniela Meluccio are greatly acknowledged for running the
perceptual tests.
References
1. Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution
from another. Journal of the Royal Statistical Society, 28, 131-142 (1966)
2. Apple, W., Hecht, K.: Speaking emotionally: The relation between verbal and vocal com-
munication of affect. Journal of Personality and Social Psychology, 42, 864–875 (1982)
3. Argyle, M.: Bodily Communication. New York: Methuen & Co (1988)
4. Banse, R., Scherer, K.: Acoustic profiles in vocal emotion expression. Journal of Personal-
ity & Social Psychology 70(3), 614-636 (1996)
5. Barnes, J. (ed.): The complete works of Aristotle. The revised Oxford translation in 2 volumes,
Princeton University Press (1984)
6. Bachorowski, J.A.: Vocal expression and perception of emotion. Current Directions in
Psychological Science, 8, 53–57 (1999)
7. Birdwhistell, R.: Kinesics and context. Philadelphia, University of Pennsylvania Press
(1970)
8. Breitenstein, C., Van Lancker, D., Daum, I.: The contribution of speech rate and pitch
variation to the perception of vocal emotions in a German and an American sample.
Cognition & Emotion, 15, 57–79 (2001)
9. Burgoon, J.K.: Nonverbal signals. In Knapp, J.L., Miller, G.R. (eds.), Handbook of inter-
personal communication, Thousand Oaks, CA: Sage, 229-285 (1994)
10. Burleson, B.R.: Comforting messages: Features, functions, and outcomes. In Daly, J.A.,
Wiemann, J.M. (eds.), Strategic interpersonal communication, Hillsdale, NJ: Erlbaum,
135-161 (1994)
11. Burns, K. L., Beier, E. G.: Significance of vocal and visual channels in the decoding of
emotional meaning. Journal of Communication, 23, 118–130 (1973)
12. Cacioppo, J.T., Klein, D.J., Berntson, G.C., Hatfield, E.: The psychophysiology of emo-
tion. In Haviland, M., Lewis, J.M. (eds) Handbook of Emotion, New York: Guilford
Press, 119-142 (1993)
13. Cacioppo, J.T., Bush, L.K., Tassinary, L.G.: Microexpressive facial actions as a func-
tion of affective stimuli: Replication and extension. Personality and Social Psychology
Bulletin, 18, 515-526 (1992)
14. Corraze, G.: Les communications nonverbales. Presses Universitaires de France, Paris
(1980)
15. Cosmides, L.: Invariances in the acoustic expressions of emotions during speech. Journal
of Experimental Psychology: Human Perception and Performance, 9, 864-881 (1983)
16. Darwin, C.: The expression of the emotions in man and animals (1872). Reproduced
by the University of Chicago Press, Chicago (1965)
17. Davitz, J. R.: Auditory correlates of vocal expression of emotional feeling. In Davitz J. R.
(ed.), The communication of emotional meaning, New York: McGraw Hill, 101-112
(1964)
18. Davitz, J.: The communication of emotional meaning. McGraw-Hill (1964)
19. Davitz, J. R.: Auditory correlates of vocal expression of emotional feeling. In Davitz J. R.
(ed.), The communication of emotional meaning, New York: McGraw Hill, 101-112
(1964)
20. Ekman, P., Friesen, W.V., Hager, J.C.: The facial action coding system. Second edition.
Salt Lake City: Research Nexus eBook; London: Weidenfeld & Nicolson (2002)
21. Ekman, P.: Facial expression of emotion: New findings, new questions. Psychological Sci-
ence, 3, 34-38 (1992)
22. Ekman, P.: An argument for basic emotions. Cognition and Emotion, 6, 169-200 (1992)
23. Ekman, P.: The argument and evidence about universals in facial expressions of emotion.
In Wagner, H., Manstead, A. (eds.), Handbook of social psychophysiology, Chichester:
Wiley, 143-164 (1989)
24. Ekman, P.: Expression and the nature of emotion. In Scherer, K., Ekman, P. (eds), Ap-
proaches to emotion, Hillsdale, N.J.: Lawrence Erlbaum, 319-343 (1984)
25. Ekman, P., Friesen, W.V.: Facial action coding system: A technique for the measurement
of facial movement. Palo Alto, Calif.: Consulting Psychologists Press (1978)
26. Ekman, P., Friesen, W.V.: Manual for the Facial Action Coding System, Palo Alto: Con-
sulting Psychologists Press (1977)
27. Ekman, P., Friesen, W.V.: Head and body cues in the judgement of emotion: A reformula-
tion. Perceptual Motor Skills 24: 711-724 (1967)
28. Frick, R.: Communicating emotions: The role of prosodic features. Psychological Bulletin,
93, 412-429 (1985)
29. Fridlund, A.J.: The new ethology of human facial expressions. In Russell, J.A., Fernandez-
Dols, J. (eds.), The psychology of facial expression, Cambridge: Cambridge University
Press, 103-129 (1997)
30. Fridlund, A.J.: Human facial expressions: An evolutionary view. San Diego, CA: Aca-
demic Press (1994)
31. Friend, M.: Developmental changes in sensitivity to vocal paralanguage. Developmental
Science, 3, 148–162 (2000)
32. Frijda, N.H.: Moods, emotion episodes, and emotions. In Haviland, M., Lewis, J. M. (eds)
Handbook of Emotion, New York: Guilford Press, 381-402 (1993)
33. Frijda, N.H.: The emotions, Cambridge University press (1986)
34. Fulcher, J.A.: Vocal affect expression as an indicator of affective response. Behavior Re-
search Methods, Instruments, & Computers, 23, 306–313 (1991)
35. Gallager, R.G.: Information theory and reliable communication. John Wiley & Son (1968)
36. Goldin-Meadow, S.: Hearing gesture: How our hands help us think. The Belknap Press at
Harvard University Press (2003)
37. Graham, J., Ricci-Bitti, P.E., Argyle, M.: A cross-cultural study of the communication of
emotion by facial and gestural cues. Journal of Human Movement Studies, 1, 68-77 (1975)
38. Haldane, E.L., Ross, G.R.: The philosophical works of Descartes. New York: Dover, cur-
rent edition (1911)
39. Izard, C.E.: Innate and universal facial expressions: Evidence from developmental and
cross-cultural research. Psychological Bulletin, 115, 288–299 (1994)
40. Izard, C.E.: Organizational and motivational functions of discrete emotions. In Lewis M.,
Haviland J. M (eds.), Handbook of emotions New York: Guilford Press, 631–641 (1993)
41. Izard, C.E.: Basic emotions, relations among emotions, and emotion–cognition relations.
Psychological Review, 99, 561–565 (1992)
42. Izard, C.E., Dougherty, L.M., Hembree, E.A.: A system for identifying affect expressions
by holistic judgments. Unpublished manuscript. Available from Instructional Resource
Center, University of Delaware (1983)
43. Izard, C.E.: The maximally discriminative facial movement coding system (MAX). Un-
published manuscript. Available from Instructional Resource Center, University of Dela-
ware (1979)
44. James, W.: What is an Emotion? First published in Mind, 9, 188-205 (1884). See
http://psychclassics.asu.edu/James/emotion.htm
45. Junqua, J.C.: The Lombard reflex and its role on human listeners and automatic speech
recognizers. JASA, 93(1), 510-524 (1993)
46. Kendon, A.: Gesture: Visible action as utterance. Cambridge University Press (2004)
47. Klasmeyer, G., Sendlmeier, W.F.: Objective voice parameters to characterize the emo-
tional content in speech. In Proceedings of ICPhS 1995, Elenius, K., Branderudf, P. (eds),
1, 182-185, Arne Strömbergs Grafiska (1995)
48. Klineberg, O.: Emotional expression in Chinese literature. Journal of Abnormal and Social
Psychology, 33, 517-520 (1938)
49. Klinnert, M.D., Campos, J.J., Sorce, J.F., Emde, R.N., Svejda, M.: Emotions as behaviour
regulators: Social referencing in infancy. In Plutchik, R., Kellerman, H. (eds.), Emotion:
Theory, research, and experience, 57-86, New York: Academic Press (1983)
50. La Barre, W.: The cultural basis of emotions and gestures. Journal of Personality, 16, 49-
68 (1947)
51. Lumsden, C.J., Wilson, E.O.: Genes, mind, and culture: The coevolutionary process. Har-
vard University Press (1981)
52. McNeill, D.: Gesture and thought. Chicago: University of Chicago Press (2005)
53. Mehrabian, A.: Orientation behaviors and nonverbal attitude communication. Journal of
Communication, 17(4), 324-332 (1967)
54. Millar, F.E., Rogers, L.E.: A relational approach to interpersonal communication. In
Miller, G.R. (ed.), Explorations in interpersonal communication Beverly Hills, CA: Sage,
87-105 (1976)
55. Miller, G.R. (ed.): Explorations in interpersonal communication. London: Sage (1976)
56. Mozziconacci, S.: Pitch variations and emotions in speech. In Proceedings of ICPhS 1995,
Elenius, K., Branderudf, P. (eds), 1, 178-181, Arne Strömbergs Grafiska (1995)
57. Nushikyan, E.A.: Intonational universals in textual context. In Proceedings of ICPhS 1995,
Elenius, K., Branderudf, P. (eds), 1, 258-261, Arne Strömbergs Grafiska (1995)
58. Oatley, K., Jenkins, J. M.: Understanding emotions. Oxford, England: Blackwell (1996)
59. Panksepp, J.: Affective neuroscience: A conceptual framework for the neurobiological
study of emotions. In Strongman K (ed.) International Review of Studies of Emotions,
Chichester, England: Wiley, 1, 59-99 (1991)
60. Plato: The Republic. (375 BC). Harmondworth, Middlesex: Penguin, current edition
(1955)
61. Plutchik, R.: Emotions and their vicissitudes: Emotions and psychopathology. In Haviland,
M., Lewis, J.M. (eds) Handbook of Emotion, New York: Guilford Press, 53-66 (1993)
62. Plutchik, R.: The emotions. Lanham, MD: University Press of America (1991)
63. Pittam, J., Scherer, K.R.: Vocal expression and communication of emotion. In Haviland,
M., Lewis, J.M. (eds) Handbook of Emotion, New York: Guilford Press, 185-197 (1993)
64. Ricoeur, P.: The voluntary and the involuntary. Translation by Kohak, K. Original work
1950, Northwestern University Press (1966)
65. Rosenberg, B.G., Langer, J.: A study of postural-gestural communication. Journal of Per-
sonality and Social Psychology 2(4), 593-597 (1965)
66. Scherer, K.: Vocal communication of emotion: A review of research paradigms. Speech
Communication, 40, 227-256 (2003)
67. Scherer, K.R., Banse, R., Wallbott, H.G.: Emotion inferences from vocal expression corre-
late across languages and cultures. Journal of Cross-Cultural Psychology, 32, 76–92
(2001)
68. Scherer, K.R., Banse, R., Wallbott, H.G., Goldbeck, T.: Vocal cues in emotion encoding
and decoding. Motivation and Emotion, 15, 123–148 (1991)
69. Scherer, K.R: Vocal correlates of emotional arousal and affective disturbance. In Wagner,
H., Manstead, A. (eds.) Handbook of social Psychophysiology New York: Wiley, 165–197
(1989)
70. Scherer, K.R.: Experiencing Emotion: A Cross-cultural Study, Cambridge University
Press, Cambridge (1982)
71. Scherer, K.R., Oshinsky, J.S.: Cue utilization in emotion attribution from auditory stimuli.
Motivation and Emotion, 1, 331–346 (1977)
72. Schwartz, G.E., Weinberger, D.A., Singer, J.A.: Cardiovascular differentiation of happi-
ness, sadness, anger, and fear following imagery and exercise. Psychosomatic Medicine,
43, 343–364 (1981)
73. Schwartz, G.E., Ahern, G.L., Brown, S.: Lateralized facial muscle response to positive and
negative emotional stimuli. Psychophysiology, 16, 561-571 (1979)
74. Schwartz, G.E., Fair, P.L., Salt, P., Mandel, M.R., Klerman, G.L.: Facial muscle pattern-
ing to affective imagery in depressed and non-depressed subjects. Science, 192, 489- 491
(1976)
75. Shannon, C.E., Weaver, W.: Mathematical Theory of Communication. US: University of
Illinois Press (1949)
76. Sinanović, S., Johnson, D.H.: Toward a theory of information processing. Submitted to
Signal Processing. http://www-ece.rice.edu/~dhj/cv.html#publications (2006)
77. Tomkins, S.S.: Affect, imagery, consciousness. The positive affects. New York: Springer,
1 (1962)
78. Tomkins, S.S.: Affect, imagery, consciousness. The negative affects. New York: Springer,
2 (1963)
79. Tomkins, S.S.: Affect theory. In Scherer, K.R., Ekman, P.(eds.), Approaches to emotion,
Hillsdale, N.J.: Erlbaum, 163-196 (1984)
80. White, G.M.: Emotion inside out: The anthropology of affect. In Haviland, M., Lewis, J.
M. (eds) Handbook of Emotion, New York: Guilford Press, 29-40 (1993)
Author Index