A Novel Expectation-Maximization Framework for Speech Enhancement in Non-Stationary Noise Environments

Abstract—Voiced speech has a quasi-periodic nature that allows it to be compactly represented in the cepstral domain; this is a distinctive feature compared with noise. Recently, the temporal cepstrum smoothing (TCS) algorithm was proposed and shown to be effective for speech enhancement in non-stationary noise environments. However, the lack of an automatic parameter-updating mechanism limits its adaptability to noisy speech with abrupt changes in SNR across time frames or frequency components. In this paper, an improved speech enhancement algorithm based on a novel expectation-maximization (EM) framework is proposed. The new algorithm starts with the traditional TCS method, which gives the initial guess of the periodogram of the clean speech. This guess is then applied to an ℓ1-norm regularizer in the M-step of the EM framework to estimate the true power spectrum of the original speech, which in turn enables the estimation of the a-priori SNR used in the E-step (in fact a logmmse gain function) to refine the estimate of the clean speech periodogram. The M-step and E-step iterate alternately until convergence. A notable improvement of the proposed algorithm over the traditional TCS method is its adaptability to changes (even abrupt changes) in the SNR of the noisy speech. The performance of the proposed algorithm is evaluated using standard measures on a large set of speech and noise signals. Evaluation results show that a significant improvement is achieved compared to conventional approaches, especially in non-stationary noise environments where most conventional algorithms fail to perform.

Index Terms—Cepstral analysis, expectation-maximization, speech enhancement.

Manuscript received August 15, 2013; revised November 01, 2013; accepted November 01, 2013. Date of publication November 11, 2013; date of current version December 31, 2013. This work was supported by the Hong Kong University Grant Council under Grant B-Q19F. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Thomas Fang Zheng.
D. P. K. Lun and T. W. Shen are with the Centre for Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China (e-mail: [email protected]; [email protected]).
K. C. Ho is with the Department of Electrical and Computer Engineering, University of Missouri, Columbia, MO 65211 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASLP.2013.2290497
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, February 2014.

I. INTRODUCTION

SPEECH enhancement is an important research topic due to its widespread applications in speech recognition, voice communications, and information forensics [1]. Among the various solutions available, spectral subtraction [2] is still one of the most popularly used methods due to its simplicity and efficiency. In these methods, a noisy speech signal is divided into overlapped frames and the short-time Fourier transform (STFT) is applied to each frame to obtain its frequency spectrum. A gain function is then applied to suppress frequency components selected based on some criterion, such as minimizing the mean square error (MSE) from the clean speech [3]. Many gain functions have thus been proposed, including the Wiener filter [4] and the log-minimum mean square error (logmmse) [5] gain functions. Both aim at suppressing the spectral components of low signal-to-noise ratio (SNR) such that the average SNR of the speech can be increased. To improve the efficiency in suppressing the unwanted noisy speech frequency components, the speech presence probability (SPP) function is also applied [6]-[8], such that the magnitude of a speech frequency component is further attenuated if the probability of speech presence is low. Indeed, the efficiency of these methods relies heavily on the accuracy of the estimated a-priori SNR, which is defined as the ratio between the true power spectra of speech and noise. While the estimation of the true speech power spectrum is known to be difficult, the estimation of the true noise power spectrum is not easy either. This is particularly the case when the contaminating noise signal is non-stationary. The result is musical noise [1] introduced into the enhanced speech, which is extremely annoying to users. Many approaches, such as the decision-directed method [3] and efficient SPP estimators [6]-[8], were suggested to reduce the musical noise. Their performance, however, can deteriorate significantly when the SNR is low or when the noise is non-stationary. They may also remove the original speech contents, hence reducing the intelligibility of the resulting speech.

Recently, the temporal cepstrum smoothing (TCS) technique was proposed for speech enhancement [9], [10]. Since voiced speech is quasi-periodic in nature, its magnitude spectrum exhibits peaks and valleys separated by harmonics of the fundamental frequency, which can be compactly represented in the cepstral domain. Since most noises do not have such a harmonic structure, this allows us to selectively reduce the variance of the cepstral coefficients that are likely contributed by noise. In general, the TCS method can improve the accuracy in estimating the a-priori SNR of a noisy speech compared with the traditional spectral subtraction methods. It was also reported that the method works well in some non-stationary noise environments [9]. Nevertheless, the TCS method requires a set of empirically selected parameters to control the cepstrum smoothing process. As there is no mechanism to automatically adjust the parameters, the TCS method cannot adapt itself to the changes in SNR of the noisy speech signal across time frames or frequency components. The problem is particularly obvious if some parts of a noisy speech spectrum have significantly low SNR (e.g., the noise is composed of a few strong tones of varying frequencies). The TCS method cannot fully remove the related cepstral coefficients while leaving the speech content intact. To deal with the problem, it was suggested to further apply an SPP estimator with the TCS technique to remove the outliers in the enhanced speech [11]. The result, however, is still not very satisfactory, since the accuracy of the SPP estimators also deteriorates when the SNR is low or when the noise is non-stationary, as mentioned above.

In this paper, we present an improved speech enhancement algorithm based on a novel expectation-maximization (EM) framework working in the cepstral domain. The EM algorithm was discovered and employed independently by several different researchers until Dempster [12] brought their ideas together and coined the term "EM algorithm". It is particularly suitable for parameter estimation problems in which the data for evaluating the parameters are missing or incomplete. It is known to produce the maximum-likelihood (ML) parameter estimates when there is a many-to-one mapping from an underlying distribution to the distribution governing the observation. The algorithm contains two main steps. The E-step (expectation) gives an expectation of the unknown underlying distribution based on the observed data, and the M-step (maximization) estimates the parameters by maximizing the expectation. The E-step and M-step then iterate alternately until convergence. The EM algorithm has widespread applications in digital image and speech processing [13]-[22]. One of the widely cited applications of the EM algorithm is the estimation of hidden Markov models (HMMs), which is particularly relevant to speech processing [17]-[20]. In [19], an HMM-based gain modeling algorithm was proposed for the enhancement of speech in noise. It applies the EM algorithm for offline training and the recursive EM algorithm for online estimation of the HMM parameters. The EM algorithm is also used in an approximate Bayesian speech enhancement algorithm [20] for learning the speech and noise spectra under the Gaussian approximation. Similar to the conventional model-based speech enhancement methods, these approaches require prior knowledge about the noise model, or it has to be detected online. Performance degrades if there is error in the detection or if the detected noise model is not in the training database. Besides HMM estimation, the EM algorithm is also used in the estimation of autoregressive (AR) models for speech enhancement [21], where the E-step is in fact the Kalman filter and the M-step is similar to the standard Yule-Walker solution for estimating the coefficients of AR processes. It is noted that the performance of the method is rather unstable (particularly at input SNR from 4 dB to 10 dB). Effort was made [22] to improve on the problem by using the Rao-Blackwellized particle filter in the E-step to replace the Kalman filter. However, the overall performance, particularly at low SNR, still has much room for improvement. In general, the performance of model-based speech enhancement methods depends heavily on the accuracy of the model estimation, which is often a challenge when they are working in open environments. There are many other applications of the EM algorithm; interested readers are referred to [23], [24] for more details.

Similar to the abovementioned approaches, the proposed algorithm makes use of the EM algorithm to define a theoretical framework for the design of an iterative speech enhancement process. However, it is non-parametric; hence it does not require specific prior knowledge about the speech or noise model. In the proposed algorithm, the parameters to be estimated are the cepstral coefficients of the true speech power spectrum, whose accurate estimation is important in speech enhancement. They enable the computation of the a-priori SNR of the noisy speech, which is one of the most essential parameters required in different speech enhancement gain functions, as mentioned above. The proposed algorithm first makes use of the TCS technique to generate an initial guess of the clean speech periodogram, which is the complete data set of our problem. It is applied to an ℓ1-norm regularizer [30] in the M-step of our EM framework to give the first estimate of the required cepstral coefficients of the true speech power spectrum. They are then used to compute the a-priori SNR that is needed for the logmmse gain function to refine the estimate of the clean speech periodogram, which is the E-step of our EM framework. Subsequently, the estimate is fed back to the M-step to refine the estimation of the cepstral coefficients of the true speech power spectrum. The E-step and M-step iterate alternately until convergence is reached. The operation is illustrated in Fig. 1. A notable improvement of the proposed algorithm over the traditional non-parametric speech enhancement methods is that, due to the iterative process, the proposed algorithm can adapt to the changes (even abrupt changes) in SNR of the noisy speech. In addition, the proposed algorithm fully utilizes the sparsity of speech in the cepstral domain by adopting an ℓ1-norm regularizer in the M-step. This enables the regularization process to be carried out on coefficients with improved SNR and hence reduces the effect of the error in estimating the non-stationary noise statistics. As a result, the proposed algorithm has outstanding performance when working in non-stationary noise environments. Extensive performance evaluations have been conducted using the speech samples from the TIMIT database [25] contaminated by many different noise signals. Significant improvement is noted in almost all cases over the competing speech enhancement methods, measured using standard performance metrics.

This paper is organized as follows. In Section II, a brief review of the traditional spectral subtraction and temporal cepstrum smoothing algorithms is given. It is followed by a brief introduction to the EM algorithm in Section III. The new EM framework for speech enhancement in non-stationary environments is described in Section IV. The simulation results are shown in Section V, and conclusions are drawn in Section VI.

II. SPECTRAL SUBTRACTION AND CEPSTRUM SMOOTHING

Let us begin with a speech signal x(n) that is contaminated by additive noise d(n), such that the observed noisy speech is y(n) = x(n) + d(n). In the frequency domain, the noisy speech over a short-time frame can be expressed as Y(k,l) = X(k,l) + D(k,l), where l is the frame index, and Y(k,l), X(k,l) and D(k,l) represent the k-th spectral components of the Fourier transforms of y(n), x(n) and d(n) over the l-th time frame, respectively.
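The frame-based analysis-modification-synthesis pipeline described above (overlapped frames, STFT, spectral gain, inverse transform) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the periodic Hann window, and the 50% overlap are our assumptions.

```python
import numpy as np

def stft_frames(y, frame_len=512, hop=256):
    """Split y into overlapped frames, window them, and return the spectra Y(k, l)."""
    # Periodic Hann: at 50% overlap consecutive windows sum exactly to 1 (COLA).
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    return np.stack([np.fft.rfft(win * y[l * hop:l * hop + frame_len])
                     for l in range(n_frames)])

def istft_overlap_add(Y, frame_len=512, hop=256):
    """Inverse STFT by overlap-add; no synthesis window is needed here."""
    n_frames = Y.shape[0]
    y = np.zeros((n_frames - 1) * hop + frame_len)
    for l in range(n_frames):
        y[l * hop:l * hop + frame_len] += np.fft.irfft(Y[l], n=frame_len)
    return y

def enhance(y, gain_fn, frame_len=512, hop=256):
    """Apply a spectral gain G(k, l) = gain_fn(|Y(k, l)|^2) to every noisy frame."""
    Y = stft_frames(y, frame_len, hop)
    X_hat = gain_fn(np.abs(Y) ** 2) * Y     # gain computed from the noisy periodogram
    return istft_overlap_add(X_hat, frame_len, hop)
```

With a unit gain the pipeline reconstructs the interior of the signal exactly, which is a convenient sanity check before plugging in a real gain function such as the Wiener or logmmse gain.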
Fig. 1. The operation of the proposed speech enhancement algorithm based on the new EM framework.

We shall denote the power spectra of the noisy speech, the underlying clean speech and the input additive noise as \lambda_y(k,l), \lambda_x(k,l) and \lambda_d(k,l), respectively. The spectral subtraction method consists of the application of a spectral gain function G(k,l) to each noisy short-time frame Y(k,l) to enhance the speech signal. The two most popular gain functions are the Wiener filter and the logmmse gain function. Their most commonly used basic forms are [1]:

G_W(k,l) = \frac{\xi(k,l)}{1 + \xi(k,l)}    (1)

G_{LSA}(k,l) = \frac{\xi(k,l)}{1 + \xi(k,l)} \exp\Big\{ \frac{1}{2} \int_{v(k,l)}^{\infty} \frac{e^{-t}}{t}\,dt \Big\}, \qquad v(k,l) = \frac{\xi(k,l)}{1 + \xi(k,l)}\,\gamma(k,l)    (2)

where \xi(k,l) = \lambda_x(k,l)/\lambda_d(k,l) is the a-priori SNR and \gamma(k,l) = |Y(k,l)|^2/\lambda_d(k,l) is the a-posteriori SNR. In practice \xi(k,l) has to be estimated; the maximum-likelihood (ML) estimate and the popular decision-directed estimate [3] are

\hat{\xi}_{ML}(k,l) = \gamma(k,l) - 1    (3)

\hat{\xi}_{DD}(k,l) = \alpha\, G^2(k,l-1)\,\gamma(k,l-1) + (1-\alpha)\max\{\gamma(k,l)-1,\, 0\}    (4)

where the smoothing parameter \alpha and the lower limit \xi_{min} imposed on the estimates control the trade-off between the amount of noise reduction and the distortion of speech transients in a speech enhancement framework. These parameters for the different gain functions need to be determined empirically. It is believed that they should be adaptively adjusted in order to achieve optimal performance for noisy speech of different SNR values. It is, however, difficult to derive an efficient algorithm for such a purpose due to the empirical nature of these parameters. Although much effort has been devoted to resolving these difficulties, in general the performance of current spectral subtraction algorithms still degrades significantly when the SNR is low or when the noise is non-stationary.

It is shown in [9] that the TCS method can give a good estimation of the a-priori SNR for some non-stationary noise environments. The algorithm can be implemented by first computing the ML estimate of the a-priori clean speech power spectrum as follows:

\hat{\lambda}_x(k,l) = \max\{\hat{\xi}_{ML}(k,l),\, \xi_{min}\}\,\lambda_d(k,l)    (5)

where \xi_{min} is the minimum value allowed for \hat{\xi}_{ML}(k,l) and \hat{\xi}_{ML}(k,l) is the maximum-likelihood estimate of the a-priori SNR given by (3). Next, the cepstral representation of \hat{\lambda}_x(k,l) is computed as,

\lambda_x^{c}(q,l) = \mathrm{idft}\{\ln \hat{\lambda}_x(k,l)\}    (6)

where q is the cepstral index, also known as the quefrency index [27], and idft is the inverse discrete Fourier transform. Next, the selected cepstral coefficients are recursively smoothed over time with a quefrency-dependent parameter \delta(q) as follows:

\bar{\lambda}_x^{c}(q,l) = \delta(q)\,\bar{\lambda}_x^{c}(q,l-1) + (1-\delta(q))\,\lambda_x^{c}(q,l)    (7)
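The TCS steps just described can be sketched compactly: the ML clean-speech PSD estimate of (5), the cepstrum of its logarithm as in (6), and the quefrency-dependent recursive smoothing of (7). The function names, the xi_min value, and the constant smoothing vector are illustrative assumptions, not the paper's parameter choices.

```python
import numpy as np

def ml_clean_psd(noisy_psd, noise_psd, xi_min=0.01):
    """Eqs. (3)/(5): lam_x_hat = max(xi_ML, xi_min) * lam_d with xi_ML = |Y|^2/lam_d - 1."""
    xi_ml = noisy_psd / noise_psd - 1.0
    return np.maximum(xi_ml, xi_min) * noise_psd

def tcs_smooth(psd_frames, delta):
    """Eqs. (6)/(7): take the cepstrum of ln(lam_x_hat) per frame, smooth it
    recursively over frames with the quefrency-dependent factor delta[q],
    then map the result back to the spectral domain."""
    out, prev = [], None
    for lam in psd_frames:
        c = np.real(np.fft.ifft(np.log(lam)))                 # eq. (6)
        prev = c if prev is None else delta * prev + (1.0 - delta) * c  # eq. (7)
        out.append(np.exp(np.real(np.fft.fft(prev))))         # back to the spectrum
    return out
```

For identical consecutive frames the smoothing recursion is a fixed point, so the round trip through the cepstral domain returns the input PSD, which gives a simple correctness check.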
III. THE EXPECTATION MAXIMIZATION ALGORITHM

The basic idea behind the EM algorithm can be described as follows. Assume that \theta is the parameter set we would like to estimate and that the probability density function (pdf) f(z; \theta) of some data set z given \theta is known, where z is referred to as the complete data set in the context of the EM algorithm. Let us also assume that the pdf is a continuous and appropriately differentiable function of \theta. If z is known, \theta can be readily evaluated by maximizing the log-likelihood:

\hat{\theta} = \arg\max_{\theta} \ln f(z; \theta)    (9)

Unfortunately, in many practical applications some or all elements of z cannot be obtained directly from the experiments, but only by means of another observed data set y. Besides, there can be a many-to-one mapping between z and y. So, for the E-step of the EM algorithm, we first compute the expectation of \ln f(z; \theta) given the data set y and our current estimate \hat{\theta}_i of \theta. That is,

Q(\theta; \hat{\theta}_i) = E\big[\ln f(z; \theta) \,\big|\, y, \hat{\theta}_i\big]    (10)

where \hat{\theta}_i is the i-th estimate of \theta and i is the iteration index. Then, for the M-step of the EM algorithm, we find the value of \theta which maximizes Q(\theta; \hat{\theta}_i):

\hat{\theta}_{i+1} = \arg\max_{\theta} Q(\theta; \hat{\theta}_i)    (11)

where \hat{\theta}_{i+1} is our refined estimate of \theta. The E-step and M-step then iterate alternately and will converge to give the ML estimate of \theta, as proven in [12].

IV. THE NEW EM FRAMEWORK FOR SPEECH ENHANCEMENT IN NON-STATIONARY NOISE ENVIRONMENTS

When applying the EM algorithm to speech enhancement, we consider the cepstral coefficients \theta = \{\theta_q\} of the true clean speech power spectrum \lambda_x(k) to be the parameters for estimation. They are defined as,

\theta_q = \frac{1}{N}\sum_{k=0}^{N-1} \ln \lambda_x(k)\, e^{j 2\pi q k / N}, \qquad q = 0, \dots, N-1    (12)

where N is the total number of frequency components and the frame index l is dropped for notational simplicity. As explained in Section II, \lambda_x(k) is an important parameter to be estimated in speech enhancement applications. The objective of the proposed algorithm is to obtain a good estimate of \theta so as to compute \lambda_x(k) based on (12). The a-priori SNR of the noisy speech can then be obtained and used in the traditional speech enhancement gain functions such as (1) or (2) for the estimation of the unknown clean speech periodogram.

Next, we select the cepstral coefficients of the original clean speech periodogram, denoted as z = \{z_q\}, to be the complete data set in our EM framework. z can be computed from the periodogram of the original speech S_x(k), i.e. S_x(k) = |X(k)|^2, as follows:

z_q = \frac{1}{N}\sum_{k=0}^{N-1} \big(\ln S_x(k) + \gamma_E\big)\, e^{j 2\pi q k / N}    (13)

where

\gamma_E = \lim_{n \to \infty}\Big(\sum_{m=1}^{n}\frac{1}{m} - \ln n\Big) \approx 0.5772    (14)

and \gamma_E is Euler's constant. It is shown in [28], [29] that under some regularity conditions and for large sample size (N \to \infty) real-valued data, the estimated cepstral coefficients z_q are even-symmetric and independent random variables having normal distributions with means \theta_q and variances \sigma_q^2 as follows:

z_q \sim \mathcal{N}(\theta_q, \sigma_q^2)    (15)

where

\sigma_q^2 = \frac{\pi^2}{6N}\, c_q, \qquad c_q = \begin{cases} 2, & q \in \{0, N/2\} \\ 1, & \text{otherwise} \end{cases}    (16)

In the remainder of this section we shall drop the index l, where appropriate, to simplify the equations. The dependency of the relevant quantities on l should be apparent.

It can be seen in (15) that the required parameter \theta_q is in fact the mean of the complete data z_q, which has a normal distribution. However, z_q is unknown since we do not have the original speech data. Hence we cannot directly compute \theta_q from z_q. We have to rely on the observed noisy speech periodogram S_y(k) = |Y(k)|^2 to help us estimate \theta_q. The expectation of z_q given the data set S_y and the current estimate \hat{\theta}^{(i)} of \theta can be expressed as follows:

\hat{z}_q^{(i)} = E\big[z_q \,\big|\, S_y, \hat{\theta}^{(i)}\big]    (17)

From (15) and (16), we know that,

\ln f(z; \theta) = -\sum_{q} \frac{(z_q - \theta_q)^2}{2\sigma_q^2} + C    (18)

where C is a constant independent of \theta. Hence

Q(\theta; \hat{\theta}^{(i)}) = E\big[\ln f(z; \theta) \,\big|\, S_y, \hat{\theta}^{(i)}\big] = -\sum_{q} \frac{\big(\hat{z}_q^{(i)} - \theta_q\big)^2}{2\sigma_q^2} + C'    (19)

We shall apply Q(\theta; \hat{\theta}^{(i)}) to the M-step of the proposed algorithm. The purpose of the M-step is to optimize \theta in order to maximize Q(\theta; \hat{\theta}^{(i)}).
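The log-periodogram statistics underlying (13)-(16), i.e. a mean offset of minus Euler's constant and a per-bin variance of pi^2/6, can be checked numerically. The following Monte-Carlo sketch is our illustration (trial count, FFT size and seed are arbitrary choices, not from the paper):

```python
import numpy as np

EULER_GAMMA = 0.57721566

def log_periodogram_stats(n_trials=20000, n=256, seed=0):
    """For unit-variance white Gaussian noise, the periodogram bins away from DC and
    Nyquist are exponentially distributed with mean lam(k) = 1, so ln S(k) has mean
    -EULER_GAMMA and variance pi^2/6 -- the statistics behind eqs. (13)-(16)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_trials, n))
    S = np.abs(np.fft.fft(x, axis=1)) ** 2 / n   # periodogram; true PSD is 1
    logS = np.log(S[:, 1:n // 2])                # skip the chi^2_1 DC/Nyquist bins
    return logS.mean(), logS.var()
```

The empirical mean and variance land close to -0.5772 and 1.6449, which is exactly the bias removed in (13) and the variance used for sigma_q^2 in (16).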
From the recent research in iterative regularization [30], it is known that if the signal is sparse, or if the signal can be transformed into a domain where its coefficients are sparse, the inclusion of a penalty term made up of the ℓ1 norm of the signal or its coefficients can significantly improve the chance for the iterative process to reach its global optimum. Consequently, there are works adopting this idea in the M-step of the EM algorithm. For instance, the regularized M-step is used in [31] and [32] for image restoration and image reconstruction applications. For the proposed method, the EM algorithm operates in the cepstral domain. Since the cepstral coefficients of speech are sparse and very much different from those of noise, we include a penalty term made up of the ℓ1 norm of the desired cepstral coefficients in the optimization process. More specifically, the M-step of the proposed algorithm is given as follows:

\hat{\theta}^{(i+1)} = \arg\max_{\theta}\Big[ Q(\theta; \hat{\theta}^{(i)}) - \eta \sum_{q} w(q)\,|\theta_q| \Big]    (21)

In (21), \eta is a free parameter that adjusts the amount of regularization applied to the maximization process. Its selection method will be discussed at the end of this section. w(q) is dependent on q and is defined as a weakly differentiable binary function such that,

w(q) = \begin{cases} 0, & q \in Q_x \\ 1, & \text{otherwise} \end{cases}    (22)

where Q_x denotes the set of very low quefrencies and the quefrencies associated with the fundamental frequency. The introduction of the penalty term imposes a constraint on the optimization process such that the energy of the estimated cepstral coefficients will concentrate at very low quefrencies as well as the quefrencies associated with the fundamental frequency. These are exactly the features of voiced speech in the cepstral domain. By doing so, the optimization process can be carried out on coefficients with improved SNR, which reduces the effect of the estimation error of the non-stationary noise characteristics. This turns out to be the major factor that leads to the good performance of the proposed algorithm in non-stationary noise environments. The use of the ℓ1 norm in the penalty term in (21) is based on the assumption that \theta_q has a Laplacian prior.

\hat{\theta}_q^{(i+1)} can be obtained by taking the derivative of the right-hand side of (21) and setting the result to 0. That is, let

\frac{\partial}{\partial \theta_q}\Big[ -\frac{\big(\hat{z}_q^{(i)} - \theta_q\big)^2}{2\sigma_q^2} - \eta\, w(q)\,|\theta_q| \Big] = 0    (23)

Then

\frac{\hat{z}_q^{(i)} - \theta_q}{\sigma_q^2} - \eta\, w(q)\,\mathrm{sgn}(\theta_q) = 0    (24)

where sgn(x) returns 1, 0 or -1 if x is positive, 0 or negative, respectively. Hence the optimal \theta_q is the one satisfying (24). It is then used as the new estimate of \theta_q. That is,

\hat{\theta}_q^{(i+1)} = \mathrm{sgn}\big(\hat{z}_q^{(i)}\big)\,\max\big\{\big|\hat{z}_q^{(i)}\big| - \eta\,\sigma_q^2\, w(q),\, 0\big\}    (25)

Equation (25) is indeed the well-known soft-thresholding non-linearity [30], [33], with the additional constraint that only the coefficients with w(q) = 1 are thresholded.

To implement (25), we need a good estimate of \hat{z}^{(i)}. For the proposed algorithm, we estimate \hat{z}^{(i)} by the following,

\hat{z}^{(i)} = \begin{cases} \mathrm{idft}\big\{\ln \hat{\lambda}_x^{TCS}(k)\big\}, & i = 0 \\ \mathrm{idft}\big\{\ln |\hat{X}^{(i)}(k)|^2\big\}, & i > 0 \end{cases}    (26)

where \hat{\lambda}_x^{TCS}(k) is the TCS estimate of the clean speech power spectrum, |\hat{X}^{(i)}(k)|^2 is the clean speech periodogram estimated by the logmmse gain function at iteration i, and dft and idft are the discrete Fourier transform and its inverse, respectively. In both cases, bias is removed using the approach in [29] whenever transforming data between the cepstral domain and the spectral domain. As shown in (26), we adopt the TCS method [9] to obtain the initial guess of \hat{z} at i = 0. Afterwards, we update \hat{z} by using a logmmse gain function [5] in which the a-priori SNR is computed based on the current estimate \hat{\theta}^{(i)}. The logmmse gain function theoretically gives the minimum mean square error estimation of the log-magnitude spectra. From [5], we know that,

\hat{A}(k) = \exp\big\{ E\big[\ln A(k) \,\big|\, Y(k)\big] \big\}, \qquad A(k) = |X(k)|    (27)

Hence,

|\hat{X}(k)|^2 = G_{LSA}^2(k)\,|Y(k)|^2    (28)

with G_{LSA} as given in (2). It can be seen in (28) that the logmmse gain function can give a good estimation of |X(k)|^2. However, we do not use it for the initial guess since, without a good a-priori SNR estimator, the logmmse gain function can accidentally remove speech spectral components of low SNR. This is particularly the case if the noise is non-stationary. The TCS method gives a reasonably good estimate of \hat{z} without the need for a very good a-priori SNR estimator. It also works reasonably well for non-stationary noises. It is thus used as the initial guess of \hat{z} and is afterwards refined by the logmmse gain function using the a-priori SNR estimate obtained from the M-step of the proposed algorithm.
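The regularized M-step of (21)-(25) reduces to per-quefrency soft thresholding, which can be verified directly against a brute-force maximization of the scalar objective. The names and the grid used for the check are ours; the closed form itself is eq. (25):

```python
import numpy as np

def m_step_soft_threshold(z_hat, sigma2, eta, w):
    """Eq. (25): theta_q = sgn(z_q) * max(|z_q| - eta * sigma_q^2 * w(q), 0).
    Quefrencies with w(q) = 0 (speech-related) pass through unchanged."""
    return np.sign(z_hat) * np.maximum(np.abs(z_hat) - eta * sigma2 * w, 0.0)

def m_step_brute_force(z, sigma2, eta, w):
    """Directly maximize the per-coefficient objective in (21), for verification only."""
    grid = np.linspace(-4.0, 4.0, 160001)
    obj = -(z - grid) ** 2 / (2.0 * sigma2) - eta * w * np.abs(grid)
    return grid[np.argmax(obj)]
```

The brute-force maximizer agrees with the closed form to within the grid resolution, and setting w = 0 confirms that protected quefrencies are left untouched.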
The evaluations verified that the proposed algorithm improves over the competing speech enhancement methods from the literature in almost all testing conditions, i.e., different kinds of noise at different noise levels, using different evaluation measures. They have clearly demonstrated the robustness of the proposed algorithm in general speech enhancement applications. For future work, methods for enhancing the unvoiced speech deserve further investigation.

Fig. 6. (Left) Segmental SNR improvement over the noisy speech; (Right) PESQ improvement over the noisy speech achieved by using MMSE-LSA [5], MMSE-Gamma [34], MMSE-LSA with FP SPP [7], MMSE-LSA with TCS SPP [11] and the proposed Logmmse-L1-EM for the case of white noise, pink noise, destroyer engine noise, F16 noise, buccaneer noise and babble noise contamination.

TABLE II. Composite measurement comparison of different algorithms.

APPENDIX
MAP ESTIMATION OF \theta_q

Given \hat{z}_q, the MAP estimator of \theta_q is,

\hat{\theta}_q^{MAP} = \arg\max_{\theta_q} f(\theta_q \,|\, \hat{z}_q)    (32)

Applying Bayes' rule and taking the logarithm, (32) becomes

\hat{\theta}_q^{MAP} = \arg\max_{\theta_q} \big[\ln f(\hat{z}_q \,|\, \theta_q) + \ln f(\theta_q)\big]    (38)

We can obtain the MAP estimate of \theta_q by taking the derivative of the terms in the square bracket in (38) with respect to \theta_q. Then, using (15),

\frac{\hat{z}_q - \theta_q}{\sigma_q^2} + \frac{d \ln f(\theta_q)}{d\theta_q} = 0    (39)

We now need the prior f(\theta_q), i.e. the distribution of the cepstral coefficients of the clean speech. In Fig. 7, we show the pdf f(z_q) obtained from the cepstral coefficients of 40 male and 40 female test speeches from the TIMIT database [25], and fit it with different distributions. It is seen that it can be modeled by a Laplacian, Gaussian or GGD distribution without large error. It is highly likely that this will also be the case for \theta_q. Assume that \theta_q is modeled using a Laplacian pdf:

f(\theta_q) = \frac{1}{2b} \exp\Big(-\frac{|\theta_q|}{b}\Big)    (41)

As a result,

\frac{d \ln f(\theta_q)}{d\theta_q} = -\frac{1}{b}\,\mathrm{sgn}(\theta_q)    (42)

Putting (42) into (39), we have,

\hat{\theta}_q^{MAP} = \mathrm{sgn}(\hat{z}_q)\,\max\Big\{|\hat{z}_q| - \frac{\sigma_q^2}{b},\, 0\Big\}    (43)

This is the soft-threshold nonlinearity. \eta in this case is given by,

\eta = \frac{1}{b}    (45)
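The Appendix's result, that a Gaussian likelihood (15) combined with a Laplacian prior (41) yields the soft threshold of (43), can be verified numerically. This check is our illustration (the grid range and test values are arbitrary), not part of the paper:

```python
import numpy as np

def map_closed_form(z, sigma2, b):
    """Eq. (43): MAP estimate under a Laplacian prior; threshold is sigma^2 / b."""
    return np.sign(z) * max(abs(z) - sigma2 / b, 0.0)

def map_grid_search(z, sigma2, b):
    """Maximize the log-posterior ln f(z|theta) + ln f(theta) on a dense grid."""
    theta = np.linspace(-4.0, 4.0, 160001)
    log_post = -(z - theta) ** 2 / (2.0 * sigma2) - np.abs(theta) / b
    return theta[np.argmax(log_post)]
```

For observations above the threshold the two agree to within the grid resolution, and observations below the threshold are shrunk exactly to zero, matching the shrinkage behavior that motivates the ℓ1 penalty in (21).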
Daniel P. K. Lun received his B.Sc. (Hons.) degree from the University of Essex, U.K., and his Ph.D. degree from the Hong Kong Polytechnic University (formerly called Hong Kong Polytechnic) in 1988 and 1991, respectively. He is now an Associate Professor and Acting Head of the Department of Electronic and Information Engineering of the Hong Kong Polytechnic University. Dr. Lun is active in research activities. He has published more than 120 international journal and conference papers. His research interests include speech and image enhancement, wavelet theories and applications, as well as multimedia technologies. He was the Chairman of the IEEE Hong Kong Chapter of Signal Processing in 1999-00. He was the Finance Chair of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing, the General Chair of the 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing (ISIMP2004), and the Finance Chair of the 2010 International Conference on Image Processing (ICIP2010). Dr. Lun is a member of the DSP and the Visual Signal Processing and Communications Technical Committees of the IEEE Circuits and Systems Society. He is a Chartered Engineer, a corporate member of IET and HKIE, and a senior member of IEEE.

Tak-Wai Shen received the B.Eng. (Hons.) degree in Electronic Engineering from Oxford Brookes University, England, and the M.Phil. degree in Electronic Engineering from the Hong Kong Polytechnic University, Hong Kong, in 1992 and 1998, respectively. He is currently working towards the Ph.D. degree at the Department of Electronic and Information Engineering, the Hong Kong Polytechnic University. His research interests include digital signal and image processing, fast algorithms and speech enhancement. He is a member of the IEEE, USA and IET, UK.

K. C. Ho (M'91-SM'00-F'09) was born in Hong Kong. He received the B.Sc. degree with First Class Honors in 1988 and the Ph.D. degree in 1991, both from the Chinese University of Hong Kong, Hong Kong. He was a research associate in the Royal Military College of Canada from 1991 to 1994. He joined Bell-Northern Research, Montreal, Canada in 1995 as a member of scientific staff. He was a faculty member in the Department of Electrical Engineering at the University of Saskatchewan, Saskatoon, Canada from September 1996 to August 1997. Since September 1997, he has been with the University of Missouri and is currently a Professor in the Electrical and Computer Engineering Department. His research interests are in sensor array processing, speech processing, source localization, detection and estimation, and wireless communications. Dr. Ho is the Chair of the Sensor Array and Multichannel Technical Committee in the IEEE Signal Processing Society. He was the Associate Rapporteur of the ITU-T Q16/SG16: Speech Enhancement Functions in Signal Processing Network Equipment in 2013 and was the Rapporteur of the ITU-T Q15/SG16: Voice Gateway Signal Processing Functions and Circuit Multiplication Equipment/Systems from 2009 to 2012. He has served as the Editor of the ITU-T Standard Recommendations G.160: Voice Enhancement Devices and G.168: Digital Network Echo Cancellers for over seven years. He was an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING from 2003 to 2006 and from 2009 to 2013, and of the IEEE Signal Processing Letters from 2004 to 2008. Dr. Ho received the Junior Faculty Research Award in 2003 and the Senior Faculty Research Award in 2009 from the College of Engineering at the University of Missouri. He is an inventor of 20 patents in the United States, Europe, Asia and Canada on geolocation and signal processing for wireless communications.