
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 2, FEBRUARY 2014, p. 335

A Novel Expectation-Maximization Framework for Speech Enhancement in Non-Stationary Noise Environments

Daniel P. K. Lun, Senior Member, IEEE, Tak-Wai Shen, Student Member, IEEE, and K. C. Ho, Fellow, IEEE

Abstract: Voiced speeches have a quasi-periodic nature that allows them to be compactly represented in the cepstral domain. This is a distinctive feature compared with noises. Recently, the temporal cepstrum smoothing (TCS) algorithm was proposed and was shown to be effective for speech enhancement in non-stationary noise environments. However, the lack of an automatic parameter updating mechanism limits its adaptability to noisy speeches with abrupt changes in SNR across time frames or frequency components. In this paper, an improved speech enhancement algorithm based on a novel expectation-maximization (EM) framework is proposed. The new algorithm starts with the traditional TCS method, which gives the initial guess of the periodogram of the clean speech. It is then applied to an ℓ1 norm regularizer in the M-step of the EM framework to estimate the true power spectrum of the original speech. This in turn enables the estimation of the a-priori SNR, which is used in the E-step, indeed a logmmse gain function, to refine the estimation of the clean speech periodogram. The M-step and E-step iterate alternately until convergence. A notable improvement of the proposed algorithm over the traditional TCS method is its adaptability to the changes (even abrupt changes) in SNR of the noisy speech. Performance of the proposed algorithm is evaluated using standard measures based on a large set of speech and noise signals. Evaluation results show that a significant improvement is achieved compared to conventional approaches, especially in non-stationary noise environments where most conventional algorithms fail to perform.

Index Terms: Cepstral analysis, expectation-maximization, speech enhancement.

Manuscript received August 15, 2013; revised November 01, 2013; accepted November 01, 2013. Date of publication November 11, 2013; date of current version December 31, 2013. This work was supported by the Hong Kong University Grant Council under Grant B-Q19F. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Thomas Fang Zheng. D. P. K. Lun and T. W. Shen are with the Centre for Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China (e-mail: [email protected]; [email protected]). K. C. Ho is with the Department of Electrical and Computer Engineering, University of Missouri, Columbia, MO 65211 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASLP.2013.2290497. 2329-9290 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Speech enhancement is an important research topic due to its widespread applications in speech recognition, voice communications, and information forensics [1]. Among the various solutions available, spectral subtraction [2] is still one of the most popularly used methods due to its simplicity and efficiency. In these methods, a noisy speech signal is divided into overlapped frames and the short-time Fourier transform (STFT) is applied to each frame to obtain its frequency spectrum. A gain function is then applied to suppress frequency components selected based on some criteria, such as to minimize the mean square error (MSE) from the clean speech [3]. Many gain functions were thus proposed, including the Wiener filter [4] and the log-minimum mean square error (logmmse) [5] gain functions. Both aim at suppressing the spectral components of low signal-to-noise ratio (SNR) so that the average SNR of the speech can be increased. To improve the efficiency in suppressing the unwanted noisy speech frequency components, the speech presence probability (SPP) function is also applied [6]-[8], such that the magnitude of a speech frequency component is further attenuated if the probability of speech presence is low. Indeed, the efficiency of these methods relies heavily on the accuracy of the estimated a-priori SNR, which is defined as the ratio between the true power spectra of speech and noise. While the estimation of the true speech power spectrum is known to be difficult, the estimation of the true noise power spectrum is not easy either. This is particularly the case when the contaminating noise signal is non-stationary. The result is musical noise [1] introduced into the enhanced speech, which is extremely annoying to users. Many approaches, such as the decision-directed method [3] and efficient SPP estimators [6]-[8], were suggested to reduce the musical noise. Their performance, however, can deteriorate significantly when the SNR is low or when the noise is non-stationary. They may also remove the original speech contents, hence reducing the intelligibility of the resulting speech.

Recently, the temporal cepstrum smoothing (TCS) technique was proposed for speech enhancement [9], [10]. Since voiced speeches are quasi-periodic in nature, their magnitude spectrum exhibits peaks and valleys separated by harmonics of the fundamental frequency, which can be compactly represented in the cepstral domain. Since most noises do not have such a harmonic structure, this allows us to selectively reduce the variance of the cepstral coefficients that are likely contributed by noise. In general, the TCS method can improve the accuracy in estimating the a-priori SNR of a noisy speech compared with the traditional spectral subtraction methods. It was also reported that the method works well in some non-stationary noise environments [9]. Nevertheless, the TCS method requires a set of empirically selected parameters to control the cepstrum smoothing process. As there is no mechanism to automatically adjust the parameters, the TCS method cannot adapt itself to the changes in SNR of the noisy speech signal across time frames or frequency components. The problem is particularly obvious if some parts of a noisy speech spectrum have significantly low SNR (e.g., the noise is composed of a few strong tones of varying frequencies). The TCS method cannot fully remove the related cepstral coefficients with the speech content kept intact. To deal with this problem, it was suggested to further apply an SPP estimator with the TCS technique to remove the outliers in the enhanced speech [11]. The result, however, is still not very satisfactory, since the accuracy of the SPP estimators also deteriorates when the SNR is low or when the noise is non-stationary, as mentioned above.

In this paper, we present an improved speech enhancement algorithm based on a novel expectation-maximization (EM) framework working in the cepstral domain. The EM algorithm was discovered and employed independently by several different researchers until Dempster [12] brought their ideas together and coined the term "EM algorithm". It is particularly suitable for parameter estimation problems in which the data for evaluating the parameters are missing or incomplete. It is known to produce the maximum-likelihood (ML) parameter estimates when there is a many-to-one mapping from an underlying distribution to the distribution governing the observation. The algorithm contains two main steps. The E-step (expectation) gives an expectation of the unknown underlying distribution based on the observed data, and the M-step (maximization) estimates the parameters by maximizing that expectation. The E-step and M-step then iterate alternately until convergence. The EM algorithm has widespread applications in digital image and speech processing [13]-[22]. One of the widely cited applications of the EM algorithm is the estimation of hidden Markov models (HMMs), which is particularly relevant to speech processing [17]-[20]. In [19], an HMM-based gain modeling algorithm was proposed for the enhancement of speech in noise. It applies the EM algorithm for offline training and the recursive EM algorithm for online estimation of the HMM parameters. The EM algorithm is also used in an approximate Bayesian speech enhancement algorithm [20] for learning the speech and noise spectra under the Gaussian approximation. Similar to the conventional model-based speech enhancement methods, these approaches require prior knowledge about the noise model, or it has to be detected online. Performance degrades if there is an error in detection or if the detected noise model is not in the training database. Besides HMM model estimation, the EM algorithm is also used in the estimation of autoregressive (AR) models for speech enhancement [21], where the E-step is in fact the Kalman filter and the M-step is similar to the standard Yule-Walker solution for estimating the coefficients of AR processes. It is noted that the performance of the method is rather unstable (particularly at input SNR from 4 dB to 10 dB). Effort was made in [22] to improve on this by using the Rao-Blackwellized particle filter in the E-step to replace the Kalman filter. However, the overall performance, particularly at low SNR, still has much room for improvement. In general, the performance of model-based speech enhancement methods depends heavily on the accuracy of the model estimation, which is often a challenge when they are working in open environments. There are many other applications of the EM algorithm, and interested readers are referred to [23], [24] for more details.

Similar to the abovementioned approaches, the proposed algorithm makes use of the EM algorithm to define a theoretical framework for the design of an iterative speech enhancement process. However, it is non-parametric, hence it does not require specific prior knowledge about the speech or noise model. In the proposed algorithm, the parameters to be estimated are the cepstral coefficients of the true speech power spectrum, whose accurate estimation is important in speech enhancement. They enable the computation of the a-priori SNR of the noisy speech, which is one of the most essential parameters required in different speech enhancement gain functions, as mentioned above. The proposed algorithm first makes use of the TCS technique to generate an initial guess of the clean speech periodogram, which is the complete data set of our problem. It is applied to an ℓ1 norm regularizer [30] in the M-step of our EM framework to give the first estimate of the required cepstral coefficients of the true speech power spectrum. They are then used to compute the a-priori SNR that is needed for the logmmse gain function to refine the estimation of the clean speech periodogram, which is the E-step of our EM framework. Subsequently, the estimate is fed back to the M-step to refine the estimation of the cepstral coefficients of the true speech power spectrum. The E-step and M-step iterate alternately until convergence is reached. The operation is illustrated in Fig. 1. A notable improvement of the proposed algorithm over traditional non-parametric speech enhancement methods is that, due to the iterative process, the proposed algorithm can adapt to the changes (even abrupt changes) in SNR of the noisy speech. In addition, the proposed algorithm fully utilizes the sparsity of speeches in the cepstral domain by adopting an ℓ1 norm regularizer in the M-step. This enables the regularization process to be carried out on coefficients with improved SNR, hence reducing the effect of the error in estimating the non-stationary noise statistical characteristics. As a result, the proposed algorithm has outstanding performance when working in non-stationary noise environments. Extensive performance evaluations have been conducted using speech samples from the TIMIT database [25] contaminated by many different noise signals. Significant improvement is noted in almost all cases over the competing speech enhancement methods, measured using standard performance metrics.

This paper is organized as follows. In Section II, a brief review of the traditional spectral subtraction and temporal cepstrum smoothing algorithms is given. It is followed by a brief introduction of the EM algorithm in Section III. The new EM framework for speech enhancement in non-stationary environments is described in Section IV. The simulation results are shown in Section V, and conclusions are drawn in Section VI.

II. SPECTRAL SUBTRACTION AND CEPSTRUM SMOOTHING

Let us begin with a clean speech signal s(n) that is contaminated by additive noise d(n) such that the observed noisy speech is y(n) = s(n) + d(n). In the frequency domain, the noisy speech over a short-time frame can be expressed as Y(k, λ) = S(k, λ) + D(k, λ), where λ is the frame index and Y(k, λ), S(k, λ) and D(k, λ) represent the k-th spectral components from the Fourier transforms of y, s and d over the λ-th time frame, respectively.

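The frame-based processing used by this family of methods (STFT analysis, a gain computed from the a-priori SNR, and overlap-add synthesis) can be sketched as follows. This is only a minimal illustration under simplifying assumptions, not the paper's implementation: the window and overlap are arbitrary, the a-priori SNR is the crude ML estimate, the gain is the textbook Wiener rule, and all function and variable names are illustrative.

```python
import numpy as np

def enhance(y, noise_psd, frame_len=256, hop=128):
    """Sketch of STFT analysis-modification-synthesis with a Wiener-type
    gain G = xi / (1 + xi). noise_psd is a fixed noise power spectrum of
    length frame_len // 2 + 1 (e.g., averaged over noise-only frames)."""
    win = np.hanning(frame_len)
    out = np.zeros(len(y))
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len] * win
        Y = np.fft.rfft(frame)                 # short-time spectrum Y(k)
        gamma = np.abs(Y) ** 2 / noise_psd     # a-posteriori SNR
        xi = np.maximum(gamma - 1.0, 1e-3)     # crude ML a-priori SNR
        gain = xi / (1.0 + xi)                 # Wiener gain function
        out[start:start + frame_len] += np.fft.irfft(gain * Y) * win
    return out
```

In practice the window and hop must satisfy a perfect-reconstruction (COLA) condition and the noise PSD must be tracked over time; both are glossed over here.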
Fig. 1. The operation of the proposed speech enhancement algorithm based on the new EM framework.

We shall denote the power spectra of the noisy speech, the underlying clean speech and the input additive noise as P_Y(k), P_S(k) and P_D(k). The spectral subtraction method consists of the application of a spectral gain function G(k) to each noisy short-time frame spectrum Y(k) to enhance the speech signal. The two most popular gain functions are the Wiener filter and the logmmse gain functions. Their most commonly used basic forms are [1]:

    G_W(k) = ξ(k) / (1 + ξ(k))    (1)

    G_LSA(k) = (ξ(k) / (1 + ξ(k))) · exp( (1/2) ∫_{v(k)}^{∞} (e^{-t} / t) dt ),  where v(k) = ξ(k) γ(k) / (1 + ξ(k))    (2)

The frame index λ is dropped for ease of presentation. It is seen in (1) and (2) that the determination of both gain functions requires the evaluation of two parameters: (i) the a-posteriori SNR γ(k), defined as γ(k) = |Y(k)|² / P_D(k), where |Y(k)|² is also referred to as the periodogram of y; and (ii) the a-priori SNR ξ(k), defined as ξ(k) = P_S(k) / P_D(k). In practice, the power spectrum of the noise P_D(k) is estimated by averaging the periodograms of all the noise frames detected using a voice activity detector (VAD) [26]. We denote the estimate of P_D(k) as P̂_D(k). Obviously, the estimation error can be high if the noise is not stationary. The estimation of ξ(k) is even more difficult since P_S(k) is not known. Hence ξ(k) cannot be exactly evaluated. Different approaches were suggested to estimate ξ(k). The maximum likelihood (ML) estimate of the a-priori SNR can be obtained as follows [1]:

    ξ̂_ML(k) = max( γ̂(k) - 1, 0 ),  where γ̂(k) = |Y(k)|² / P̂_D(k)    (3)

The variance of ξ̂_ML(k), however, is often too large for use in the traditional gain functions. The inaccuracy of ξ̂_ML(k) is further amplified by the estimation error of P̂_D(k). To reduce the variance in estimation, the decision-directed approach is often used in practice, where the a-priori SNR is estimated based on a previous clean-speech estimate as follows [3]:

    ξ̂(k, λ) = α · |Ŝ(k, λ-1)|² / P̂_D(k, λ-1) + (1 - α) · max( γ̂(k, λ) - 1, ξ_min )    (4)

where the parameters α and ξ_min control the trade-off between the amount of noise reduction and the distortion of speech transients in a speech enhancement framework. Although much effort has been devoted to resolving these difficulties, in general the performance of current spectral subtraction algorithms still degrades significantly when the SNR is low or the noise is non-stationary.

It is shown in [9] that the TCS method can give a good estimation of the a-priori SNR in some non-stationary noise environments. The algorithm can be implemented by first computing the ML estimate of the clean speech power spectrum as follows:

    P̂_S,ML(k) = P̂_D(k) · max( ξ̂_ML(k), ξ_min )    (5)

where ξ_min is the minimum value allowed for ξ̂_ML(k), the maximum-likelihood estimate of the a-priori SNR given by (3). Next, the cepstral representation of P̂_S,ML(k) is computed as

    c_S,ML(q) = idft{ ln P̂_S,ML(k) }    (6)

where q is the cepstral index, also known as the quefrency index [27], and idft is the inverse discrete Fourier transform. Next, the selected cepstral coefficients are recursively smoothed over time with a quefrency-dependent parameter β(q) as follows:

    c̄_S(q, λ) = β(q) · c̄_S(q, λ-1) + (1 - β(q)) · c_S,ML(q, λ)    (7)

The parameter β(q) is also smoothed recursively using

    (8)

where the smoothing covers the set of cepstral bin indices associated with the fundamental frequency. All parameters, including β(q) and those in (8) for different q, need to be determined empirically.
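The cepstral smoothing recursion in the spirit of (6) and (7) can be sketched as follows, using NumPy's real FFT as the dft/idft pair. The quefrency-dependent constant beta and the bias handling are simplified stand-ins for the paper's empirically tuned parameters, and the names are illustrative.

```python
import numpy as np

def tcs_smooth(power_spec, prev_cep, beta, eps=1e-12):
    """One frame of temporal cepstrum smoothing: take the cepstrum of the
    log power spectrum (eq. (6)-style), recursively average it with the
    previous frame's smoothed cepstrum using a constant beta that may be a
    per-quefrency array (eq. (7)-style), and map back to a power spectrum.
    Returns (smoothed_power_spectrum, smoothed_cepstrum)."""
    cep = np.fft.irfft(np.log(power_spec + eps))       # cepstrum of log PSD
    smoothed = beta * prev_cep + (1.0 - beta) * cep    # recursive smoothing
    return np.exp(np.fft.rfft(smoothed).real), smoothed
```

With beta = 0 the function is an identity (up to floating-point error), which makes the round trip between the spectral and cepstral domains easy to check.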
It is believed that these parameters should be adaptively adjusted in order to achieve optimal performance for noisy speeches of different SNR values. It is, however, difficult to derive an efficient algorithm for this purpose due to the empirical nature of these parameters.

III. THE EXPECTATION-MAXIMIZATION ALGORITHM

The basic idea behind the EM algorithm can be described as follows. Assume that θ is the parameter set we would like to estimate and that the probability density function (pdf) p(X | θ) of some data set X given θ is known, where X is referred to as the complete data set in the context of the EM algorithm. Let us also assume the pdf is a continuous and appropriately differentiable function of θ. If X is known, θ can be readily evaluated by maximizing the log-likelihood:

    θ̂ = argmax_θ ln p(X | θ)    (9)

Unfortunately, in many practical applications, some or all elements of X cannot be obtained directly from the experiments but only by means of another observed data set Z. Besides, there can be a many-to-one mapping between X and Z. So, for the E-step of the EM algorithm, we first compute the expectation of ln p(X | θ) given the data set Z and our current estimate of θ. That is,

    Q(θ | θ^(i)) = E[ ln p(X | θ) | Z, θ^(i) ]    (10)

where θ^(i) is the i-th estimate of θ and i is the iteration index. Then, for the M-step of the EM algorithm, we find the value of θ which maximizes Q(θ | θ^(i)):

    θ^(i+1) = argmax_θ Q(θ | θ^(i))    (11)

where θ^(i+1) is our refined estimate of θ. The E-step and M-step then iterate alternately and converge to give the ML estimate of θ, as proven in [12].

IV. THE NEW EM FRAMEWORK FOR SPEECH ENHANCEMENT IN NON-STATIONARY NOISE ENVIRONMENTS

When applying the EM algorithm to speech enhancement, we consider the cepstral coefficients of the true clean speech power spectrum to be the parameters θ for estimation. They are defined as

    c_S(q) = idft{ ln P_S(k) },  q = 0, 1, …, M - 1    (12)

where M is the total number of frequency components and the frame index λ is dropped for notational simplicity. As explained in Section II, P_S(k) is an important quantity to be estimated in speech enhancement applications. The objective of the proposed algorithm is to obtain a good estimate of the cepstral coefficients so as to compute P_S(k) based on (12). The a-priori SNR of the noisy speech can then be obtained and used in the traditional speech enhancement gain functions such as (1) or (2) for the estimation of the unknown clean speech periodogram.

Next, we select the cepstral coefficients of the original clean speech periodogram, denoted as x(q), to be the complete data set X in our EM framework. x(q) can be computed from the periodogram of the original speech, i.e. |S(k)|², as follows:

    (13)

where

    (14)

and γ_E is Euler's constant. It is shown in [28], [29] that, under some regularity conditions and for large sample size (M) real-valued data, the estimated cepstral coefficients x(q) are even-symmetric and independent random variables having normal distributions with means and variances as follows:

    (15)

where

    (16)

In the remainder of this section we shall drop the index q, where appropriate, to simplify the equations. The dependency of each term on the relevant quantities should be apparent.

It can be seen in (15) that the required parameter is in fact the mean of the complete data set X, which has a normal distribution. However, it is unknown since we do not have the original speech data. Hence we cannot directly compute θ from X. We have to rely on the observed noisy speech periodogram |Y(k)|² to help us estimate θ. The expectation of ln p(X | θ) given the observed data set Z and the current estimate θ^(i) of θ can be expressed as follows:

    (17)

From (15) and (16), we know that

    (18)

Hence

    (19)

We shall apply Q(θ | θ^(i)) to the M-step of the proposed algorithm. The purpose of the M-step is to optimize θ in order to maximize Q(θ | θ^(i)). From recent research in iterative regularization [30], it is known that if the signal is sparse, or if the signal can be transformed into a domain where its coefficients are sparse, the inclusion of a penalty term made up of the ℓ1 norm of the signal or its coefficients can significantly improve the chance for the iterative process to reach its global optimum point.

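The E-step/M-step alternation of (9)-(11) can be illustrated on a toy problem: a two-component one-dimensional Gaussian mixture with known unit variances and equal weights, estimating only the two means. This is not the paper's speech model; it only shows the iterate-until-convergence structure, and all names are illustrative.

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """Toy EM. E-step: compute the expected component assignments
    (responsibilities) under the current mean estimates. M-step: re-estimate
    the means by maximizing the expected complete-data log-likelihood,
    which here reduces to responsibility-weighted sample means."""
    mu = np.array([x.min(), x.max()])  # crude initial guess
    for _ in range(iters):
        # E-step: responsibilities of each component for each sample
        log_lik = -0.5 * (x[:, None] - mu[None, :]) ** 2
        r = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the expectation with respect to the means
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return np.sort(mu)
```

For well-separated components the iteration converges in a few dozen steps to (approximately) the maximum-likelihood means.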
Consequently, there are works adopting this idea in the M-step of the EM algorithm. For instance, a regularized M-step is used in [31] and [32] for image restoration and image reconstruction applications. For the proposed method, the EM algorithm operates in the cepstral domain. Since the coefficients of speech in the cepstral domain are sparse and are very different from those of noise, we include a penalty term made up of the ℓ1 norm of the desired cepstral coefficients in the optimization process. More specifically, the M-step of the proposed algorithm is given as follows:

    (20)

where the second term is the penalty, which is selected as follows:

    (21)

In (21), the free parameter adjusts the amount of regularization applied to the maximization process. Its selection method will be discussed at the end of this section. The weighting in (21) is dependent on the quefrency and is defined as a weakly differentiable binary function such that

    (22)

The introduction of the penalty term imposes a constraint on the optimization process such that the energy of the estimated cepstral coefficients will concentrate at very low quefrencies as well as at the quefrencies associated with the fundamental frequency. These are exactly the features of voiced speeches in the cepstral domain. By doing so, the optimization process can be carried out on coefficients with improved SNR, and hence reduces the effect of the estimation error of the non-stationary noise characteristics. This turns out to be the major factor that leads to the good performance of the proposed algorithm in non-stationary noise environments. The use of the ℓ1 norm in the penalty term in (21) is based on the assumption that the cepstral coefficients have a Laplacian prior. From (20), the optimum can be obtained by taking the derivative of the right-hand side of (20) and setting the result to 0. That is, let

    (23)

Then

    (24)

where sign(x) returns 1, 0 or -1 if x is positive, zero or negative, respectively. Hence the optimal point is the one at which (24) vanishes. It is then used as the new estimate. That is,

    (25)

(25) is indeed the well-known soft-thresholding non-linearity [30], [33] with an additional constraint.

To implement (25), we need a good current estimate of the clean speech periodogram. For the proposed algorithm, it is obtained as follows:

    (26)

with the TCS-based estimate used for the initial guess and the logmmse-refined estimate used for subsequent iterations, where dft and idft are the discrete Fourier transform and its inverse, respectively. In both cases, bias is removed using the approach in [29] whenever transforming data between the cepstral domain and the spectral domain. As shown in (26), we adopt the TCS method [9] to obtain the initial guess at iteration i = 0. Afterwards, we update the estimate using a logmmse gain function [5] in which the a-priori SNR is computed based on the current estimate. The logmmse gain function theoretically gives the minimum mean square error estimate of the log-magnitude spectra. From [5], we know that

    (27)

Hence,

    (28)

It can be seen in (28) that the logmmse gain function can give a good estimate of the clean speech periodogram. However, we do not use it for the initial guess since, without a good a-priori SNR estimator, the logmmse gain function can accidentally remove speech spectral components of low SNR. This is particularly the case if the noise is non-stationary. The TCS method gives a reasonably good estimate without the need for a very good a-priori SNR estimator. It also works reasonably well for non-stationary noises. It is thus used as the initial guess and is afterwards refined by the logmmse gain function using the a-priori SNR estimate obtained by the M-step of the proposed algorithm.
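The soft-thresholding non-linearity referred to in (25) has the standard closed form below. The paper's additional side constraint is omitted here, so this is only the unconstrained textbook operator, with illustrative names.

```python
import numpy as np

def soft_threshold(z, t):
    """Classic soft-thresholding non-linearity: shrink each coefficient
    toward zero by the threshold t, setting to zero anything whose
    magnitude does not exceed t. This is the closed-form solution of an
    l1-regularized least-squares step."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```

Applied to a cepstrum, this keeps only coefficients whose magnitude exceeds the threshold, which is how the sparsity prior suppresses noise-dominated quefrencies.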

To summarize, the proposed speech enhancement algorithm based on the new EM framework can be described as follows:

A. Initial guess: Compute the initial guess of the clean speech periodogram using the TCS method (see (26)).
B. M-step: Estimate the cepstral parameters using the constrained soft-thresholding method (see (25)).
C. E-step: Refine the estimation using the logmmse gain function with the estimate obtained from the M-step (see (26)).
D. Iterate the M-step and E-step alternately until convergence is reached. The enhanced speech can then be obtained by

    (29)

where the phase angle of the noisy spectrum Y(k, λ) is used in the synthesis. The operation of the algorithm is also described in Fig. 1.

Besides those required in the original TCS method [9], the proposed algorithm has very few free parameters. Once these parameters are set, they can be used for speeches of different genders and at different noise levels, as is the case in our simulations. More specifically, the setting of the free parameters in the TCS method can be found in [9]. The parameters for determining the soft threshold in (25) can be obtained as follows: (i) the variance value can be found in (16); (ii) the setting of the threshold requires the parameter in (22), which is set as 0.025M in our simulations. The remaining quantity in (22) can be obtained from the TCS method given by [9] when generating the initial guess. Finally, the regularization parameter is set as

    (30)

where the term involved is the variance of the estimate and is approximated by

    (31)

We show in the Appendix that, by setting the regularization parameter as in (30), the soft-thresholding operation in (25) indeed achieves a good approximation of the maximum a-posteriori (MAP) estimate. We would like to emphasize that, due to the initial guess and the additional constraints applied to the algorithm, the proposed algorithm need not be iterated many times to achieve satisfactory performance. In our experiments, iterating 3 times already gives very good results. Iterating further in most cases will only increase the computation time without further improving the performance. Hence it is not recommended.

V. SIMULATIONS AND RESULTS

In this section, the performance of the proposed algorithm is shown and compared with those achieved by state-of-the-art speech enhancement methods. To start with, we use an example to illustrate the deficiency of the traditional TCS method and how it is solved by the proposed algorithm. For ease of presentation, let us first denote the proposed algorithm as Logmmse-L1-EM, which represents the two major operations (logmmse gain function and ℓ1 norm regularization) used in the new EM framework.

Fig. 2. A comparison of the traditional TCS method and the proposed Logmmse-L1-EM algorithm.

Fig. 3. The result of the proposed Logmmse-L1-EM algorithm after each iteration.

Fig. 2 shows a segment of a typical noisy speech periodogram (red line), its original clean speech periodogram (black line), and the enhanced speech periodograms using the traditional TCS (green line) and the proposed Logmmse-L1-EM algorithm (yellow line). It can be seen that the noisy periodogram contains a sharp noise spectral peak at frequency index about 170. The TCS method tries to remove the noise spectral peak by reducing the variance of the cepstral coefficients. It is, however, only partially successful and leaves behind a rather strong noise spectral peak. In addition, the TCS method over-smoothes the speech spectral peaks at indices near 100 and 120. It is clear that a fixed set of smoothing parameters can hardly handle noisy speech spectral components equally well if they differ greatly in SNR. On the contrary, the proposed Logmmse-L1-EM algorithm gives extremely good performance in removing the noise spectral peak while keeping the speech spectral peaks. This is due to the iterative process through EM, by which the noise spectral peak is reduced successively in each iteration while the speech spectral peaks are also gradually adjusted. Fig. 3 illustrates how the proposed algorithm refines the estimation in each iteration.

The results in Fig. 2 and Fig. 3 clearly explain why the proposed Logmmse-L1-EM algorithm can provide a better performance than the original TCS method and the traditional logmmse method.

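Steps A-D above can be sketched as a loop over power spectra. The initial guess, threshold value and gain rule below are simplified stand-ins (a spectral-subtraction guess, a fixed threshold and a Wiener-type gain) for the TCS initial guess, the adaptive threshold of (30) and the logmmse gain of [5]; all names are illustrative.

```python
import numpy as np

def logmmse_l1_em_sketch(noisy_psd, noise_psd, n_iter=3):
    """Skeleton of the M-step/E-step alternation: an initial clean-speech
    PSD guess is refined by (i) soft-thresholding its cepstrum to enforce
    sparsity (M-step stand-in) and (ii) re-applying a gain derived from
    the refined a-priori SNR to the noisy PSD (E-step stand-in)."""
    # Initial guess: crude spectral subtraction, floored to stay positive
    s_psd = np.maximum(noisy_psd - noise_psd, 1e-3 * noisy_psd)
    for _ in range(n_iter):
        # M-step stand-in: sparsify the cepstrum of the log PSD
        cep = np.fft.irfft(np.log(s_psd))
        cep = np.sign(cep) * np.maximum(np.abs(cep) - 0.05, 0.0)
        s_psd = np.exp(np.fft.rfft(cep).real)
        # E-step stand-in: gain from the refined a-priori SNR
        xi = s_psd / noise_psd
        gain = xi / (1.0 + xi)
        s_psd = (gain ** 2) * noisy_psd
    return s_psd
```

As in the paper's description, a small fixed number of iterations (here 3) is used; each pass re-imposes the cepstral sparsity constraint and updates the a-priori SNR before the gain is re-applied.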
mmse method. As mentioned above, the TCS method cannot


take care of noisy components with great difference in SNR.
However, it serves as a good initial guess of the clean speech
power spectrum since it is less sensitive to the accuracy of the
a-priori SNR estimation. The initial guess is then applied to the
norm regularizer in the M-step of the proposed EM frame-
work, which is indeed a constraint soft thresholding process.
Besides a good MAP estimation of the true power spectrum of
the clean speech, the applied constraint fully utilizes the spar-
sity feature of speech signals in the cepstral domain. Noises, no
matter stationary or non-stationary, will be rejected since most
of them cannot fulfill this constraint. For every iteration the pro-
posed algorithm performs, the same constraint is imposed to
the observed noisy data and gradually improves the estimation,
which is shown in Fig. 3. The M-step of the proposed EM al-
gorithm enables a gradually improved a-priori SNR estimate
for use in the E-step of the proposed algorithm, which is just
the traditional logmmse method. With a good a-priori SNR, the
logmmse method can often give a good estimate to the original
clean speech periodogram for the M-step again.
A comparison of the spectrograms of the enhanced speeches generated by the different algorithms under different colored noises at the input is shown in Fig. 4 and Fig. 5. Table I gives a summary of the algorithms compared. We start with the case where the speech is contaminated by pink noise, which is a relatively stationary noise. Fig. 4a shows the clean speech spectrogram of a female speech selected from the TIMIT database [25] saying the following sentence: "She had your dark suit in greasy wash water all year." Fig. 4b shows the result when pink noise is added to the speech with input segSNR of about 5 dB. Fig. 4c depicts the spectrogram using the traditional TCS method. It can be seen that although the TCS method can recover much of the speech content, its noise control is not sufficient and strong residual background noise remains in the enhanced speech. Fig. 4d shows the spectrogram using the MMSE log-spectral amplitude (MMSE-LSA) method plus SPP with fixed prior [7], which is a relatively recent logmmse estimator enhanced by a special speech presence probability function. It has better control of the background noise; however, it also removes speech content, and the intelligibility of the enhanced speech is reduced. Furthermore, musical noise appears, particularly at the beginning of the speech. Fig. 4e shows the spectrogram given by a combination of the TCS method and the MMSE-LSA plus SPP [11]. In that approach, the TCS is used for the estimation of the a-priori SNR, which is then used in the MMSE-LSA and the SPP estimation. It has better noise control compared with that in Fig. 4c and Fig. 4d. However, some speech content is removed, as indicated in the circled areas. It seems that the speech content removed by the SPP estimator cannot be recovered even though the TCS method is used. Fig. 4f shows the spectrogram using the proposed Logmmse-L1-EM algorithm. It has very good background noise control, and the speech content is also better preserved, as indicated in the circled areas.

Fig. 4. Spectrograms of the original, noisy, and enhanced speeches (pink noise): (a) original speech; (b) noisy speech (noise: pink, dB segSNR); (c) traditional TCS method [9]; (d) MMSE-LSA plus SPP with fixed prior [7]; (e) MMSE-LSA plus SPP and TCS [11]; (f) the proposed Logmmse-L1-EM.

Fig. 5 shows the case where the speech is contaminated by buccaneer noise, which is a non-stationary noise such that two tones of varying frequencies, together with other background noises, are added to the speech. The noisy speech is shown in Fig. 5b. When speeches are contaminated by non-stationary noises, the traditional decision-directed approach generates large errors when estimating the a-priori SNR. Hence the performance of the traditional MMSE-LSA approach, although using SPP, will not be good. As can be seen in Fig. 5d, the strong tones cannot be removed and are quite annoying to the human auditory system. The TCS method improves slightly, as shown in Fig. 5c. The combined TCS and MMSE-LSA approach gives a better result, as can be seen in Fig. 5e. However, in both cases the tones still cannot be sufficiently removed while the background noises remain. The result of the proposed Logmmse-L1-EM algorithm is illustrated in Fig. 5f. It is clear that the time-varying tones are largely suppressed while the speech content is well preserved, comparable with that using the TCS method. The background noise is also significantly reduced. The overall performance is very promising.

The performance of the proposed Logmmse-L1-EM algorithm is further evaluated using standard evaluation measures. A series of simulations have been performed to compare the performance of the following approaches: MMSE-LSA [5], MMSE-Gamma [34], MMSE-LSA plus SPP with fixed prior [7], MMSE-LSA plus TCS method [11], and the proposed Logmmse-L1-EM algorithm. The speech sampling rate is 16 kHz. Simulation details are as follows: frame size 512 samples (32 ms), FFT size 1024 samples (each frame zero-padded with 512 samples), window shift step size 128 samples (75% overlap).
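The analysis front end just described (512-sample frames, 1024-point FFT with 512 zero-padded samples, a 128-sample hop giving 75% overlap, at 16 kHz) can be sketched as below; the Hann window is an assumption on my part, since the window type is not restated here.

```python
import numpy as np

FS = 16000    # sampling rate (Hz)
FRAME = 512   # frame size: 512 samples = 32 ms at 16 kHz
NFFT = 1024   # FFT size: each 512-sample frame is zero-padded to 1024
HOP = 128     # shift step: 128 samples, i.e. 75% frame overlap

def stft_frames(x):
    """Return the one-sided spectra of the overlapped, windowed frames."""
    window = np.hanning(FRAME)          # window choice is an assumption
    n_frames = 1 + (len(x) - FRAME) // HOP
    spectra = np.empty((n_frames, NFFT // 2 + 1), dtype=complex)
    for i in range(n_frames):
        frame = x[i * HOP : i * HOP + FRAME] * window
        spectra[i] = np.fft.rfft(frame, n=NFFT)  # zero-pads to NFFT
    return spectra
```

Zero-padding to 1024 points does not add information; it interpolates the spectrum, which helps the subsequent cepstral processing resolve the harmonic structure of voiced frames.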
For all algorithms, the noise power spectrum is estimated by first using the initial frames, which are assumed to have no speech energy; it is then updated whenever a frame is detected to have no speech energy by a VAD [26]. For the algorithms using the logmmse gain function, the lower limit of the a-priori SNR is set at -25 dB, which helps mask musical noise and limits speech distortion.

In the simulation, 40 male and 40 female test speeches were arbitrarily selected from the TIMIT database [25]. The noise signals were adopted from the NOISEX-92 database [35] and added to the speeches with input segmental signal-to-noise ratio (segSNR) ranging from about dB to dB. The resulting enhanced speeches generated by all algorithms were evaluated using standard measures, including (i) the segmental signal-to-noise ratio (segSNR) [1], and (ii) the perceptual evaluation of speech quality (PESQ), an ITU standard for evaluating speech quality [36]. The results are shown in Fig. 6. It can be seen that the performance of the proposed algorithm is the best in all cases. For instance, compared with the MMSE-LSA plus TCS approach [11], the proposed algorithm always gives an improvement in segSNR and PESQ score for all noise signals. More specifically, for the pink noise case, the average improvement in segSNR and PESQ is about 0.9 dB and 0.1, respectively. For the buccaneer noise case, the average improvement in segSNR and PESQ is about 0.85 dB and 0.12, respectively. Similar results can be found in Fig. 6 for the other kinds of noise contamination.

To predict the quality of noisy speech enhanced by noise suppression algorithms, three composite objective metrics [1] are often used in the literature: (a) CSIG: signal distortion (SIG), formed by linearly combining the LLR, PESQ, and WSS measures; (b) CBAK: noise distortion (BAK), formed by linearly combining the segSNR, PESQ, and WSS measures; and (c) COVL: overall quality (OVL), formed by linearly combining the PESQ, LLR, and WSS measures. Table II lists the performance of the 5 algorithms for noisy speech signals contaminated by 6 different kinds of noise, both stationary and non-stationary, at input SNR ranging from 0 dB to 20 dB. Concurring with the results in segSNR and PESQ, the proposed Logmmse-L1-EM algorithm always outperforms the other 4, and in many cases the improvement is significant. These results demonstrate the robustness of the proposed algorithm. Its performance is consistent when enhancing speeches made by people of different genders and contaminated by different kinds of noise at different noise levels.

Fig. 5. Spectrograms of the original, noisy, and enhanced speeches (buccaneer noise): (a) original speech; (b) noisy speech (noise: buccaneer1, dB segSNR); (c) traditional TCS method [9]; (d) MMSE-LSA plus SPP with fixed prior [7]; (e) MMSE-LSA plus SPP and TCS [11]; (f) the proposed Logmmse-L1-EM.

TABLE I. Summary of the algorithms compared in the simulations.

VI. CONCLUSION AND FUTURE WORK

In this paper, an improved speech enhancement algorithm based on a novel expectation-maximization framework is proposed. The new algorithm makes use of the EM algorithm to define a theoretical framework for the estimation of the true power spectrum of the original speech and its periodogram from a noisy observation. The proposed algorithm starts with the traditional cepstrum smoothing method, which gives the initial guess of the periodogram of the clean speech. This initial guess is then applied to an L1 norm regularizer in the M-step of the EM framework to estimate the cepstral coefficients of the true speech power spectrum. It enables the estimation of the a-priori SNR, which is used in the E-step, indeed a logmmse gain function, to refine the estimate of the clean speech periodogram. The M-step and E-step then iterate for 2 more times, which we have shown to be sufficient in most cases to achieve a good result. The proposed algorithm fully utilizes the sparsity of speeches in the cepstral domain by adopting the L1 norm regularizer. It enables the optimization process to be carried out on coefficients with improved SNR and hence reduces the effect due to the estimation error of the non-stationary noise characteristics. As a result, the proposed algorithm works particularly well when the input speech is contaminated by non-stationary noises. Besides, due to the iterative process, the proposed algorithm has very good control of the residual background noises, which makes it outperform the traditional methods. Simulation results have
verified that the proposed algorithm improves over the competing speech enhancement methods from the literature in almost all testing conditions, i.e., for different kinds of noise at different noise levels using different evaluation measures. They have clearly demonstrated the robustness of the proposed algorithm in general speech enhancement applications. For future work, the methods for enhancing the unvoiced speech deserve further consideration. It is known that, on average, over 20% of a normal speech is unvoiced. However, both the cepstral method and the MMSE method may not be the best candidates for its enhancement. This is because unvoiced speeches do not have a harmonic structure, and they have a noise-like statistical property that can easily be classified as noise by MMSE based methods and thus suppressed. A different algorithm may be needed to work together with the proposed algorithm in order to take care of both the voiced and unvoiced parts of a speech in an enhancement process.

Fig. 6. (Left) Segmental SNR improvement and (Right) PESQ improvement over the noisy speech achieved by using MMSE-LSA [5], MMSE-Gamma [34], MMSE-LSA FP SPP [7], MMSE-LSA TCS SPP [11], and the proposed Logmmse-L1-EM, for the cases of white noise, pink noise, destroyer engine noise, F16 noise, buccaneer noise, and babble noise contamination.

APPENDIX
MAP ESTIMATION OF $c$

Let $c$ denote a cepstral coefficient of the clean speech and $\hat{c}$ its observed noisy estimate. Given $\hat{c}$, the MAP estimator of $c$ is

$\tilde{c} = \arg\max_{c} p(c \mid \hat{c})$.   (32)

We obtain $p(c \mid \hat{c})$ by using the Bayes rule,

$p(c \mid \hat{c}) = \dfrac{p(\hat{c} \mid c)\, p(c)}{p(\hat{c})}$.   (33)

Since the value of $c$ that maximizes the right-hand side is not influenced by the denominator, the MAP estimate of $c$ can be rewritten as

$\tilde{c} = \arg\max_{c} p(\hat{c} \mid c)\, p(c)$.   (34)

The logarithm function can be applied to (34) because it is monotonic. Hence,

$\tilde{c} = \arg\max_{c} \left[ \log p(\hat{c} \mid c) + \log p(c) \right]$.   (35)

As $\hat{c}$ is normally distributed with mean $c$ and variance $\sigma^2$ [29],

$p(\hat{c} \mid c) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\dfrac{(\hat{c}-c)^2}{2\sigma^2} \right)$.   (36)

By using (36), (35) becomes

$\tilde{c} = \arg\max_{c} \left[ -\dfrac{(\hat{c}-c)^2}{2\sigma^2} + \log p(c) \right]$.   (37)

Let us define $f(c) = \log p(c)$. Then we have

$\tilde{c} = \arg\max_{c} \left[ -\dfrac{(\hat{c}-c)^2}{2\sigma^2} + f(c) \right]$.   (38)

We can obtain the MAP estimate of $c$ by taking the derivative of the terms in the square bracket in (38) with respect to $c$ and setting it to zero. Then,

$\dfrac{\hat{c}-\tilde{c}}{\sigma^2} + f'(\tilde{c}) = 0$.   (39)

We now need the prior $p(c)$, i.e. the distribution of the cepstral coefficients $c$ of the clean speech. In Fig. 7, we show the pdf $p(c)$ obtained from the cepstral coefficients of 40 male and 40 female test speeches from the TIMIT database [25], and fit it with different distributions. It is seen that it can be modeled
by a Laplacian, or Gaussian, or GGD distribution without large error. It is highly likely that this will also be the case for the individual cepstral coefficients.

TABLE II. Composite measurement comparison of different algorithms.

Assume that $p(c)$ is modeled using a Laplacian pdf:

$p(c) = \dfrac{1}{2b} \exp\!\left( -\dfrac{|c|}{b} \right)$.   (40)

In this case,

$f(c) = \log p(c) = -\dfrac{|c|}{b} - \log(2b)$.   (41)

As a result,

$f'(c) = -\dfrac{\operatorname{sgn}(c)}{b}$.   (42)

Putting (42) into (39), we have

$\tilde{c} = \hat{c} - \dfrac{\sigma^2}{b}\operatorname{sgn}(\tilde{c})$.   (43)

Therefore, $\tilde{c}$ as a function of $\hat{c}$ is given by

$\tilde{c} = \operatorname{sgn}(\hat{c}) \max\!\left( |\hat{c}| - \dfrac{\sigma^2}{b},\, 0 \right)$.   (44)

This is the soft-threshold nonlinearity. The threshold $T$ in this case is given by

$T = \dfrac{\sigma^2}{b}$.   (45)
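Writing sigma2 for the variance of the observed coefficient and b for the Laplacian scale (names assumed here for illustration), the soft-threshold rule of (44)-(45) can be sketched and sanity-checked as follows:

```python
import numpy as np

def soft_threshold(c_hat, sigma2, b):
    """MAP estimate of a coefficient observed in Gaussian noise of
    variance sigma2 under a zero-mean Laplacian prior of scale b:
    shrink the observation toward zero by the threshold T = sigma2 / b."""
    T = sigma2 / b
    return np.sign(c_hat) * np.maximum(np.abs(c_hat) - T, 0.0)
```

For any output not clipped to zero, the residual c_hat minus the estimate equals (sigma2 / b) times the sign of the estimate, which is exactly the stationary condition obtained from the derivative step leading to (43).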
This threshold is similar to that in (25) except for the omission of the constraint. To summarize, the soft-thresholding operation as defined in (25) is a good approximation of the MAP estimate of $c$ under the assumption that $c$ has a Laplacian prior.

Fig. 7. The pdf of the cepstral coefficients of 40 male and 40 female test speeches from the TIMIT database [25]: the pdf approximated directly from the speeches (red dotted line), and the pdf fit by a Laplacian model (blue solid line), a Gaussian model (green dashed line), and a GGD model (black dash-dot line).

REFERENCES

[1] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL, USA: CRC, 2007.
[2] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113-120, Apr. 1979.
[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error short time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.
[4] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, no. 12, pp. 1586-1604, Dec. 1979.
[5] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33, no. 2, pp. 443-445, Apr. 1985.
[6] I. Cohen, "Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator," IEEE Signal Process. Lett., vol. 9, no. 4, pp. 113-116, Apr. 2002.
[7] T. Gerkmann, C. Breithaupt, and R. Martin, "Improved a posteriori speech presence probability estimation based on a likelihood ratio with fixed priors," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 5, pp. 910-919, Jul. 2008.
[8] D. P. K. Lun, T. W. Shen, T. C. Hsung, and K. C. Ho, "Wavelet based speech presence probability estimator for speech enhancement," Digital Signal Process., vol. 22, no. 6, pp. 1161-1173, Dec. 2012.
[9] C. Breithaupt, T. Gerkmann, and R. Martin, "A novel a priori SNR estimation approach based on selective cepstro-temporal smoothing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2008, pp. 4897-4900.
[10] C. Breithaupt, T. Gerkmann, and R. Martin, "Cepstral smoothing of spectral filter gains for speech enhancement without musical noise," IEEE Signal Process. Lett., vol. 14, no. 12, pp. 1036-1039, Dec. 2007.
[11] T. Gerkmann, M. Krawczyk, and R. Martin, "Speech presence probability estimation based on temporal cepstrum smoothing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2010, pp. 4254-4257.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc., vol. 39, no. 1, pp. 1-38, 1977.
[13] L. A. Shepp and Y. Vardi, "Maximum likelihood reconstruction for emission tomography," IEEE Trans. Med. Imag., vol. 1, no. 2, pp. 113-122, Oct. 1982.
[14] D. L. Snyder and D. G. Politte, "Image reconstruction from list-mode data in an emission tomography system having time-of-flight measurements," IEEE Trans. Nucl. Sci., vol. NS-30, no. 3, pp. 1843-1849, Jun. 1983.
[15] J. Zhou, J. Coatrieux, A. Bousse, H. Shu, and L. Luo, "A Bayesian MAP-EM algorithm for PET image reconstruction using wavelet transform," IEEE Trans. Nucl. Sci., vol. 54, no. 5, pp. 1660-1669, Oct. 2007.
[16] J. Tang, T. S. Lee, X. He, W. P. Segars, and B. M. W. Tsui, "Comparison of 3D OS-EM and 4D MAP-RBI-EM reconstruction algorithms for cardiac motion abnormality classification using a motion observer," IEEE Trans. Nucl. Sci., vol. 57, no. 5, pp. 2571-2577, Oct. 2010.
[17] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Statist., vol. 41, no. 1, pp. 164-171, Feb. 1970.
[18] L. R. Welch, "Hidden Markov models and the Baum-Welch algorithm," IEEE Inf. Theory Soc. Newslett., vol. 53, no. 4, pp. 1-13, 2003.
[19] D. Y. Zhao and W. B. Kleijn, "HMM-based gain modeling for enhancement of speech in noise," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 3, pp. 882-892, Mar. 2007.
[20] J. Hao, H. Attias, S. Nagarajan, T.-W. Lee, and T. J. Sejnowski, "Speech enhancement, gain, and noise spectrum adaptation using approximate Bayesian estimation," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 1, pp. 24-37, Jan. 2009.
[21] S. Gannot, D. Burshtein, and E. Weinstein, "Iterative and sequential Kalman filter-based speech enhancement algorithms," IEEE Trans. Speech, Audio Process., vol. 6, no. 4, pp. 373-385, Jul. 1998.
[22] S. Park and S. Choi, "A constrained sequential EM algorithm for speech enhancement," Neural Netw., vol. 21, pp. 1401-1409, Nov. 2008.
[23] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Process. Mag., vol. 13, no. 6, pp. 47-60, Nov. 1996.
[24] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, 2nd ed. New York, NY, USA: Wiley, 2008.
[25] J. S. Garofolo, "Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database," Nat. Inst. Standards Technol. (NIST), Gaithersburg, MD, USA, prototype as of Dec. 1988.
[26] A. Davis, S. Nordholm, and R. Togneri, "Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp. 412-424, Mar. 2006.
[27] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, "The quefrency alanysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking," in Proc. Time Series Anal., M. Rosenblatt, Ed., 1963, ch. 15, pp. 209-243.
[28] Y. Ephraim and W. Roberts, "On second-order statistics of log-periodogram with correlated components," IEEE Signal Process. Lett., vol. 12, pp. 625-628, Sep. 2005.
[29] P. Stoica and N. Sandgren, "Smoothed nonparametric spectral estimation via cepstrum thresholding," IEEE Signal Process. Mag., vol. 23, no. 6, pp. 34-45, Nov. 2006.
[30] B. Kaltenbacher, A. Neubauer, and O. Scherzer, Iterative Regularization Methods for Nonlinear Ill-Posed Problems. Berlin, Germany: Walter de Gruyter, 2008.
[31] M. Figueiredo and R. Nowak, "An EM algorithm for wavelet-based image restoration," IEEE Trans. Image Process., vol. 12, no. 8, pp. 906-916, Aug. 2003.
[32] N. Cao, A. Nehorai, and M. Jacob, "Image reconstruction for diffuse optical tomography using sparsity regularization and expectation-maximization algorithm," Opt. Express, vol. 15, no. 21, pp. 13695-13708, Oct. 2007.
[33] D. L. Donoho, "De-noising by soft-thresholding," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 613-627, May 1995.
[34] J. S. Erkelens, R. C. Hendriks, R. Heusdens, and J. Jensen, "Minimum mean-square error estimation of discrete Fourier coefficients with generalized Gamma priors," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1741-1752, Aug. 2007.
[35] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition II: NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., vol. 12, no. 3, pp. 247-251, Jul. 1993.
[36] Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, ITU-T Rec. P.862, 2000.

Daniel P. K. Lun (Senior Member, IEEE) received his B.Sc. (Hons.) degree from the University of Essex, U.K., and Ph.D. degree from the Hong Kong Polytechnic University (formerly called Hong Kong Polytechnic) in 1988 and 1991, respectively. He is now an Associate Professor and Acting Head of the Department of Electronic and Information Engineering of the Hong Kong Polytechnic University. Dr. Lun is active in research activities. He has published more than 120 international journal and conference papers. His research interests include speech and image enhancement, wavelet theories and applications, as well as multimedia technologies. He was the Chairman of the IEEE Hong Kong Chapter of Signal Processing in 1999-00. He was the Finance Chair of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing, the General Chair of the 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing (ISIMP2004), and the Finance Chair of the 2010 International Conference on Image Processing (ICIP2010). Dr. Lun is a member of the DSP and Visual Signal Processing and Communications Technical Committees of the IEEE Circuits and Systems Society. He is a Chartered Engineer, a corporate member of IET and HKIE, and a senior member of IEEE.

Tak-Wai Shen (Student Member, IEEE) received the B.Eng. (Hons.) degree in Electronic Engineering from Oxford Brookes University, England, and the M.Phil. degree in Electronic Engineering from the Hong Kong Polytechnic University, Hong Kong, in 1992 and 1998, respectively. He is currently working towards the Ph.D. degree at the Department of Electronic and Information Engineering, the Hong Kong Polytechnic University. His research interests include digital signal and image processing, fast algorithms and speech enhancement. He is a member of the IEEE, USA, and IET, UK.

K. C. Ho (M'91-SM'00-F'09) was born in Hong Kong. He received the B.Sc. degree with First Class Honors in 1988 and the Ph.D. degree in 1991, both from the Chinese University of Hong Kong, Hong Kong. He was a research associate in the Royal Military College of Canada from 1991 to 1994. He joined Bell-Northern Research, Montreal, Canada, in 1995 as a member of scientific staff. He was a faculty member in the Department of Electrical Engineering at the University of Saskatchewan, Saskatoon, Canada, from September 1996 to August 1997. Since September 1997, he has been with the University of Missouri and is currently a Professor in the Electrical and Computer Engineering Department. His research interests are in sensor array processing, speech processing, source localization, detection and estimation, and wireless communications. Dr. Ho is the Chair of the Sensor Array and Multichannel Technical Committee in the IEEE Signal Processing Society. He was the Associate Rapporteur of the ITU-T Q16/SG16: Speech Enhancement Functions in Signal Processing Network Equipment in 2013 and was the Rapporteur of the ITU-T Q15/SG16: Voice Gateway Signal Processing Functions and Circuit Multiplication Equipment/Systems from 2009 to 2012. He has served as the Editor of the ITU-T Standard Recommendations G.160: Voice Enhancement Devices and G.168: Digital Network Echo Cancellers for over seven years. He was an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING from 2003 to 2006 and from 2009 to 2013, and the IEEE Signal Processing Letters from 2004 to 2008. Dr. Ho received the Junior Faculty Research Award in 2003 and the Senior Faculty Research Award in 2009 from the College of Engineering at the University of Missouri. He is an inventor of 20 patents in the United States, Europe, Asia and Canada on geolocation and signal processing for wireless communications.
