
Speech Communication 87 (2017) 49–57

Contents lists available at ScienceDirect

Speech Communication
journal homepage: www.elsevier.com/locate/specom

Speech dereverberation using weighted prediction error with correlated inter-frame speech components

Mahdi Parchami a,∗, Wei-Ping Zhu a, Benoit Champagne b

a Department of Electrical and Computer Engineering, Concordia University, Montreal, Quebec, Canada
b Department of Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada

a r t i c l e   i n f o

Article history:
Received 18 June 2016
Revised 11 November 2016
Accepted 5 January 2017
Available online 6 January 2017

Keywords:
Inter-frame correlation
Multi-channel linear prediction (MCLP)
Speech dereverberation
Speech enhancement

a b s t r a c t

In this paper, we propose a new dereverberation approach based on the weighted prediction error (WPE) method implemented in the short-time Fourier transform (STFT) domain. Our main contribution is to model the temporal correlation of the STFT coefficients across analysis frames, referred to as inter-frame correlation (IFC), and exploit it in the dereverberation process. Since accurate modeling of the IFC is not tractable, we consider an approximate model wherein only a finite number of consecutive speech frames are considered correlated. It is shown that, given an estimate of the IFC matrix, the proposed approach results in a convex quadratic optimization problem with respect to the reverberation prediction weights, and a closed-form solution can be accordingly derived. Furthermore, an efficient method for the estimation of the underlying IFC matrix is developed based on the extension of a recently proposed speech variance estimator. We evaluate the performance of our approach incorporating the estimated IFC matrix and compare it to the original and several variants of the WPE method. The results reveal lower residual reverberation and higher overall quality of the enhanced speech when the proposed method is employed.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Speech signals captured by distant microphones in an acoustic environment are subject to reflections from the surrounding surfaces and objects, such as walls or furniture. The consequence of this phenomenon, known as reverberation, can deteriorate the perceived quality/intelligibility of the captured speech and also can degrade to a large extent the performance of speech communication systems such as hearing aids, hands-free teleconferencing, source separation, passive source localization, and automatic speech recognition systems (Naylor and Gaubitch, 2010; Yoshioka et al., 2012). Therefore, to improve the overall quality of speech signals in these important applications, highly efficient reverberation suppression algorithms are required.

Over the past three decades, various single- and multi-microphone dereverberation techniques have been developed, which can be broadly classified into: blind system identification and inversion, multi-channel spatial processing, spectral enhancement and probabilistic model-based approaches (Naylor and Gaubitch, 2010; Habets, 2007). In addition, a few alternative dereverberation approaches have emerged recently, such as those based on the use of expectation-maximization (EM) and Kalman filtering algorithms, as reported in Schmid et al. (2012) and Togami and Kawaguchi (2013). Among all these techniques, the model-based statistical approaches, which seek to optimally estimate the anechoic speech, have attracted considerable interest, as further discussed below.

In Attias et al. (2001), probabilistic models of speech were incorporated into a variational Bayesian EM algorithm which estimates the source signal, the acoustic channel and all the involved parameters in an iterative manner. A different strategy was followed in Yoshioka et al. (2009), where the parameters of all-pole models for both the desired speech signal and the reverberation component are iteratively determined by maximizing the likelihood function of the model parameters through an EM approach. In this way, a minimum mean-squared error (MMSE) estimator is derived that yields the enhanced speech.

Abbreviations: ATF, acoustic transfer function; AR, auto-regressive; CD, cepstrum distance; CGG, complex generalized Gaussian; DRR, direct to reverberant ratio; EM, expectation-maximization; FW-SNR, frequency-weighted segmental SNR; IFC, inter-frame correlation; ISM, image source method; LRSV, late reverberation spectral variance; LPC, linear prediction coefficients; ML, maximum likelihood; MMSE, minimum mean-squared error; MCLP, multi-channel linear prediction; PESQ, perceptual evaluation of speech quality; RIR, room impulse response; STFT, short-time Fourier transform; SRMR, signal to reverberation modulation energy ratio; SNR, signal to noise ratio; WPE, weighted prediction error.

∗ Corresponding author.
E-mail addresses: [email protected] (M. Parchami), weiping@ece.concordia.ca (W.-P. Zhu), [email protected] (B. Champagne).

http://dx.doi.org/10.1016/j.specom.2017.01.004
0167-6393/© 2017 Elsevier B.V. All rights reserved.

In a similar way, by using a time-varying statistical model for the speech and a multi-channel linear prediction (MCLP) model for the reverberation, an efficient dereverberation approach has been developed in Nakatani et al. (2008a) and Kinoshita et al. (2009). Since the implementation of such methods in the time domain is computationally expensive, it was proposed in Nakatani et al. (2008b, 2010) to employ the MCLP-based method in the short-time Fourier transform (STFT) domain. The resulting approach, referred to as the weighted prediction error (WPE) method, is an iterative algorithm that alternately estimates the reverberation prediction coefficients and the speech spectral variance, using batch processing of the speech utterance. Basically, the WPE method and its variants consider temporally/spectrally independent speech components in the STFT domain. This assumption, despite greatly simplifying the derivation and application of the WPE method, is inaccurate and lacks the modeling of inherent dependencies across time frames and spectral components at each time frame. In Erkelens and Heusdens (2010), it was shown that the STFT coefficients of anechoic speech exhibit significant correlation in time, even with frame overlaps of less than 50%. This correlation, referred to as inter-frame correlation (IFC), is further pronounced in the case of highly reverberant speech, due to the convolutive nature of the reverberation. In Habets et al. (2012), in the context of multi-microphone noise reduction in a reverberant environment, it was demonstrated that the achievable performance in terms of noise reduction and speech distortion can be further improved by exploiting the IFC. The noise reduction problem using the IFC has also been addressed partially in Esch (2012) where, in the propagation step of a noise reduction method based on the Kalman filter, a complex-valued prediction weight is used to exploit the temporal correlation of successive speech and noise STFT coefficients. However, similar to Habets et al. (2012), this work assumes perfect knowledge of the theoretical IFC in the derivation of the various enhancement algorithms. In summary, the IFC has not been fully explored in the context of STFT-domain speech enhancement, and the accurate modeling and application of the speech IFC remain an attractive area for future research, especially in the context of dereverberation, where the channel impulse responses are characterized by long memory (Vaseghi, 2006).

In this work, in order to take into account the considerable IFC present in the desired speech (due to the speech characteristics, STFT framing overlaps and heavy reverberation), we reformulate the WPE method through the introduction of an approximate model for the joint probability distribution of the desired speech STFT coefficients within finite segments, each consisting of consecutive frames. Following an ML approach similar to the original WPE method, it is shown that the resulting dereverberation problem leads to a convex optimization problem with a closed-form solution for the reverberation prediction weights, which can be obtained efficiently in a single step, unlike the original WPE method whose solution requires an iterative procedure. In addition, for the estimation of the underlying IFC matrix of the desired speech component, an extension of the speech spectral variance estimation method in Parchami et al. (2016) is proposed. The proposed method efficiently eliminates the reverberant component from the observed speech prior to the estimation of the cross-spectral variance of the desired speech, which is performed by a first-order smoothing scheme. Finally, we evaluate the performance of our approach incorporating the estimated IFC matrix and compare it to the original and several variants of the WPE method. The results reveal lower residual reverberation and higher overall quality of the enhanced speech when the proposed method is employed.

The remainder of this paper is organized as follows. In Section 2, a brief overview of the WPE method is presented. In Section 3, a closed-form solution for the optimum reverberation prediction weights in the WPE method with IFC is developed and a novel technique for the estimation of the IFC matrix is presented. The objective performance evaluation of the proposed approach using different types of reverberant speech signals is discussed in Section 4. Finally, a brief conclusion is given in Section 5.

2. A brief review of the WPE method

Suppose that a speech signal emanating from a single source is captured by M microphones located in a reverberant enclosure. In the STFT domain, we denote the clean speech signal by s_{n,k} with time frame index n ∈ {1, ..., N} and frequency bin index k ∈ {1, ..., K}, where N is the total number of frames and K is the number of available frequency bins. Then, the reverberant speech signal observed at the mth microphone, x_{n,k}^{(m)}, can be represented in the STFT domain using a linear prediction model as (Nakatani et al., 2010)

x_{n,k}^{(m)} = \sum_{l=0}^{L_h - 1} h_{l,k}^{(m)*} s_{n-l,k} + e_{n,k}^{(m)}    (1)

where h_{l,k}^{(m)} is an approximation of the acoustic transfer function (ATF) between the speech source and the mth microphone in the STFT domain, L_h denotes the length of the ATF (measured in frames) and * denotes the complex conjugate. The additive term e_{n,k}^{(m)} models the linear prediction error and the additive noise, and is neglected here as in Nakatani et al. (2010). Therefore, (1) can be rewritten as

x_{n,k}^{(m)} = d_{n,k}^{(m)} + \sum_{l=D}^{L_h - 1} h_{l,k}^{(m)*} s_{n-l,k}    (2)

where d_{n,k}^{(m)} = \sum_{l=0}^{D-1} h_{l,k}^{(m)*} s_{n-l,k} is the sum of the anechoic (direct-path) speech and the early reflections at the mth microphone, and D corresponds to the duration of the early reflections. Most dereverberation techniques, including the WPE method, aim at reconstructing the desired signal, say d_{n,k} ≡ d_{n,k}^{(1)}, or suppressing the late reverberant terms represented by the summation in (2). Replacing the convolutive model in (2) by an auto-regressive (AR) model results in the well-known multi-channel linear prediction (MCLP) form for the observation at the first microphone, i.e.,

d_{n,k} = x_{n,k}^{(1)} - \sum_{m=1}^{M} g_k^{(m)H} x_{n,k}^{(m)} = x_{n,k}^{(1)} - G_k^H X_{n,k}    (3)

with superscript H denoting the Hermitian transpose, where the vectors g_k^{(m)} and x_{n,k}^{(m)} are defined as

g_k^{(m)} = [g_{0,k}^{(m)}, g_{1,k}^{(m)}, ..., g_{L_k-1,k}^{(m)}]^T
x_{n,k}^{(m)} = [x_{n-D,k}^{(m)}, x_{n-D-1,k}^{(m)}, ..., x_{n-D-(L_k-1),k}^{(m)}]^T    (4)

Here, g_k^{(m)} is the regression vector (reverberation prediction weights) of order L_k for the mth channel and the superscript T denotes the transpose. The right-hand side of (3) has been obtained by concatenating {x_{n,k}^{(m)}} and {g_k^{(m)}} over m to respectively form X_{n,k} and G_k. Estimation of the regression vector G_k and insertion of it in (3) provides an estimate of the desired (dereverberated) speech.
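As a concrete illustration of the MCLP model in (3) and (4), the following Python sketch (an assumption of this edit, not code from the paper) forms the delayed, stacked multi-channel observation vector X_{n,k} and evaluates d_{n,k} = x_{n,k}^{(1)} − G_k^H X_{n,k} for a single frequency bin; the names X_stft, D and Lk are hypothetical.

```python
import numpy as np

def mclp_predict(X_stft, G, D, Lk):
    """Apply the MCLP model of Eqs. (3)-(4) at one frequency bin.

    X_stft : complex array of shape (M, N), STFT observations x_{n,k}^{(m)}
             for all M channels and N frames at a single bin k.
    G      : complex vector of length M*Lk, stacked prediction weights G_k.
    D      : prediction delay (number of early-speech frames).
    Lk     : prediction order per channel.
    Returns the desired-speech estimate d_{n,k} for n = 0, ..., N-1.
    """
    M, N = X_stft.shape
    d = np.zeros(N, dtype=complex)
    for n in range(N):
        # Stack the delayed observations x_{n-D,k}^{(m)}, ..., x_{n-D-(Lk-1),k}^{(m)}
        # over all channels; frames before the start of the signal are zero-padded.
        Xn = np.zeros(M * Lk, dtype=complex)
        for m in range(M):
            for l in range(Lk):
                idx = n - D - l
                if idx >= 0:
                    Xn[m * Lk + l] = X_stft[m, idx]
        # Eq. (3): d_{n,k} = x_{n,k}^{(1)} - G_k^H X_{n,k}
        d[n] = X_stft[0, n] - np.vdot(G, Xn)
    return d
```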

From a statistical viewpoint, estimation of G_k can be performed by applying the maximum likelihood (ML) criterion at each frequency bin. To this end, the conventional WPE method (Nakatani et al., 2008b, 2010) assumes a circular complex Gaussian distribution for the desired speech coefficients, d_{n,k}, with (unknown) time-varying spectral variance σ_{d_{n,k}}^2 = E{|d_{n,k}|^2} and zero mean. Assuming that the desired speech STFT coefficients d_{n,k} are independent across frames, i.e., using zero IFC, the joint distribution of the desired speech coefficients for all frames at frequency bin k, as represented by the vector d_k, is given by

p(d_k) = \prod_{n=1}^{N} p(d_{n,k}) = \prod_{n=1}^{N} \frac{1}{\pi \sigma_{d_{n,k}}^2} \exp\left( - \frac{|d_{n,k}|^2}{\sigma_{d_{n,k}}^2} \right)    (5)

Now, by inserting d_{n,k} from (3) into (5), the joint distribution p(d_k) can be viewed as a function of the regression vector G_k and the desired speech spectral variances σ_{d_k}^2 = {σ_{d_{1,k}}^2, σ_{d_{2,k}}^2, ..., σ_{d_{N,k}}^2}. Denoting this set of unknown parameters at each frequency bin by Θ_k = {G_k, σ_{d_k}^2} and taking the negative logarithm of p(d_k) ≡ p(d_k | Θ_k) in (5), the objective function for Θ_k can be written as

J(Θ_k) = - log p(d_k | Θ_k) = \sum_{n=1}^{N} \left( \log \sigma_{d_{n,k}}^2 + \frac{|x_{n,k}^{(1)} - G_k^H X_{n,k}|^2}{\sigma_{d_{n,k}}^2} \right)    (6)

where the constant terms have been discarded. To obtain the ML estimate of the parameter set Θ_k, (6) has to be minimized w.r.t. Θ_k. Since the optimization of (6) jointly w.r.t. G_k and σ_{d_k}^2 is not mathematically tractable, an alternative sub-optimal solution is suggested in Nakatani et al. (2008b, 2010), where a two-step optimization procedure is performed in which, at each step, only one of the two parameter subsets G_k and σ_{d_k}^2 is optimized alternately. The two-step procedure is repeated iteratively until a convergence criterion is satisfied or a maximum number of iterations is reached. While this strategy is rather straightforward, there is no guarantee that the alternating procedure results in a globally optimal solution (Jukic et al., 2015). A summary of the conventional WPE method is outlined in Table 1. It should be noted that, due to the simple instantaneous estimator used for σ_{d_{n,k}}^2, as seen in this table, the obtained value of this parameter has to be lower bounded by ε to avoid unreasonably small values when |d_{n,k}| approaches zero.

Table 1
Outline of the steps of the conventional WPE method with temporally independent speech STFT coefficients.

• At each frequency bin k, consider the speech observations x_{n,k}^{(m)}, for all n and m, and the parameters D, L_k and ε.
• Initialize σ_{d_{n,k}}^2 by σ_{d_{n,k}}^{2[1]} = |x_{n,k}^{(1)}|^2.
• For j = 1, 2, ..., J (with a fixed number of iterations, J), repeat the following:
    A_k^{[j]} = \sum_{n=1}^{N} σ_{d_{n,k}}^{-2[j]} X_{n,k} X_{n,k}^H
    a_k^{[j]} = \sum_{n=1}^{N} σ_{d_{n,k}}^{-2[j]} X_{n,k} x_{n,k}^{(1)*}
    G_k^{[j]} = A_k^{-1[j]} a_k^{[j]}
    r_{n,k}^{[j]} = G_k^{[j]H} X_{n,k}
    d_{n,k}^{[j]} = x_{n,k}^{(1)} - r_{n,k}^{[j]}
    σ_{d_{n,k}}^{2[j+1]} = max{ |d_{n,k}^{[j]}|^2, ε }
• G_k^{[j]} is the desired reverberation prediction weight vector after convergence.
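For readers who prefer code to the tabulated steps, the following Python sketch mirrors the iteration of Table 1 for a single frequency bin. It is a minimal illustration under the paper's notation, not the authors' implementation; the stacked observation matrix is assumed to have been built beforehand as in (4).

```python
import numpy as np

def wpe_conventional(x1, Xmat, J=5, eps=1e-8):
    """Conventional (iterative) WPE at one frequency bin, following Table 1.

    x1   : complex vector of length N, reference-channel STFT x_{n,k}^{(1)}.
    Xmat : complex matrix of shape (M*Lk, N) whose nth column is the stacked,
           delayed observation vector X_{n,k} defined in (3)-(4).
    J    : number of iterations.
    eps  : lower bound for the spectral variance estimate.
    Returns the prediction weights G_k and the dereverberated frames d_{n,k}.
    """
    # Initialization: sigma^2_{d_{n,k}} = |x_{n,k}^{(1)}|^2, floored by eps
    sigma2 = np.maximum(np.abs(x1) ** 2, eps)
    d = x1.copy()
    G = np.zeros(Xmat.shape[0], dtype=complex)
    for _ in range(J):
        w = 1.0 / sigma2                           # per-frame weights
        A = (Xmat * w) @ Xmat.conj().T             # A_k = sum_n X_n X_n^H / sigma2_n
        a = (Xmat * w) @ x1.conj()                 # a_k = sum_n X_n x_n^{(1)*} / sigma2_n
        G = np.linalg.solve(A, a)                  # G_k = A_k^{-1} a_k
        r = G.conj() @ Xmat                        # r_{n,k} = G_k^H X_{n,k}
        d = x1 - r                                 # d_{n,k} = x_{n,k}^{(1)} - r_{n,k}
        sigma2 = np.maximum(np.abs(d) ** 2, eps)   # variance update with lower bound
    return G, d
```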

In the following section, we propose an extension of the WPE approach by taking into account the correlation of d_{n,k} across the STFT frame index, n, namely the IFC.

3. WPE method using inter-frame correlations

To demonstrate the importance of the temporal correlation of the desired early speech component, d_{n,k}, across STFT frames, which is the main motivation for developing the WPE method using IFC in this work, we have illustrated in Fig. 1 the IFC present in the early speech for a given frame lag. To generate this figure, we extracted the early part, i.e. the first 60 ms, of a room impulse response (RIR) with 60 dB reverberation time T60dB = 800 ms, and then convolved it with an anechoic speech utterance to obtain the early speech d_{n,k}.¹ Next, the IFC measure |E{d_{n,k} d_{n-l,k}^*}| was estimated through time averaging (i.e. long-term recursive smoothing) of the product d_{n,k} d_{n-l,k}^* over n and then normalized by the estimated value of E{|d_{n,k}|^2}. The plotted values are the average over all frequency bins and have been obtained for the lag l = 3. As observed from Fig. 1, the amount of correlation between the early speech components d_{n,k} and d_{n-l,k} is quite considerable as compared to the spectral variance E{|d_{n,k}|^2}. Whereas this correlation is neglected in earlier versions of the WPE method, the method that we propose here takes it into account by jointly modeling the early speech terms. From Fig. 1, it is also observed that, even though the updating rate of the underlying smoothing is not high, the estimated IFC fluctuates rapidly across frames. Therefore, an efficient approach with fast convergence should be devised for its estimation.

Fig. 1. Normalized IFC of the early speech d_{n,k} averaged over frequency bins versus STFT frame number for a selected speech utterance.

¹ Note that, considering D = 3 early terms and using a frame length of 40 ms with 50% overlap, the early speech component corresponds to the first 60 ms of the RIR.

In this section, we first derive a solution for the reverberation prediction vector G_k by considering the IFC, in contrast to the model in (5). Next, based on an extension of the method proposed for the estimation of the speech spectral variance in Parchami et al. (2016), an approach for the estimation of the IFC matrix of the desired speech terms, as required by the derived solution, will be developed.

3.1. Proposed approach

Considering the joint distribution of the desired speech STFT coefficients and assuming independence across frequency bins, the temporally/spectrally independent model in (5) should be replaced by

p(d_k) = p(d_{1,k}) \prod_{n=2}^{N} p(d_{n,k} | D_{n,k})    (7)

with p(d_{n,k} | D_{n,k}) denoting the distribution of d_{n,k} conditioned on D_{n,k} = [d_{n-1,k}, d_{n-2,k}, ..., d_{1,k}]^T. Considering the fact that d_{n,k} depends only on a limited number of the speech coefficients from previous frames, or equivalently, the fact that the IFC length is finite, (7) can be written as

p(d_k) = p(d_{1,k}) \prod_{n=2}^{N} p(d_{n,k} | \bar{\mathbf{d}}_{n-1,k}) = p(d_{1,k}) \prod_{n=2}^{N} \frac{p(d_{n,k}, \bar{\mathbf{d}}_{n-1,k})}{p(\bar{\mathbf{d}}_{n-1,k})}    (8)

where the conditioning term D_{n,k} in (7) has been replaced by the shorter segment \bar{\mathbf{d}}_{n-1,k} = [d_{n-1,k}, d_{n-2,k}, ..., d_{n-τ_k,k}]^T, with τ_k as the assumed IFC length in frames. Unfortunately, proceeding with the model in (8) to find an ML solution for the regression vector G_k does not lead to a convex optimization problem. Therefore, to overcome this limitation, we alternatively exploit an approximate model by considering only the correlations among the frames within each segment, \mathbf{d}_{n,k} = [d_{n,k}, d_{n-1,k}, ..., d_{n-τ_k+1,k}]^T, and disregarding the correlations across the segments. This results in the following approximate model

p(d_k) ≈ \prod_{n=1}^{⌊N/τ_k⌋} p(\mathbf{d}_{n,k}) = \prod_{n=1}^{⌊N/τ_k⌋} \frac{1}{\pi^{τ_k} \det Λ_{n,k}} \exp\left( - \mathbf{d}_{n,k}^H Λ_{n,k}^{-1} \mathbf{d}_{n,k} \right)    (9)

where Λ_{n,k} = E{\mathbf{d}_{n,k} \mathbf{d}_{n,k}^H} represents the correlation matrix of \mathbf{d}_{n,k}, det denotes the determinant of a matrix and ⌊·⌋ is the floor function. Now, using (3), the desired speech segment \mathbf{d}_{n,k} can be expressed as

\mathbf{d}_{n,k} = u_{n,k} - U_{n,k}^H h_k    (10)

where

u_{n,k} = [x_{n,k}^{(1)}, x_{n-1,k}^{(1)}, ..., x_{n-τ_k+1,k}^{(1)}]^T,    U_{n,k} = [X_{n,k}, X_{n-1,k}, ..., X_{n-τ_k+1,k}],    h_k = G_k^*    (11)

In the same manner as the original WPE method (Nakatani et al., 2010), by considering the negative of the logarithm of p(d_k | h_k), an ML-based objective function for the regression weight vector h_k can be derived as follows,

J(h_k) ≜ - log p(d_k | h_k) = \sum_{n=1}^{⌊N/τ_k⌋} \left( \mathbf{d}_{n,k}^H Λ_{n,k}^{-1} \mathbf{d}_{n,k} + K_{n,k} \right)    (12)

with K_{n,k} representing the terms independent of h_k, which can be discarded. Inserting (10) into (12) and performing further manipulations result in

J(h_k) = \sum_{n=1}^{⌊N/τ_k⌋} \left( h_k^H A_{n,k} h_k - b_{n,k}^H h_k - h_k^H b_{n,k} + c_{n,k} \right)    (13)

where we defined

A_{n,k} = U_{n,k} Λ_{n,k}^{-1} U_{n,k}^H,    b_{n,k} = U_{n,k} Λ_{n,k}^{-1} u_{n,k},    c_{n,k} = u_{n,k}^H Λ_{n,k}^{-1} u_{n,k}    (14)

Now, by neglecting the constant term c_{n,k}, (13) can be arranged as

J(h_k) = h_k^H A_k h_k - b_k^H h_k - h_k^H b_k    (15)

with A_k and b_k as

A_k = \sum_{n=1}^{⌊N/τ_k⌋} A_{n,k},    b_k = \sum_{n=1}^{⌊N/τ_k⌋} b_{n,k}    (16)

It can be shown that the matrix A_k is positive semidefinite, and therefore, the quadratic objective function in (15) is real-valued and convex in terms of h_k. Subsequently, to find the global minimum of J(h_k), we can express (15) in the following form

J(h_k) = (h_k - \hat{h}_k)^H A_k (h_k - \hat{h}_k) + c_k    (17)

where c_k is a term independent of h_k and

\hat{h}_k = A_k^{-1} b_k    (18)

It is evident that \hat{h}_k in the above is the global minimum of the objective function J(h_k) in (17), or equivalently, it is the estimate of the reverberation prediction weights by the proposed WPE method.

3.2. Estimation of the IFC matrix

To calculate the optimal reverberation prediction weights by (18), A_k and b_k in (16), and in turn, A_{n,k} and b_{n,k} given by (14), have to be calculated. To do so, as seen in (14), the IFC matrix of the desired speech terms, Λ_{n,k}, has to be estimated beforehand. In Parchami et al. (2016), a new variant of the WPE method has been suggested that exploits the geometric spectral subtraction approach in Lu and Loizou (2008), along with the estimation of the late reverberation spectral variance (LRSV), in order to estimate the spectral variance of the desired speech, σ_{d_{n,k}}^2, unlike the iterative scheme of the original WPE method summarized in Table 1. We here develop an extension of the method proposed in Parchami et al. (2016) to estimate the spectral cross-variances of the desired speech terms, ρ_{n_1,n_2,k} = E{d_{n_1,k} d_{n_2,k}^*}, which in fact constitute the IFC matrix Λ_{n,k}. In this regard, by resorting to dereverberation by spectral enhancement (the gain function-based approach), the following estimate of d_{n,k} can be obtained (Parchami et al., 2016)

\hat{d}_{n,k} = \sqrt{ \frac{ 1 - (γ_{n,k} - ξ_{n,k} + 1)^2 / (4 γ_{n,k}) }{ 1 - (γ_{n,k} - ξ_{n,k} - 1)^2 / (4 ξ_{n,k}) } } \; x_{n,k}^{(1)}    (19)

where the two parameters ξ_{n,k} and γ_{n,k} are defined as

ξ_{n,k} = \frac{|d_{n,k}|^2}{|r_{n,k}|^2},    γ_{n,k} = \frac{|x_{n,k}^{(1)}|^2}{|r_{n,k}|^2}    (20)

with r_{n,k} = x_{n,k}^{(1)} - d_{n,k} as the reverberant-only component. We exploit (19) to provide primary estimates of d_{n_1,k} and d_{n_2,k} and then use recursive smoothing of d_{n_1,k} d_{n_2,k}^* to estimate the elements of the IFC matrix Λ_{n,k}. As explained in Parchami et al. (2016), due to the unavailability of |d_{n,k}|^2 and |r_{n,k}|^2, the two parameters defined in (20) are not known a priori and have to be substituted by their approximations. To this end, we use |\hat{d}_{n-1,k}|^2 given by (19) in place of |d_{n,k}|^2, and a short-term estimate of the spectral variance σ_{r_{n,k}}^2 in place of |r_{n,k}|^2. To determine the spectral variance σ_{r_{n,k}}^2, we resort to the statistical model-based estimation of the LRSV, which has been widely explored in the context of spectral enhancement (Habets, 2007). Therein, an estimator of the LRSV was derived using a statistical model for the RIR in the spectral domain along with a few recursive smoothing schemes. In brief, the following scheme has been conventionally used to estimate the LRSV (Habets et al., 2009)

σ_{x_{n,k}^{(1)}}^2 = (1 - β) σ_{x_{n-1,k}^{(1)}}^2 + β |x_{n,k}^{(1)}|^2    (21a)
σ_{\tilde{r}_{n,k}}^2 = (1 - κ) σ_{\tilde{r}_{n-1,k}}^2 + κ σ_{x_{n-1,k}^{(1)}}^2    (21b)
σ_{r_{n,k}}^2 = e^{-2 α_k R N_e} σ_{\tilde{r}_{n-(N_e-1),k}}^2    (21c)

where α_k is related to the 60 dB reverberation time, T60dB,k, by α_k = 3 ln 10 / (T60dB,k f_s), with f_s as the sampling frequency in Hz, R is the STFT frame advance in samples, β and κ are two smoothing parameters, and N_e is the frame delay specifying the number of assumed early speech frames, which is chosen herein as D. This choice is made so that the number of frames considered as early speech in the LRSV estimation scheme is equal to the number of frames included in the desired speech d_{n,k} by the WPE method, as in (2). The term \tilde{r}_{n,k} actually represents the entire reverberant speech, including both the early and late reverberation but excluding the direct path. Using the LRSV estimator in (21), the short-term estimate of σ_{r_{n,k}}^2 is obtained by choosing the smoothing parameters β and κ to be close to one. By this choice, the estimate of σ_{r_{n,k}}^2 includes more updates (less smoothing) and will therefore be a closer approximation to |r_{n,k}|^2. Yet, to avoid unreasonably small values of the approximated |r_{n,k}|^2 in the denominator of (20), this parameter is lower thresholded by 10^{-3}.
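To make the estimation chain of (19)-(21) concrete, the following Python sketch (an illustrative rendering, not the authors' code) first runs the LRSV recursions (21a)-(21c) to obtain a short-term reverberant variance, and then applies the geometric gain of (19)-(20) to produce the primary desired-speech estimates; as described above, the previous-frame estimate |d̂_{n-1,k}|² is used in place of |d_{n,k}|², and the small numerical safeguards are assumptions of this sketch.

```python
import numpy as np

def primary_desired_estimate(x1, T60, fs, R, Ne, beta=0.5, kappa=0.8, floor=1e-3):
    """Primary estimate d_hat_{n,k} of the desired speech per Eqs. (19)-(21),
    for one frequency bin; x1 holds the reference-channel STFT over frames."""
    alpha = 3.0 * np.log(10.0) / (T60 * fs)
    decay = np.exp(-2.0 * alpha * R * Ne)
    N = x1.shape[0]
    var_x = np.zeros(N)        # sigma^2_{x^{(1)}} of (21a)
    var_rt = np.zeros(N)       # sigma^2_{r~} of (21b)
    d_hat = np.zeros(N, dtype=complex)
    d_prev_pow = np.abs(x1[0]) ** 2   # stand-in for |d_{n-1,k}|^2 at start-up
    for n in range(N):
        # (21a)-(21b): recursive smoothing of the observed and reverberant variances
        var_x[n] = (1 - beta) * (var_x[n - 1] if n else 0.0) + beta * np.abs(x1[n]) ** 2
        var_rt[n] = (1 - kappa) * (var_rt[n - 1] if n else 0.0) \
                    + kappa * (var_x[n - 1] if n else 0.0)
        # (21c): delayed, exponentially decayed LRSV as the short-term |r_{n,k}|^2,
        # lower thresholded as described in the text
        var_r = max(decay * var_rt[max(n - (Ne - 1), 0)], floor)
        # (20): the two ratios, using the previous-frame desired-speech power
        xi = max(d_prev_pow / var_r, 1e-6)
        gamma = max(np.abs(x1[n]) ** 2 / var_r, 1e-6)
        # (19): geometric spectral-subtraction gain applied to the observation
        num = 1.0 - (gamma - xi + 1.0) ** 2 / (4.0 * gamma)
        den = 1.0 - (gamma - xi - 1.0) ** 2 / (4.0 * xi)
        gain = np.sqrt(max(num, 0.0) / max(den, 1e-6))
        d_hat[n] = gain * x1[n]
        d_prev_pow = np.abs(d_hat[n]) ** 2
    return d_hat
```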

Now, given the estimates of the desired speech components \hat{d}_{n_1,k} and \hat{d}_{n_2,k} obtained by (19), it is straightforward to use a recursive smoothing scheme to estimate the spectral cross-variance ρ_{n_1,n_2,k}, as follows

ρ_{n_1,n_2,k} = (1 - η) ρ_{(n_1-1),(n_2-1),k} + η \hat{d}_{n_1,k} \hat{d}_{n_2,k}^*    (22)

with η as a fixed smoothing parameter. Equivalently, by expressing (22) in matrix form, it follows that

Λ_{n,k} = (1 - η) Λ_{n-1,k} + η \hat{\mathbf{d}}_{n,k} \hat{\mathbf{d}}_{n,k}^H    (23)

with the vector of the estimated desired speech terms \hat{\mathbf{d}}_{n,k} = [\hat{d}_{n,k}, \hat{d}_{n-1,k}, ..., \hat{d}_{n-τ_k+1,k}]^T. The inverse of the estimated IFC matrix Λ_{n,k} is to be used to obtain A_{n,k} and b_{n,k} in (14). Here, to avoid the complexity involved in the direct inversion of Λ_{n,k} and also to overcome the common singularity issue encountered in the inversion of the sample correlation matrix, we use the Sherman-Morrison matrix inversion lemma (Hager, 1989) to implicitly invert Λ_{n,k} as given by (23). The simplified form of this lemma can be written as (Hager, 1989)

(A - U V^H)^{-1} = A^{-1} + \frac{A^{-1} U V^H A^{-1}}{1 - V^H A^{-1} U}    (24)

for an invertible matrix A and any two column vectors U and V. Using (24) for the inverse of Λ_{n,k} in (23), i.e. by taking A, U and V respectively as (1 - η) Λ_{n-1,k}, -η \hat{\mathbf{d}}_{n,k} and \hat{\mathbf{d}}_{n,k}, it can be deduced that

Λ_{n,k}^{-1} = \frac{Λ_{n-1,k}^{-1}}{1 - η} - \frac{η}{1 - η} \cdot \frac{Λ_{n-1,k}^{-1} \hat{\mathbf{d}}_{n,k} \hat{\mathbf{d}}_{n,k}^H Λ_{n-1,k}^{-1}}{1 - η + η \hat{\mathbf{d}}_{n,k}^H Λ_{n-1,k}^{-1} \hat{\mathbf{d}}_{n,k}}    (25)

The above can be implemented recursively to update the inverse of Λ_{n,k} at each frame without the need for direct matrix inversion.

It should be noted that the overall WPE-based dereverberation approach presented in this section can be considered as an extension of the method presented in Parchami et al. (2016), obtained by taking into account the IFC of the desired speech signal. Namely, for the choice of τ_k = 1, it can be shown that the proposed solution in (18) degenerates to the method suggested in Parchami et al. (2016).
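Putting the pieces of this section together, the following Python sketch outlines one plausible per-bin implementation of the proposed estimator: the primary estimates d̂_{n,k} from (19) drive the recursive IFC update (23) and its inverse (25), which feed the accumulation of A_k and b_k in (14) and (16) and the closed-form solution (18). This is an assumed reading of the algorithm rather than the authors' code; the stacked inputs U_{n,k} and u_{n,k}, the identity initialization of Λ^{-1}, and the segment handling are assumptions of the sketch.

```python
import numpy as np

def proposed_wpe_weights(U_frames, u_frames, d_hat, tau, eta=0.7):
    """Reverberation prediction weights of the proposed WPE with IFC, one bin.

    U_frames : list over frames of U_{n,k} matrices, shape (M*Lk, tau) each (Eq. 11).
    u_frames : list over frames of u_{n,k} vectors, length tau each (Eq. 11).
    d_hat    : complex vector of primary desired-speech estimates from Eq. (19).
    tau      : assumed IFC length in frames.
    eta      : smoothing parameter of Eqs. (22)-(23).
    Returns h_k = A_k^{-1} b_k (Eq. 18); the MCLP weights are G_k = conj(h_k).
    """
    dim = U_frames[0].shape[0]
    A = np.zeros((dim, dim), dtype=complex)
    b = np.zeros(dim, dtype=complex)
    Lam_inv = np.eye(tau, dtype=complex)   # initialization not specified in the text
    N = len(d_hat)
    for n in range(tau - 1, N):
        # Stacked primary estimates [d_hat_n, d_hat_{n-1}, ..., d_hat_{n-tau+1}]
        dvec = d_hat[n - tau + 1:n + 1][::-1].reshape(-1, 1)
        # Eq. (25): Sherman-Morrison style update of Lambda^{-1}
        v = Lam_inv @ dvec
        denom = 1 - eta + eta * float(np.real(dvec.conj().T @ v))
        Lam_inv = Lam_inv / (1 - eta) - (eta / (1 - eta)) * (v @ v.conj().T) / denom
        # Eqs. (14), (16): accumulate over non-overlapping segments of length tau,
        # following the segment-wise model of Eq. (9)
        if (n - (tau - 1)) % tau == 0:
            U, u = U_frames[n], u_frames[n]
            A += U @ Lam_inv @ U.conj().T
            b += U @ Lam_inv @ u
    # Eq. (18): closed-form minimizer of the convex objective (15)
    return np.linalg.solve(A, b)
```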

4. Performance evaluation

In this section, the performance of the proposed dereverberation approach is evaluated in comparison with the original WPE method and a few recent variations of this method from the literature. To this end, 20 clean speech utterances (including 10 male and 10 female speakers) are used from the TIMIT database (Garofolo et al., 1993), where the average length of the speech samples is 3.7 s and the average speech-to-silence ratio is 4.8. Both real-world recorded and synthetic RIRs are used to generate microphone array signals in a reverberant noisy environment. In the first case, to account for a real-world scenario, the clean speech utterances are convolved with measured RIRs from the SimData of the REVERB Challenge (REVERB, 2013), wherein an 8-channel circular microphone array with a diameter of 20 cm was placed in three rectangular rooms (labeled 1-3) to measure the RIRs.² Room 1 is 3.7 × 5.5 m with a T60dB of 250 ms, room 2 is 4.8 × 6.2 m with a T60dB of 680 ms and room 3 is 6.6 × 6.1 m with a T60dB of 730 ms. The height of all rooms is 2.5 m and the microphone array and speakers are placed 1.1 m high. To account for different types of noise (i.e. babble, white and pink), the resulting reverberant signals are combined with different noises taken from the Noisex-92 database (Varga and Steeneken, 1993) at a global signal-to-noise ratio (SNR) of 15 dB. Although we report the results for three types of noise here, considering other types of noise led to the same conclusions as the ones drawn next. To properly add noise to the reverberant signals, we use the function v_addnoise of the speech processing toolbox VoiceBOX (Brookes, 2009), which calculates the speech signal level according to the ITU-T recommendation P.56 (ITU-T, 1993). In the second case, to further analyze the performance of the considered methods under different levels of reverberation, the image source method (ISM) (Lehmann, 2016) is used to simulate the scenario illustrated in Fig. 2. As viewed, a source of anechoic speech and two independent anechoic sources of babble noise taken from Noisex-92 (Varga and Steeneken, 1993) are placed in an acoustic room with the indicated dimensions. The RIRs from the speech and noise sources to the microphones are synthesized to achieve a desired reverberation time T60dB. These are convolved with the anechoic signals to generate reverberant microphone signals, which are next linearly combined to achieve a desired global SNR of 15 dB.

Fig. 2. A two-dimensional illustration of the geometry of the synthesized scenario of a noisy reverberant environment.

² Note that only two of the available 8 channels are used herein.

For the relative evaluation of the different dereverberation methods, we use four performance measures, as recommended by the REVERB Challenge (Kinoshita et al., 2013). These performance metrics include: the perceptual evaluation of speech quality (PESQ), the cepstrum distance (CD), the frequency-weighted segmental SNR (FW-SNR) and the signal to reverberation modulation energy ratio (SRMR). The PESQ score is one of the most frequently used performance measures in the speech enhancement literature, which has been recommended by ITU-T standards for speech quality assessment (PESQ, 2001). It ranges between 1 and 4.5, with higher values corresponding to better speech quality. The CD is calculated as the log-spectral distance between the linear prediction coefficients (LPC) of the enhanced and clean speech spectra (Hu and Loizou, 2008). It is often limited to the range [0, 10], where a smaller CD value indicates less deviation from the clean speech. The FW-SNR is calculated based on a critical band analysis with a mel-frequency filter bank, using the clean speech amplitude as the corresponding weights (Hu and Loizou, 2008). It generally takes a value in the range [-10, 35] dB, the higher the better. The SRMR, which has been exclusively devised for the assessment of dereverberation, is a non-intrusive measure (i.e., one requiring only the enhanced speech for its calculation) based on an auditory-inspired filterbank analysis of the critical band temporal envelopes of the speech signal (Falk et al., 2010). A higher SRMR corresponds to a higher energy of the anechoic speech relative to that of the reverberant-only speech.

In the conducted experiments, the sampling rate is set to 16 kHz and a 40 ms Hamming window with an overlap of 50% is used for the STFT analysis-synthesis. To achieve the best dereverberation performance, the number of early speech terms is chosen as D = 3, while the order of the regression vector G_k is chosen as L_k = 20. To implement the original (iterative) version of the WPE method (Nakatani et al., 2010), we take the number of iterations to be J = 5, since more iterations do not result in better performance. The number of microphones is taken to be M = 2, as the results obtained with a larger number of microphones lead to the same conclusions. We use the first 10 s of the reverberant speech observation to estimate the reverberation prediction weights G_k in all reported experiments. We take the length of the IFC to be independent of frequency, i.e. τ_k ≡ τ, in our experiments.
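As a point of reference for reproducing this setup, the analysis-synthesis stage described above (16 kHz sampling, 40 ms Hamming window, 50% overlap) can be realized with a standard STFT routine; the following Python sketch uses scipy.signal and only illustrates the stated parameters, it is not the authors' code.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                    # sampling rate (Hz)
frame_len = int(0.040 * fs)   # 40 ms analysis window -> 640 samples
frame_shift = frame_len // 2  # 50% overlap -> frame advance R = 320 samples

def analysis(x):
    """Forward STFT of a time-domain signal x; returns (freqs, frame_times, X[k, n])."""
    return stft(x, fs=fs, window='hamming', nperseg=frame_len,
                noverlap=frame_len - frame_shift)

def synthesis(X):
    """Inverse STFT back to the time domain via overlap-add."""
    _, x_hat = istft(X, fs=fs, window='hamming', nperseg=frame_len,
                     noverlap=frame_len - frame_shift)
    return x_hat
```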
In order to perform the matrix inversion in (18) with better accuracy, we use the QR factorization of the matrix A_k in (16) with forward-backward substitution (Press et al., 2007). Also, to estimate the LRSV through (21), knowledge of the reverberation time T60dB is required. Here, we use the reverberation time estimation method in Löllmann et al. (2010) to estimate this parameter blindly from the observed speech. The T60dB estimated in this way is accurate enough not to degrade the performance of the underlying LRSV estimator in (21). The smoothing parameters β and κ in (21) are respectively selected as 0.5 and 0.8, while η in (25) is fixed at 0.7. Our approach requires no prior knowledge of the direct to reverberant ratio (DRR) parameter or its estimate.
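The solve step mentioned above can be written compactly; the sketch below is illustrative only, assumes A_k and b_k have already been accumulated as in (16), and recovers ĥ_k by factorizing A_k once per bin followed by back-substitution.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def solve_weights_qr(A_k, b_k):
    """Solve A_k h = b_k of Eq. (18) via a QR factorization with back-substitution."""
    Q, R = qr(A_k)                            # A_k = Q R, with R upper triangular
    h = solve_triangular(R, Q.conj().T @ b_k) # R h = Q^H b_k
    return h                                  # G_k = conj(h) gives the MCLP weights
```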
To investigate the IFC present between early speech terms with different frame lags, we calculated the normalized IFC by sample averaging over all frequency bins and frames. The results are shown in Fig. 3 for both anechoic and reverberant speech signals with different values of the reverberation time. As seen, the IFC is quite pronounced for smaller lag values (say 5 or less), but decreases to a lower level for larger lags. We will take this observation into account in choosing the appropriate IFC length, τ, in the sequel. A more detailed study of the IFC in the STFT domain can be found in Cohen (2005).

Fig. 3. Normalized IFC averaged over frequency bins and frames versus the frame lag for speech samples with different amounts of reverberation.
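The normalized IFC plotted in Figs. 1 and 3 can be measured from the early-speech STFT coefficients by recursive averaging of the lag products, as described in Section 3. The sketch below is an assumed Python implementation of that measurement (the smoothing constant and variable names are hypothetical), averaging over frequency bins for a given lag l:

```python
import numpy as np

def normalized_ifc(D_early, lag, smoothing=0.98):
    """Normalized IFC |E{d_n d*_{n-l}}| / E{|d_n|^2}, averaged over frequency bins.

    D_early  : complex STFT matrix of the early speech, shape (K, N) (bins x frames).
    lag      : frame lag l.
    smoothing: recursive smoothing factor of the long-term averages.
    Returns a vector over frames n (zero for the first `lag` frames).
    """
    K, N = D_early.shape
    cross = np.zeros((K, N), dtype=complex)   # running estimate of E{d_n d*_{n-l}}
    power = np.zeros((K, N))                  # running estimate of E{|d_n|^2}
    for n in range(N):
        prod = D_early[:, n] * np.conj(D_early[:, n - lag]) if n >= lag else 0.0
        prev_c = cross[:, n - 1] if n else 0.0
        prev_p = power[:, n - 1] if n else 0.0
        cross[:, n] = smoothing * prev_c + (1 - smoothing) * prod
        power[:, n] = smoothing * prev_p + (1 - smoothing) * np.abs(D_early[:, n]) ** 2
    ifc = np.abs(cross) / np.maximum(power, 1e-12)
    return ifc.mean(axis=0)                   # average over frequency bins
```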
Fig. 4. Performance of the proposed WPE method versus the assumed IFC length, τ, for different D.

Next, we study the effect of the assumed number of correlated speech frames, τ, on the overall performance of the proposed dereverberation approach. It was found that the choice of this parameter is more dependent on the number of early speech frames, D, than on the other involved parameters, e.g. L_k and T60dB. This theoretically makes sense, since the parameter D determines the duration of the early reflections, and therefore, the IFC is controlled by D to a large extent. Fig. 4 shows the PESQ scores of the proposed approach versus different τ with D ranging from 1 to 4, when using the measured RIRs from the SimData of the REVERB Challenge. Apart from the observation that the performance of the proposed approach is best for D = 3, it can be seen that the higher the value of D, the larger the value of τ resulting in the best performance. This result is due to the fact that the higher the value of D, the larger the amount of IFC between subsequent frames of the desired speech. It is also observed that the best choice of the parameter τ occurs in the range of 2-6, despite the fact that the theoretically optimal choice of τ is N, i.e., the number of frames in the entire speech utterance.³ The reason for this limitation in the performance of the proposed approach seems to be the limited accuracy in the estimation of the IFC matrix, Λ_{n,k}. In effect, the estimation error in Λ_{n,k}, which grows with the size τ of the matrix Λ_{n,k}, degrades the overall performance of the proposed method. Therefore, we choose the value of τ = 5 for the case of D = 3 in our experiments. This is also consistent with the fact that the IFC is more strongly present at lag values of around 5 or less, as inferred before from Fig. 3.

³ Note that in this case, the approximate model in (9) turns into an accurate joint model for all the desired speech frames.

To evaluate the reverberation suppression performance of the proposed method, we compare it to the original WPE method (Nakatani et al., 2010), two recent developments of the same method based on the complex generalized Gaussian (CGG) family of distributions (Jukic et al., 2015) and the Laplacian distribution (Jukic and Doclo, 2014) for the desired speech, the WPE method using the estimation of the speech spectral variance (Parchami et al., 2016), and finally, the proposed method using perfect knowledge of the desired speech component. The CGG-based method makes use of the same solution for the regression vector G_k as the original WPE method but with a different estimator of the speech spectral variance in its iterative procedure. The Laplacian-based method does not have a closed-form solution for the reverberation prediction weights, G_k, and has to be implemented through numerical optimization, e.g. by using the CVX optimization toolbox (CVX Research, 2012). Next, the WPE method presented in Parchami et al. (2016) is actually a particular case of the method presented in this work, obtained by disregarding the IFC and estimating only the speech spectral variance at each frame independently. Finally, the proposed WPE method with IFC knowledge is obtained by exploiting only the early component of the speech (this can be obtained in the same manner as that for Fig. 1) as \hat{d}_{n,k} in (23), and is considered as a reference for comparison. The comparative results obtained by using the recorded RIRs from the REVERB Challenge with different noise types are presented in Tables 2-4 in terms of the aforementioned objective performance measures.

Table 2
Performance comparison of different WPE-based dereverberation methods using the recorded RIR of room 1 from the REVERB Challenge with babble noise.

Method                                     PESQ    CD (dB)   FW-SNR (dB)   SRMR
Unprocessed                                2.26    4.26      2.90          3.82
Original WPE (Nakatani et al., 2010)       2.57    3.55      5.11          6.42
CGG-based WPE (Jukic et al., 2015)         2.60    3.50      5.33          6.74
WPE suggested in Parchami et al. (2016)    2.67    3.42      6.08          7.53
Proposed WPE                               2.73    3.24      6.79          7.99
Proposed WPE with IFC knowledge            2.81    3.11      7.52          8.40

Table 3
Performance comparison of different WPE-based dereverberation methods using the recorded RIR of room 2 from the REVERB Challenge with white noise.

Method                                     PESQ    CD (dB)   FW-SNR (dB)   SRMR
Unprocessed                                1.94    4.62      0.90          2.05
Original WPE (Nakatani et al., 2010)       2.10    3.75      1.88          3.17
CGG-based WPE (Jukic et al., 2015)         2.12    3.70      1.98          3.25
WPE suggested in Parchami et al. (2016)    2.18    3.44      2.30          3.67
Proposed WPE                               2.23    3.32      2.51          3.99
Proposed WPE with IFC knowledge            2.30    3.15      2.74          4.24

Table 4
Performance comparison of different WPE-based dereverberation methods using the recorded RIR of room 3 from the REVERB Challenge with pink noise.

Method                                     PESQ    CD (dB)   FW-SNR (dB)   SRMR
Unprocessed                                1.87    4.96      0.52          1.98
Original WPE (Nakatani et al., 2010)       2.01    3.82      1.38          3.09
CGG-based WPE (Jukic et al., 2015)         2.02    3.73      1.51          3.23
WPE suggested in Parchami et al. (2016)    2.07    3.50      1.82          3.60
Proposed WPE                               2.13    3.36      2.06          3.87
Proposed WPE with IFC knowledge            2.21    3.19      2.29          4.20

As observed, whereas the CGG and Laplacian-based methods achieve better scores w.r.t. the original WPE, the WPE with speech spectral variance estimation performs better than the former three methods, and finally, the proposed WPE method in this work achieves the best results as compared to the previous methods. It should be noted that the superior performance of the proposed WPE with knowledge of the IFC shows the possibility of improving the proposed method through the availability of more accurate IFC matrix estimates. It is found that the relative performance of the considered methods in terms of the four investigated scores is consistent.

Next, to evaluate the performance of the considered dereverberation methods for different amounts of reverberation, the objective performance measures are obtained by using the synthesized RIRs with different T60dB. The results are presented in Figs. 5-8 for T60dB in the range of 100-1000 ms. For better visualization, only the resulting improvements in the performance scores w.r.t. the unprocessed speech (denoted by ΔPESQ and so forth) are illustrated. As seen in these figures, the proposed method in this work and the one in Parchami et al. (2016), which are both based on the estimation of the speech spectral variance by means of an LRSV estimator, perform significantly better than the previous versions of the WPE method, which estimate the speech spectral variance iteratively along with the reverberation prediction weights. Also, it is observed that the proposed method achieves the best scores in comparison with the others over almost the entire range of T60dB.

Fig. 5. Improvement in PESQ for different WPE-based dereverberation methods.

Fig. 6. Improvement in CD for different WPE-based dereverberation methods.

Fig. 7. Improvement in FW-SNR for different WPE-based dereverberation methods.

Fig. 8. Improvement in SRMR for different WPE-based dereverberation methods.

This advantage is more visible for moderate values of T60dB. There is also a considerable gap between the results obtained by using the proposed approach with the suggested estimation of the IFC matrix and those obtained by using the perfect knowledge of the early speech, which indicates an avenue for further research on the estimation of the IFC.

5. Conclusion

In this work, we proposed a novel WPE dereverberation method based on an approximate model for the correlation across desired speech frames, namely the IFC, in the STFT domain. It was shown that, given an estimate of the IFC matrix, the dereverberation problem of interest can be formulated as a convex quadratic optimization leading to a closed-form Wiener-like solution. Performance evaluations using both recorded and synthesized RIRs reveal that the proposed method considerably outperforms the previous variations of the WPE method.

It can be concluded that incorporating the statistical model-based estimation of the desired speech spectral variance (or correlation matrix in general) into the WPE dereverberation method can lead to better reverberation suppression performance. Such an approach, unlike the original WPE method, results in a non-iterative estimator for the reverberation prediction weight vector, provided that proper estimates of the spectral auto- and cross-variances of the desired speech terms are available. According to the performed experiments, it can be concluded that the existing limit on the performance of the suggested WPE method is mostly due to the limited accuracy in the estimation of the IFC matrix, and therefore, this shortcoming can be overcome by developing a more precise estimator for the IFC. This can serve as a topic of future research on linear prediction-based dereverberation in the STFT domain.

Acknowledgment

This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

References

Attias, H., Platt, J.C., Acero, A., Deng, L., 2001. Speech denoising and dereverberation using probabilistic models. Adv. Neural Inf. Process. Syst. 13, 758–764.
Brookes, M., 2009. VoiceBOX: speech processing toolbox for MATLAB. Available at http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, last accessed on May 2016.
Cohen, I., 2005. Relaxed statistical model for speech enhancement and a priori SNR estimation. IEEE Trans. Speech Audio Process. 13 (5), 870–881.
CVX Research, Inc., 2012. CVX: Matlab software for disciplined convex programming, version 2.0. Available at http://cvxr.com/cvx, last accessed on May 2016.
Erkelens, J., Heusdens, R., 2010. Correlation-based and model-based blind single-channel late-reverberation suppression in noisy time-varying acoustical environments. IEEE Trans. Audio Speech Language Process. 18 (7), 1746–1765.
Esch, T., 2012. Model-Based Speech Enhancement Exploiting Temporal and Spectral Dependencies. Ph.D. thesis, RWTH Aachen University.
Falk, T.H., Zheng, C., Chan, W.Y., 2010. A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech. IEEE Trans. Audio Speech Language Process. 18 (7), 1766–1774.
Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., Zue, V., 1993. TIMIT acoustic-phonetic continuous speech corpus LDC93S1. Linguistic Data Consortium, Philadelphia.
Habets, E.A.P., 2007. Single- and Multi-Microphone Speech Dereverberation using Spectral Enhancement. Ph.D. thesis, Technische Universiteit Eindhoven, Netherlands.
Habets, E.A.P., Benesty, J., Chen, J., 2012. Multi-microphone noise reduction using interchannel and interframe correlations. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, pp. 305–308.
Habets, E.A.P., Gannot, S., Cohen, I., 2009. Late reverberant spectral variance estimation based on a statistical model. IEEE Signal Process. Lett. 16 (9), 770–773.
Hager, W.W., 1989. Updating the inverse of a matrix. SIAM Rev. 31 (2), 221–239.
Hu, Y., Loizou, P.C., 2008. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Language Process. 16 (1), 229–238.
Jukic, A., Doclo, S., 2014. Speech dereverberation using weighted prediction error with Laplacian model of the desired signal. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, pp. 5172–5176.
Jukic, A., van Waterschoot, T., Gerkmann, T., Doclo, S., 2015. Multi-channel linear prediction-based speech dereverberation with sparse priors. IEEE/ACM Trans. Audio Speech Language Process. 23 (9), 1509–1520.
Kinoshita, K., Delcroix, M., Nakatani, T., Miyoshi, M., 2009. Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction. IEEE Trans. Audio Speech Language Process. 17 (4), 534–545.
Kinoshita, K., Delcroix, M., Yoshioka, T., Nakatani, T., Sehr, A., Kellermann, W., Maas, R., 2013. The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, pp. 1–4.
Lehmann, E.A. Image-source method: Matlab code implementation. Available at http://www.eric-lehmann.com/, last accessed on May 2016.
Löllmann, H.W., Yilmaz, E., Jeub, M., Vary, P., 2010. An improved algorithm for blind reverberation time estimation. In: Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC), Tel Aviv, Israel, pp. 1–4.
Lu, Y., Loizou, P.C., 2008. A geometric approach to spectral subtraction. Speech Commun. 50 (6), 453–466.

Nakatani, T., Juang, B.H., Yoshioka, T., Kinoshita, K., Delcroix, M., Miyoshi, M., 2008a. Speech dereverberation based on maximum-likelihood estimation with time-varying Gaussian source model. IEEE Trans. Audio Speech Language Process. 16 (8), 1512–1527.
Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., Juang, B.H., 2008b. Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas, USA, pp. 85–88.
Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., Juang, B.H., 2010. Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio Speech Language Process. 18 (7), 1717–1731.
Naylor, P., Gaubitch, N. (Eds.), 2010. Speech Dereverberation. Springer-Verlag, London.
Parchami, M., Zhu, W.P., Champagne, B., 2016. Speech dereverberation using linear prediction with estimation of early speech spectral variance. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.
Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P., 2007. Numerical Recipes: The Art of Scientific Computing, 3rd ed. Cambridge University Press, New York.
Recommendation P.56: Objective measurement of active speech level, 1993. ITU-T.
Recommendation P.862: Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001. ITU-T.
Schmid, D., Malik, S., Enzner, G., 2012. An expectation-maximization algorithm for multichannel adaptive speech dereverberation in the frequency-domain. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, pp. 17–20.
SimData: dev and eval sets based on WSJCAM0, 2013. REVERB challenge. Available at http://reverb2014.dereverberation.com/download.html, last accessed on May 2016.
Togami, M., Kawaguchi, Y., 2013. Noise robust speech dereverberation with Kalman smoother. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, pp. 7447–7451.
Varga, A., Steeneken, H.J.M., 1993. Assessment for automatic speech recognition II: NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12 (3), 247–251.
Vaseghi, S.V., 2006. Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, Chapter 17.
Yoshioka, T., Nakatani, T., Miyoshi, M., 2009. Integrated speech enhancement method using noise suppression and dereverberation. IEEE Trans. Audio Speech Language Process. 17 (2), 231–246.
Yoshioka, T., Sehr, A., Delcroix, M., Kinoshita, K., Maas, R., Nakatani, T., Kellermann, W., 2012. Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition. IEEE Signal Process. Mag. 29 (6), 114–126.
