Convolutive Blind Source Separation with Wiener Post-Filtering for Robust Speech Recognition

Leandro Di Persia, Diego Milone, and Masuzo Yanagida

Department of Knowledge Engineering, Doshisha University, Japan
[email protected], [email protected], [email protected]
1 Introduction
The objective of blind source separation (BSS) is, given a set of sound
field measurements obtained by microphones at specified locations,
to obtain a set of signals approximating the original sound sources that
produced the sound field. In the case of free-field propagation of sound (i.e. in
open spaces without enclosures), the sound wave originated at each source arrives
only once at each sensor, so the mixture can be considered
a linear additive mixture. On the contrary, when the mixture is produced inside
an enclosed environment, the sound waves are reflected by every solid surface in
the room, and so each microphone receives not only the direct sound wave but
also all the reflections, and moreover, the reflections of all orders, until their
energy dies away.
This work is supported by ANPCyT-UNER, under Project PICT N 11-12700, UNL-CAID 012-72, and CONICET.
AST2006. 2006.
\[ x_j(t) = \sum_{i=1}^{N} h_{ji}(t) * s_i(t) \]

where x_j is the j-th microphone signal, s_i is the i-th source, h_{ji} is the impulse
response of the room from source i to microphone j, and ∗ stands for convolution.
This equation can be written in compact form as

\[ \mathbf{x}(t) = \mathbf{H}(t) * \mathbf{s}(t) \]

where H is the matrix of impulse responses h_{ji}.
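As an illustration, the convolutive mixing model above can be simulated directly from a set of impulse responses; the function name and array shapes below are illustrative choices, not from the paper:

```python
import numpy as np

def convolutive_mix(sources, H):
    """Simulate the convolutive mixing model x_j = sum_i h_ji * s_i.

    sources: (n_src, T) array, one source signal per row.
    H:       (n_mic, n_src, L) array of room impulse responses h_ji.
    Returns the (n_mic, T + L - 1) array of microphone signals.
    """
    n_mic, n_src, L = H.shape
    T = sources.shape[1]
    x = np.zeros((n_mic, T + L - 1))
    for j in range(n_mic):
        for i in range(n_src):
            # each source reaches microphone j filtered by h_ji
            x[j] += np.convolve(H[j, i], sources[i])
    return x
```

With L = 1 (a single free-field path and no reflections) this reduces to the instantaneous linear mixture x = A s discussed above.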
Others have proposed the use of fixed-point algorithms derived from the FastICA
algorithm [9,10,11]. Some information-theoretic algorithms based on minimization
of mutual information [12], information maximization (InfoMax) [13] or the
Kullback-Leibler divergence [14], combined with the Natural Gradient [4], have
also been used successfully. More recently, algorithms combining several
techniques have been presented, as in [15], where FastICA is followed
by an InfoMax ICA with the Natural Gradient.
In this paper the separation is solved by a combination of two methods. Then
a Wiener time-frequency filter is estimated and applied in order to improve
the separation. In the following sections the algorithm is explained in
detail, followed by experiments to evaluate the quality of the separation,
the results and their analysis, and finally some conclusions.
L. Di Persia, D. H. Milone & Masuzo Yanagida; "Convolutive Blind Source Separation with Wiener post-filtering for robust Speech Recognition"
2 Separation Algorithms
For each frequency bin, two separation algorithms for complex-valued signals
are applied sequentially. In the first stage, the Joint Approximate Diagonalization
of Eigenmatrices (JADE) algorithm [16] is applied to obtain a first estimate
of the separation matrix W. This matrix is then refined by using it as the
initial condition for the FastICA algorithm [17].
After separation in each frequency bin there are two problems to solve. All
ICA algorithms estimate the sources only up to a scaling and permutation
indeterminacy, which means that for each frequency the resulting
sources will have a different scaling and a different ordering. Thus, before any
other step, one needs to order the sources and obtain a consistent scaling for
each frequency.
Following this, a Wiener time-frequency filter estimated from the separated
sources is applied. Finally, an inverse STFT is applied to yield the time-domain
sources. In the following subsections these aspects are discussed in detail.
The sum of contrasts

\[ \sum_{i=1}^{N} J_G(\mathbf{w}_i) = \sum_{i=1}^{N} E\left\{ G\!\left( \left| \mathbf{w}_i^H \mathbf{x} \right|^2 \right) \right\} \]

is maximized subject to

\[ E\left\{ \left( \mathbf{w}_i^H \mathbf{x} \right) \left( \mathbf{w}_j^H \mathbf{x} \right)^{*} \right\} = \delta_{ij} . \qquad (9) \]

In these equations, E{·} is the expectation operator, and G : R⁺ ∪ {0} → R is a
smooth even function. For this work we have used the function G(y) = log(α + y),
with α = 0.1.
To achieve this, the contrast function J_G(w_i) is maximized to obtain a
separation vector w_i; then a deflationary, Gram-Schmidt-like decorrelation is used
to eliminate the information of previously obtained sources, and this process is
iterated until all desired sources are extracted. It must be noted that the matrix W
will have w_i^H as its i-th row. An alternative to this sequential extraction is to
extract all sources at once, optimizing a matrix W and applying an orthonormalization
method to that matrix after each iteration. In this paper we have used the
deflationary approach.
This is a Newton-like fixed-point iteration with quadratic convergence and,
as such, it is very fast. Like all fixed-point methods, it depends on a good
estimate of the initial conditions, which we obtain from the output of the JADE
algorithm.
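The fixed-point update with deflation described above can be sketched as follows. This is a generic complex FastICA in the style of Bingham and Hyvärinen using the paper's nonlinearity G(y) = log(α + y), not the authors' exact implementation; it assumes the input has already been whitened, and the function name and defaults are illustrative:

```python
import numpy as np

def complex_fastica_deflation(x, n_sources, alpha=0.1, max_iter=200, tol=1e-6):
    """Deflationary complex FastICA with G(y) = log(alpha + y).

    x: (n_channels, n_samples) complex array, assumed pre-whitened.
    Returns W with w_i^H as its i-th row (rows orthonormal).
    """
    n_ch, _ = x.shape
    W = np.zeros((n_sources, n_ch), dtype=complex)
    rng = np.random.default_rng(0)
    for i in range(n_sources):
        w = rng.standard_normal(n_ch) + 1j * rng.standard_normal(n_ch)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = w.conj() @ x                  # w^H x for every sample
            y2 = np.abs(y) ** 2
            g = 1.0 / (alpha + y2)            # g = G'
            dg = -1.0 / (alpha + y2) ** 2     # g'
            # Newton-like fixed-point update
            w_new = (x * (y.conj() * g)).mean(axis=1) - (g + y2 * dg).mean() * w
            # Gram-Schmidt deflation: remove already-extracted directions
            for j in range(i):
                w_new -= (W[j].conj() @ w_new) * W[j]
            w_new /= np.linalg.norm(w_new)
            converged = np.abs(np.abs(w_new.conj() @ w) - 1.0) < tol
            w = w_new
            if converged:
                break
        W[i] = w
    return W
```

The deflation step inside the loop is what makes the extraction sequential: each new vector is decorrelated from those already found before being renormalized.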
2.3 Indeterminacies
\[
\begin{aligned}
\mathbf{x} &= W^{-1} \mathbf{y} = W^{-1} W \mathbf{x} = W^{-1} I W \mathbf{x} \\
           &= W^{-1} \left( E_1 + \cdots + E_M \right) W \mathbf{x} \\
           &= W^{-1} E_1 \mathbf{y} + \cdots + W^{-1} E_M \mathbf{y} \\
           &= \mathbf{v}_1 + \cdots + \mathbf{v}_M . \qquad (10)
\end{aligned}
\]
where Ei is a matrix with a one in the i-th diagonal element and zeros elsewhere.
It is easy to prove that the representation of vi is independent of the scaling in
matrix W.
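A minimal sketch of the decomposition of Eq. (10); the function name and array shapes are assumptions for illustration:

```python
import numpy as np

def source_images(W, x):
    """Split the observations x (channels x frames, one frequency bin)
    into per-source images v_i = W^{-1} E_i W x (Eq. (10)).

    The v_i sum back to x and are invariant to any rescaling of the
    rows of W, which is what removes the scaling indeterminacy.
    """
    y = W @ x                       # separated sources, one per row
    Winv = np.linalg.inv(W)
    # W^{-1} E_i y keeps only the i-th separated source, mapped back
    # to the sensor space through the inverse separation matrix.
    return [Winv[:, [i]] @ y[[i], :] for i in range(W.shape[0])]
```

Rescaling W by any diagonal matrix D cancels out (W^{-1} D^{-1} E_i D W = W^{-1} E_i W), which is easy to verify numerically.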
Now for the permutation problem, the approach makes use of the fact that
the envelopes of different sound signals must be different and also that, if the
signals are independent, the correlation between the envelopes of the separated
sources must vanish. This holds within one frequency bin; however, one can expect
successive frequency bins to share the same or similar envelopes. This
is the information used to solve the permutation problem: starting from some
frequency band, an estimate of the envelope based on previously classified bands
is calculated. Then, for each separated signal in a new frequency bin, the correlation
between its envelope and the one estimated from the pre-classified bins is calculated,
and the signal is assigned to the class of maximum correlation value [7].
In the original paper, pre-classified envelopes are estimated as an average
of all the previously classified envelopes in that class. In this paper, instead of
using this approach, we assume that in the averaging process the last classified
envelopes must have more weight, since they will be more similar to the envelopes
that follow for classification. Therefore, instead of a simple average of envelopes,
we update that value as a recursive weighted average, where Ē(·)_j refers to the
locally averaged envelope for source class j, and E(·)_j to the last classified
envelope for this class.
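A sketch of the envelope-correlation classification and the weighted update just described. The particular weighting (a recursive forgetting factor `lam`) and the function names are assumptions for illustration:

```python
import numpy as np

def correlation(a, b):
    """Normalized correlation between two envelope vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def classify_bin(bin_envelopes, class_envelopes):
    """For each separated signal of the current frequency bin, return the
    index of the class whose averaged envelope correlates best with it."""
    return [int(np.argmax([correlation(e, c) for c in class_envelopes]))
            for e in bin_envelopes]

def update_class_envelope(avg_env, new_env, lam=0.5):
    """Recursive weighted average of class envelopes: larger lam weights
    the last classified envelope more heavily, as proposed in the text."""
    return (1.0 - lam) * avg_env + lam * new_env
```

After classifying a bin, `update_class_envelope` refreshes each class average so that recently classified bins dominate, matching the intuition that adjacent frequency bins have the most similar envelopes.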
After this process we obtain time-frequency representations for each of the
source components at each sensor. That is, for each source we obtain N time-
frequency representations, each one corresponding to the effect of that source on
one of the sensors, isolated from the effects of the other sources. Since usually only
one representation per source is needed, in this study we keep the one with the
largest energy for the following steps.
The separation result will not be perfect, mainly due to reverberation and
the simplified time-invariant modeling. When the reverberation time increases, the
performance of the algorithms tends to decrease. Therefore we propose the use of
a non-causal time-frequency Wiener filter as post-processing [18]. Without loss of
generality, this will be explained for the case of two sources and two microphones;
the generalization to more sources is straightforward.
The short-time Wiener filter H_W for a signal generated by the simple additive
noise model is

\[ H_W(\omega, \tau) = \frac{ \left| \tilde{z}(\omega,\tau) \right|^2 - \left| \tilde{n}(\omega,\tau) \right|^2 }{ \left| \tilde{z}(\omega,\tau) \right|^2 } \qquad (12) \]

where ñ represents the estimated additive noise. In our case we obtain two
signals, ṽ_1 and ṽ_2, and if the separation process was successful one can use them
as estimates of the clean sources.
So, in order to eliminate the residual information from source v_2 in source v_1, we
can use the short-time power spectrum of ṽ_1 as the numerator (the estimate of the
clean source) and the sum of the short-time power spectra of ṽ_1 and ṽ_2 as the
estimate of the noisy power spectrum in the denominator. Moreover, since we know
that both signals will carry some information from each other, and that this sharing
will not be uniform over the whole time-frequency plane, one can use time-frequency
weights to reduce the effect of the filter, as expressed in the following equation:
\[ H_{W,1}(\omega, \tau) = \frac{ \left| \tilde{v}_1(\omega,\tau) \right|^2 }{ \left| \tilde{v}_1(\omega,\tau) \right|^2 + C(\omega,\tau) \left| \tilde{v}_2(\omega,\tau) \right|^2 } \qquad (13) \]
If the time-frequency contents of ṽ_1 and ṽ_2 are very similar (meaning that for
that time-frequency coordinate the separation was not well done), the weights must
be close to zero; otherwise they must be close to one. There are several ways to set
these weights; one simple way is to use dot products to measure the time and
frequency similarity of the power spectra.
The short-time Wiener filter to improve source v2 , HW,2 (ω, τ ) is calculated
in a similar way to (13), with the roles of v1 and v2 interchanged.
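A sketch of applying the weighted mask of Eq. (13) to an STFT. The default choice of C and the regularizing `eps` are assumptions for illustration:

```python
import numpy as np

def wiener_postfilter(V1, V2, C=None, eps=1e-12):
    """Apply the weighted time-frequency Wiener mask of Eq. (13) to the
    STFT V1 of a separated source, with V2 as the estimate of the
    residual interference.

    C is the time-frequency weight map: C = 1 applies the full filter,
    C = 0 bypasses it where the separation is deemed unreliable.
    """
    if C is None:
        C = np.ones(V1.shape)
    P1 = np.abs(V1) ** 2
    P2 = np.abs(V2) ** 2
    mask = P1 / (P1 + C * P2 + eps)   # H_{W,1}(omega, tau)
    return mask * V1
```

The filter for the second source is obtained by calling the same function with the roles of V1 and V2 interchanged.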
To test the capabilities of this algorithm, a set of experiments was carried out.
Sentences for the experiments were extracted from the Albayzin Spanish speech
corpus [19]. From this large database, a subset of 605 sentences was selected
and divided into a training set of 585 sentences and a test set of 20 sentences. The
training set was used to train a recognizer with clean data. The test set has 5
sentences spoken by 4 speakers: sentences 3001, 3006, 3010, 3018 and
3022, from speakers aagp and algp (female) and mogp and nyge (male). These
20 sentences were recorded in a room according to Fig. 1.
[Fig. 1. Layout of the recording room, showing the positions of the source and noise loudspeakers and of microphones 1 and 2.]

Fig. 2. Mixed signals. Top: microphone 1; bottom: microphone 2. Signal-to-noise power ratio: 0 dB.
Two loudspeakers were used, one to reproduce the desired speech source and the
other to reproduce some kind of noise. The resulting sound field was recorded
with two Ono Sokki MI 1233 omnidirectional measurement microphones, with
flat frequency response from 20 Hz to 20 kHz, and with Ono Sokki MI 3110
preamplifiers.

Interfering sources were of two kinds: speech and white noise. For speech
noise, sentence 3110 and speakers aagp and nyge were selected. To contaminate
female utterances male speech (nyge) was used, and vice versa. White noise was
taken from the Noisex database [20]. For both noise kinds, two different signal-to-noise
output power ratios were used, 0 dB and 6 dB. A 0 dB output power ratio
means that both signals were replayed at the loudspeakers with equal power⁴.
It can be seen how the main structures of the original signal are present, in an
enhanced way, in the separated signal.
To test the performance of the algorithm, we have used a speech recognition
system to estimate the improvement in word recognition rate before and after
separation. For this test we used a continuous speech recognizer based
on tied Gaussian-mixture hidden Markov models (HMMs). This recognizer was
first trained with 585 sentences from the Minigeo subset of the Albayzin database
(the training set does not include any of the phrases used in the test).

After training, we tested the recognition system with the clean sentences
of our test set. To see how the mixing process degrades the recognition output, we
also evaluated recognition accuracy on the mixtures. We then applied
the separation algorithm without the time-frequency Wiener filter (this test is
denoted J+F because it includes only JADE and FastICA separation) and measured
the recognition accuracy. Finally, the algorithm for separation including the Wiener
filter was applied (this test is named J+F+W) and the recognition accuracy test
was performed on that data. Table 1 shows the results of these tests. In
the table, for each test the word recognition accuracy percentage (WACC %) is
⁴ In a similar way to the standard SNR.
[Fig. 3. Separated signals. Top: separated source 1; bottom: separated source 2.]
Fig. 4. Source signals. Top: aagp3002, desired source, female; bottom: nyge3110, noise,
male.
Fig. 5. Spectrograms of a) mixed signal; b) separated signal and c) source signal, for
a mixture of speech with speech interference emitted with equal power (power ratio of
0 dB).
calculated as

\[ \mathrm{WACC}\,\% = 100 \, \frac{N - D - S - I}{N} \qquad (14) \]
where N is the number of words in the reference transcription, D is the number of
deletion errors (words present in the reference transcription that are not present
in the system transcription), S is the number of substitution errors (words that
were substituted by others in the system transcription) and I is the number
of insertion errors (extra words that were in the system transcription but not
in the reference transcription). This measure is more representative of
recognizer performance than the standard word recognition rate [21]. As can be seen
from these results, the word accuracy improvements are on the order of 70%.
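Eq. (14) in code form, as a quick reference for how the scores in Table 1 are computed:

```python
def word_accuracy(N, D, S, I):
    """Word accuracy of Eq. (14).

    Unlike the plain word recognition rate, it also penalizes insertion
    errors, so it can even go negative when the recognizer inserts many
    spurious words.
    """
    return 100.0 * (N - D - S - I) / N
```

For example, 100 reference words with 5 deletions, 10 substitutions and 5 insertions give a word accuracy of 80%.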
Table 1. Word accuracy in robust speech recognition. For reference, with clean sources
WACC % = 91.50%. PR: power ratio in dB.

In this paper an algorithm for blind source separation of convolved sources has
been presented. The Wiener post-filtering used to improve the output of the
separation stage reduces the residual interference in the areas where the desired
source has low power. As shown by the examples in Figs. 2, 3 and 4, the quality
can be enhanced to a great extent even in a very bad mixture with equal noise power.
Also, the robust speech recognition rates show a very important improvement
in word accuracy, from almost zero percent for the mixtures to about 70% after
separation with the proposed algorithm. It must be noted that the use of the Wiener
filter has a large effect on the word accuracy improvement.
There are some issues that must be addressed in future work. First, we
need to explore the capabilities of this algorithm for shorter data. The tests
presented here were performed on data with an average duration of 2 seconds.
Some applications, such as remote control of home devices by voice commands
or real-time processing for hearing aids, require shorter data to be processed.
The algorithms used for separation in each frequency bin need a large amount
of data to estimate the statistical properties of the signals accurately, so we need
to check whether they will still work when less data is available.
Finally, some fine tuning of the algorithm parameters, such as the window type
and length used in the short-time Fourier transform or the parameters of the
Wiener filter, will be explored.
References
1. Kahrs, M., Brandenburg, K., eds.: Applications of Digital Signal Processing to Audio
and Acoustics. The Kluwer International Series in Engineering and Computer
Science. Kluwer Academic Publishers (2002)
2. Kingsbury, B., Morgan, N.: Recognizing reverberant speech with RASTA-PLP. In:
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal
Processing. (1997) 1259–1262
3. Benesty, J., Makino, S., Chen, J., eds.: Speech Enhancement. Signals and
Communication Technology. Springer (2005)
4. Cichocki, A., Amari, S.i.: Adaptive Blind Signal and Image Processing. Learning
Algorithms and applications. John Wiley & Sons (2002)
5. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John
Wiley & Sons, Inc. (2001)
6. Parra, L., Spence, C.: Convolutive blind separation of non-stationary sources.
IEEE Transactions on Speech and Audio Processing 8(3) (2000) 320–327
7. Murata, N., Ikeda, S., Ziehe, A.: An approach to blind source separation based on
temporal structure of speech signals. Neurocomputing 41(1-4) (2001) 1–24
8. Araki, S., Makino, S., Hinamoto, Y., Mukai, R., Nishikawa, T., Saruwatari, H.:
Equivalence between Frequency-Domain blind source separation and Frequency-
Domain adaptive beamforming for convolutive mixtures. EURASIP Journal on
Applied Signal Processing 2003(11) (2003) 1157–1166
9. Mitianoudis, N., Davies, M.: New fixed-point ICA algorithms for convolved mixtures.
In: Proceedings of the Third International Conference on Independent Component
Analysis and Source Separation. (2001) 633–638
10. Prasad, R., Saruwatari, H., Lee, A., Shikano, K.: A fixed-point ICA algorithm for
convoluted speech signal separation. In: Proceedings of the Fourth International
Symposium on Independent Component Analysis and Blind Signal Separation.
(2003) 579–584
11. Gotanda, H., Nobu, K., Koya, T., Kaneda, K., Ishibashi, T., Haratani, N.: Permutation
correction and speech extraction based on split spectrum through FastICA. In:
Proceedings of the Fourth International Symposium on Independent Component
Analysis and Blind Signal Separation. (2003) 379–384
12. Douglas, S.C., Sun, X.: Convolutive blind separation of speech mixtures using the
natural gradient. Speech Communication 39(1-2) (2003) 65–78
13. Sawada, H., Mukai, R., Araki, S., Makino, S.: A robust and precise method for
solving the permutation problem of frequency-domain blind source separation. IEEE
Transactions on Speech and Audio Processing 12(5) (2004) 530–538
14. Araki, S., Mukai, R., Makino, S., Nishikawa, T., Saruwatari, H.: The fundamental
limitation of frequency domain blind source separation for convolutive mixtures of
speech. IEEE Transactions on Speech and Audio Processing 11(2) (2003) 109–116
15. Makino, S., Sawada, H., Mukai, R., Araki, S.: Blind Source Separation of
Convolutive Mixtures of Speech in Frequency Domain. IEICE Trans. Fundamentals
E88-A(7) (2005) 1640–1655
16. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non-Gaussian signals. IEE