Robust Speech Recognition Using Adaptive Noise Cancellation
http://doi.org/10.26692/sujo/2017.12.0078
Abstract- This paper introduces an adaptive noise cancellation technique for noise reduction in robust Automatic Speech Recognition. Adaptive noise cancellation is used as a front-end stage to enhance the extracted features for speech recognition under noisy conditions. More specifically, the Constrained Stability Least Mean Square (CS-LMS) algorithm, a member of the family of adaptive filters, is applied. The Hidden Markov Model Toolkit (HTK) is used for training and testing the Automatic Speech Recognition system. The results obtained show that applying adaptive filtering at the front end enhances the performance of the system in noisy conditions, and that the CS-LMS algorithm gives the best performance among the family of LMS algorithms.
signals instead of a noise reference (Compernolle, 1992).

This paper addresses the challenge of environmental noise by applying an adaptive noise cancellation technique.

The rest of the paper is organized as follows. The next section explains the working of the adaptive noise cancellation approach and the variant algorithms of ANC. Section 3 describes the experimental setup and discusses the results obtained using the different ANC algorithms. Section 4 concludes the findings of this research and outlines future lines of research.

2. ADAPTIVE NOISE CANCELLATION

Adaptive Noise Cancellation (ANC) uses an adaptive filter to estimate noise from a reference noise signal that is statistically similar to the additive noise contained in the noisy speech (Górriz et al., 2009). Two microphones are required, one to capture the reference noise and one to capture the noise-corrupted speech signal. The adaptive filter updates its coefficients at every incoming signal sample using a feedback mechanism, which applies a weight update equation to compute the new filter coefficients. The working of a typical ANC mechanism is depicted in Figure 1.

The design of an adaptive filter is more difficult because of its shift-variant nature. Due to its adaptive nature, the coefficients ωₙ are not fixed but keep changing: at every iteration the coefficient set is updated with a new set of optimum filter coefficients. The coefficients are selected so as to minimize the mean square error

    ξ(n) = E{|e(n)|²}    (1)

where

    e(n) = d(n) − y(n) = d(n) − ωₙᵀ v₂(n)    (2)

Substituting d(n) = s(n) + v₁(n) in eq. (2) gives

    e(n) = s(n) + v₁(n) − ωₙᵀ v₂(n)    (3)

As can be seen in Figure 1, the input to the adaptive filter is v₂(n), a reference noise which is correlated with v₁(n). The FIR adaptive filter produces an estimate v̂₁(n) of v₁(n), which is then subtracted from d(n) to recover the clean speech s(n). So,

    y(n) = v̂₁(n) = ωₙᵀ v₂(n)    (4)

    e(n) = s(n) + v₁(n) − v̂₁(n)    (5)

and, when the noise estimate is accurate,

    e(n) ≈ s(n)    (6)
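To make the feedback mechanism concrete, the ANC loop described in this section can be sketched in a few lines of plain Python with a standard LMS weight update (a minimal illustration only; the toy sinusoidal signals, filter length and step size here are assumptions for the sketch, not values from this paper):

```python
# Illustrative ANC loop: d(n) = s(n) + v1(n); the adaptive FIR filter
# estimates v1(n) from the correlated reference v2(n), so that the error
# e(n) = d(n) - y(n) approximates the clean speech s(n).
import math

N, taps, mu = 2000, 4, 0.05                          # assumed toy values
s  = [math.sin(0.05 * n) for n in range(N)]          # slowly varying "speech"
v  = [math.sin(2.5 * n + 0.1) for n in range(N)]     # primary noise source
v1 = v[:]                                            # noise reaching the speech mic
v2 = [0.8 * x for x in v]                            # correlated reference noise
d  = [s[n] + v1[n] for n in range(N)]                # noisy speech

w = [0.0] * taps                                     # adaptive filter weights
err = []
for n in range(taps, N):
    x = v2[n - taps + 1:n + 1][::-1]                 # reference tap vector v2(n)
    y = sum(wi * xi for wi, xi in zip(w, x))         # y(n) = w^T v2(n), eq. (4)
    e = d[n] - y                                     # error, eqs. (2) and (5)
    w = [wi + mu * e * xi for wi, xi in zip(w, x)]   # LMS weight update
    err.append((e - s[n]) ** 2)                      # distance from clean speech

# As the filter converges, e(n) tracks s(n), illustrating eq. (6).
print(sum(err[:100]) / 100 > sum(err[-500:]) / 500)
```

After convergence the residual (e(n) − s(n))² shrinks by orders of magnitude, which is exactly the behaviour eqs. (4)–(6) describe.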
A relatively new and improved LMS algorithm, the Constrained Stability Least Mean Square (CS-LMS) algorithm, was proposed in (Górriz et al., 2009). Its weight update is given by

    ω(n + 1) = ω(n) + μ δe(n) δv₂*(n) / (ϵ + ‖δv₂(n)‖²)    (7)

where δe(n) = e(n) − e(n − 1) and δv₂(n) = v₂(n) − v₂(n − 1).

As explained in (Górriz et al., 2009), the CS-LMS and NLMS adaptive filters both converge to the optimal Wiener solution, and for a similar step size CS-LMS gives better filtering results than NLMS by improving the excess mean square error (EMSE) and the misadjustment.

3. EXPERIMENTS AND RESULTS

The experiments carried out in this research consist of recognizing speech in a noisy environment while using adaptive filters for noise filtering. The setup used is shown in Figure 2.

Figure 2: Proposed Robust ASR System

The input to the system is noisy speech and the output is the recognized words. The experimental setup is divided into two modules: the adaptive noise cancellation module, which estimates the clean speech from the noisy speech, and the ASR module, which converts the cleaned speech to recognized text.

The VidTIMIT dataset (Sanderson and Paliwal, 2002) is used for training and testing. The dataset consists of 10 pre-recorded English sentences for each speaker. Of these 10 sentences, 2 are the same for every speaker while the remaining 8 differ from speaker to speaker. There are 43 speakers in the dataset, resulting in a total of 430 sentences.

Additive White Gaussian Noise, as available in MATLAB, is used to create the noisy speech over the range from +60 dB to −60 dB.

HTK, a Hidden Markov Model (HMM) based ASR toolkit, is used for feature extraction and for training and testing the classifier. A five-state left-right HMM with 3 emitting states is trained using Mel-Frequency Cepstral Coefficient (MFCC) features extracted from the speech signals after noise removal.

Three different weight update equations, i.e. LMS, NLMS and CS-LMS, are used for noise removal. From v(n), a primary noise source, two different noise patterns v₁(n) and v₂(n) are created which are statistically similar to v(n) but have different parameters: v₁(n) is added to the clean speech and v₂(n) is passed to the adaptive filter for estimating the noise contained in the noisy speech. For practical reasons, the responses of h₁ and h₂ are

    H₁⁻¹(z) = 1 − 0.3z⁻¹ − 0.1z⁻²
    H₂⁻¹(z) = 1 − 0.2z⁻¹

The FIR adaptive filter is modeled using the 3 different LMS algorithms with 5 filter taps and a step size of 0.01. The percentage of correctly recognized words using the three algorithms at different noise levels is shown in Figure 3.

Figure 3: Recognition results (percentage of correctly recognized words vs. signal-to-noise ratio for LMS, NLMS and CS-LMS)
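The noise-shaping front end and the CS-LMS update of this section can be sketched as follows (plain Python, illustrative only). A primary noise v(n) is shaped into v₁(n) and v₂(n) using the two FIR responses given in the text, and the adaptive filter runs with 5 taps and step size 0.01 as in the experiments; the toy source signals and the small regularization constant ϵ are assumptions of this sketch:

```python
# Sketch of the experimental front end: a primary noise v(n) is shaped by
# the two FIR responses from the text into v1(n) (added to the clean speech)
# and v2(n) (the reference fed to the adaptive filter), then the CS-LMS
# update w(n+1) = w(n) + mu*de(n)*dv2(n) / (eps + ||dv2(n)||^2) is applied.
import math

N, taps, mu, eps = 4000, 5, 0.01, 1e-6               # taps and mu as in the text
s = [math.sin(0.05 * n) for n in range(N)]           # toy "speech" (assumed)
v = [math.sin(2.0 * n) + 0.5 * math.sin(2.9 * n) for n in range(N)]  # primary noise

def fir(x, h):
    """Filter x with FIR coefficients h (h[0] applies to the current sample)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

v1 = fir(v, [1.0, -0.3, -0.1])   # shaping by 1 - 0.3 z^-1 - 0.1 z^-2
v2 = fir(v, [1.0, -0.2])         # shaping by 1 - 0.2 z^-1
d = [s[n] + v1[n] for n in range(N)]

w = [0.0] * taps
e_prev, x_prev = 0.0, [0.0] * taps
res = []
for n in range(taps, N):
    x = v2[n - taps + 1:n + 1][::-1]                 # reference tap vector v2(n)
    e = d[n] - sum(wi * xi for wi, xi in zip(w, x))  # error e(n)
    de = e - e_prev                                  # delta e(n) = e(n) - e(n-1)
    dx = [a - b for a, b in zip(x, x_prev)]          # delta v2(n)
    norm = eps + sum(a * a for a in dx)              # eps + ||delta v2(n)||^2
    w = [wi + mu * de * a / norm for wi, a in zip(w, dx)]  # CS-LMS update
    e_prev, x_prev = e, x
    res.append((e - s[n]) ** 2)                      # distance from clean speech

# The residual shrinks as the filter learns to map v2(n) onto v1(n).
print(sum(res[:500]) / 500 > sum(res[-500:]) / 500)
```

Normalizing by the energy of the differenced reference is what makes the update stable for a fixed μ regardless of the reference signal's power, which is the property the results section attributes to CS-LMS's better tracking.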
M. WAQAS et al., 898
The results in Fig. 3 show that for the unprocessed noisy speech the percentage of correctly recognized words drops rapidly even at fairly high SNR levels, i.e. below 25 dB. At negative SNR levels, the percentage of correct words remains around 21%. Using the LMS algorithm, the performance is almost the same as that of the noisy speech. This is probably because the choice of step size μ depends on the maximum eigenvalue of the autocorrelation matrix of the reference input to the adaptive filter, i.e. v₂(n), and also because of the weak tracking ability of the LMS algorithm. The NLMS algorithm shows fairly good results even at a −10 dB noise level, whereas CS-LMS shows the best results, with a much higher accuracy of 60% at an SNR of −20 dB. This is because CS-LMS has better tracking ability, owing to its minimum EMSE and misadjustment, compared to LMS and NLMS.

4. CONCLUSION

This paper presented robust speech recognition using adaptive noise cancellation. ANC using the LMS, NLMS and CS-LMS algorithms was investigated and the word recognition performance compared over a range of SNR levels. This work showed that CS-LMS performs better than the other algorithms due to its efficient weight update mechanism. For recognition, an HMM based classifier was used with MFCC features. The noise type used in this work is Additive White Gaussian Noise, which is the most generic noise type; however, the system can also be tested on specific noise types such as car noise, canteen noise and fan noise. Various other types of classifier, such as LDA, PCA and hybrid HMM/ANN, can be used with other features such as PLP, WT and LPC.

REFERENCES:

Crowley, T., and C. Bowern, (2010). An Introduction to Historical Linguistics. Oxford University Press.

Choi, E., (2004). Noise robust front-end for ASR using spectral subtraction, spectral flooring and cumulative distribution mapping. 10th Australian International Conference on Speech Science and Technology: 451-456.

Compernolle, D. V., (1992). DSP techniques for speech enhancement. ESCA Workshop on Speech Processing in Adverse Conditions: 21-30.

De Wet, F., B. Cranen, J. de Veth, and L. Boves, (2001). A comparison of LPC and FFT-based acoustic features for noise robust ASR. INTERSPEECH: 865-868.

Gales, M. J. F., and S. J. Young, (1993). Parallel model combination for speech recognition in noise. University of Cambridge, Department of Engineering.

Gales, M. J. F., and S. J. Young, (1993a). Cepstral parameter compensation for HMM recognition in noise. Speech Communication, 12: 231-239.

Górriz, J. M., J. Ramírez, S. Cruces-Alvarez, C. G. Puntonet, E. W. Lang, and D. Erdogmus, (2009). A novel LMS algorithm applied to adaptive noise cancellation. IEEE Signal Processing Letters, 16(1): 34-37.

Hermansky, H., and S. Sharma, (1999). Temporal patterns (TRAPS) in ASR of noisy speech. 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1: 289-292.

Kallasjoki, H., C. Magi, P. Alku, and M. Kurimo, (2009). Noise robust LVCSR feature extraction based on stabilized weighted linear prediction. 13th International Conference on Speech and Computer (SPECOM): 221-225.

Patynen, J., (2009). Feature enhancement in automatic speech recognition. In: Studies on Noise Robust Automatic Speech Recognition, K. J. Palomaki, U. Remes and M. Kurimo (Eds.): 35-44.

Raj, B., and R. M. Stern, (2005). Missing-feature approaches in speech recognition. IEEE Signal Processing Magazine, 22(5): 101-116.

Rajnoha, J., and P. Pollák, (2011). ASR systems in noisy environment: Analysis and solutions for increasing noise robustness. Radioengineering, 20(1): 74-83.

Smit, P., (2009). Model compensation for noise robust speech recognition. In: Studies on Noise Robust Automatic Speech Recognition, K. J. Palomaki, U. Remes and M. Kurimo (Eds.): 45-52.

Sanderson, C., and K. K. Paliwal, (2002). Polynomial features for robust face authentication. 2002 International Conference on Image Processing, 3: 997-1000.

Zhang, C., J. van de Weijer, and J. Cui, (2006). Intra- and inter-speaker variations of formant pattern for lateral syllables in Standard Chinese. Forensic Science International, 158(2): 117-124.