
Sindh Univ. Res. Jour. (Sci. Ser.) Vol. 49 (004) 895-898 (2017)

http://doi.org/10.26692/sujo/2017.12.0078

SINDH UNIVERSITY RESEARCH JOURNAL (SCIENCE SERIES)

Robust Speech Recognition Using Adaptive Noise Cancellation


M. WAQAS, M. A. A. KHAN, M. NAEEM*, A. GUL**, N. AHMAD+
Department of Computer Systems Engineering, University of Engineering and Technology, Peshawar
Received 30th July 2016 and Revised 24th March 2017

Abstract- This paper introduces an adaptive noise cancellation technique for noise reduction in robust Automatic Speech Recognition. Adaptive noise cancellation is used as a front-end stage to enhance the extracted features for speech recognition under noisy conditions. More specifically, the Constrained Stability Least Mean Square (CS-LMS) algorithm, a member of the family of adaptive filters, has been applied. The Hidden Markov Model Toolkit (HTK) is used for training and testing the automatic speech recognizer. The results obtained show that applying adaptive filtering at the front end enhances the performance of the system in noisy conditions, and that the CS-LMS algorithm gives the best performance among the family of LMS algorithms.

Keywords: Automatic Speech Recognition, Robust Speech Recognition, Adaptive Filtering

1. INTRODUCTION

Automatic Speech Recognition (ASR) systems have a wide range of practical applications; however, their deployment in many real-life applications is hindered by a number of issues, such as inter-speaker and intra-speaker variability, co-articulation, cross-talk, and environmental noise (Crowley and Bowern, 2010). By far the most challenging of these is noise. Although today's ASR systems can achieve a reasonable level of accuracy in controlled environments, their performance degrades drastically when noise is introduced (Rajnoha and Pollák, 2011). Background noise makes the features extracted from the test data significantly different from the corresponding features extracted from the training data, rendering the results inaccurate (De Wet et al., 2001; Zhang et al., 2006).

Most recent research on ASR focuses on speech recognition under noisy conditions, and various techniques have been developed for recognition under real-life conditions (Gales and Young, 1993; Hermansky and Sharma, 1999; Choi, 2004). Missing-feature methods detect the corrupted spectral features and either correct or ignore them (Raj and Stern, 2005). Robust feature extraction methods are based on extracting features that are inherently robust to environmental noise, such as Linear Predictive (LP) spectral estimates, Minimum Variance Distortionless Response (MVDR) modeling, and RASTA techniques (Kallasjoki et al., 2009). Compensation models are based on the idea that, alongside an HMM of the speech, an HMM of the noise can be created under the assumption that the linear power versions of the speech HMM and the noise HMM are orthogonal, and thus linearly independent and additive; the combined linear power version of both HMMs is then used in testing (Smit, 2009). A widely used compensation model is Parallel Model Combination (PMC) (Gales and Young, 1993a). Some robust speech recognition techniques adhere to the notion that clean speech is corrupted by an unwanted signal or noise, e.g. the humming of an engine, another speaker, traffic, or a fan, and that this unwanted noise can be removed from the corrupted speech signal by estimating the noise spectra (Patynen, 2009). More complex techniques use statistical estimation of the noise for its removal from the corrupted speech (Compernolle, 1992). These methods rely on assumptions about the noise type: compensation models and feature enhancement methods are sensitive to the noise type and give highly accurate results only for selected background noises, whereas robust feature extraction and missing-data methods give fairly accurate results across all noise types (Smit, 2009). Single-microphone techniques estimate and correct the speech signal with fairly good accuracy, but multiple-microphone methods estimate the noise much more accurately (Patynen, 2009). Multiple-microphone methods include adaptive noise cancellation, beamforming, and blind source separation. Adaptive noise cancellation uses a noise reference signal and an adaptive filter to estimate the noise in the corrupted signal and removes it to recover the clean speech. Beamforming uses multiple degraded
+ Corresponding author email: N. Ahmad, [email protected]
* Department of Computer Science, University of Peshawar
** Department of Statistics, Shaheed Benazir Bhutto Women University Peshawar

signals instead of a noise reference (Compernolle, 1992). This paper addresses the challenge of environmental noise by applying an adaptive noise cancellation technique.

The rest of the paper is organized as follows. The next section explains the working of the adaptive noise cancellation approach and its variant algorithms. Section 3 describes the experimental setup and discusses the results obtained with the different ANC algorithms. Section 4 concludes the findings of this research and outlines future lines of work.

2. ADAPTIVE NOISE CANCELLATION

Adaptive Noise Cancellation (ANC) uses an adaptive filter to estimate the noise from a reference noise signal that is statistically similar to the additive noise contained in the noisy speech (Górriz et al., 2009). Two microphones are required, one to capture the reference noise and one the noise-corrupted speech signal. The adaptive filter updates its coefficients at every incoming signal sample using a feedback mechanism, which applies a weight update equation to compute the new filter coefficients. The working of a typical ANC mechanism is depicted in Figure 1.

Figure 1: ANC (Adaptive Noise Canceller)

Here s(n) is the clean (primary) speech and d(n) is the noisy speech signal corrupted by v1(n), a noise statistically similar to v2(n), the reference noise input to the adaptive filter; e(n) is the error signal and the feedback to the adaptive filter. A typical FIR adaptive filter, as shown in Figure 1, can be expressed as

    y(n) = Σ_{l=0..p} ωn(l) v2(n − l)    (1)

or, in vector notation, y(n) = ωnᵀ v2(n), where ωn(l) is the value of the l-th adaptive filter coefficient at time n, and

    ωn = [ωn(0), ωn(1), …, ωn(p)]ᵀ
    v2(n) = [v2(n), v2(n − 1), …, v2(n − p)]ᵀ

The design of an adaptive filter is more difficult because of its shift-variant nature: the coefficients ωn are not fixed but keep changing. At every iteration the coefficient set is updated with a new set of optimum filter coefficients, chosen to minimize the mean square error ξ(n) = E{|e(n)|²}, where

    e(n) = d(n) − y(n) = d(n) − ωnᵀ v2(n)    (2)

Replacing d(n) = s(n) + v1(n) in Eq. (2) gives

    e(n) = s(n) + v1(n) − ωnᵀ v2(n)    (3)

As can be seen in Figure 1, the input to the adaptive filter is v2(n), a reference noise correlated with v1(n). The FIR adaptive filter produces an estimate v̂1(n) of v1(n), which is then subtracted from d(n) to recover the clean speech. So,

    y(n) = v̂1(n) = ωnᵀ v2(n)    (4)

    e(n) = s(n) + v1(n) − v̂1(n)    (5)

    e(n) = ŝ(n)    (6)

There are a number of adaptive algorithms, the most basic being the LMS adaptive filter. Its weight update equation is

    ω(n + 1) = ω(n) + μ e(n) v2*(n)

For good tracking ability and convergence in the mean, the step size should satisfy 0 < μ < 2/λmax, where λmax is the maximum eigenvalue of the autocorrelation matrix Rv2 of v2. Adaptive filters are compared and characterized in terms of their misadjustment and Excess Mean Square Error (EMSE); decreasing the misadjustment and EMSE minimizes the mean square error and thus improves the performance of the adaptive filter.

The problem with the LMS algorithm is that λmax and Rv2 are not known and therefore have to be estimated. To cope with this, a variant known as Normalized LMS (NLMS) is used, whose weight update equation is

    ω(n + 1) = ω(n) + β e(n) v2*(n) / (ϵ + ‖v2(n)‖²)

where 0 < β < 2 is the normalized step size and ϵ is a small positive number.
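The LMS and NLMS weight updates described in this section can be sketched in a few lines of code. The following NumPy implementation is an illustrative sketch, not the authors' implementation; the toy signals (a sinusoid standing in for the clean speech s(n), and v1(n), v2(n) derived from a common white-noise source) are assumptions made purely for demonstration.

```python
import numpy as np

def lms_anc(d, v2, p=5, mu=0.01):
    """LMS adaptive noise canceller.
    d  : noisy speech, d(n) = s(n) + v1(n)
    v2 : reference noise, correlated with v1(n)
    Returns e(n), the clean-speech estimate."""
    w = np.zeros(p + 1)                       # coefficients w_n(0..p), Eq. (1)
    e = np.zeros(len(d))
    for n in range(p, len(d)):
        x = v2[n - p:n + 1][::-1]             # [v2(n), v2(n-1), ..., v2(n-p)]
        e[n] = d[n] - w @ x                   # e(n) = d(n) - w^T v2(n)
        w += mu * e[n] * x                    # LMS weight update
    return e

def nlms_anc(d, v2, p=5, beta=0.2, eps=1e-8):
    """NLMS: step size normalized by the reference-vector energy."""
    w = np.zeros(p + 1)
    e = np.zeros(len(d))
    for n in range(p, len(d)):
        x = v2[n - p:n + 1][::-1]
        e[n] = d[n] - w @ x
        w += beta * e[n] * x / (eps + x @ x)  # NLMS weight update
    return e

# Toy demo: sinusoidal "speech" plus two correlated noise channels.
rng = np.random.default_rng(0)
N = 20000
s = np.sin(2 * np.pi * 0.01 * np.arange(N))      # stand-in for s(n)
v = rng.standard_normal(N)                       # primary noise source
v_delay = np.concatenate(([0.0], v[:-1]))        # v(n-1)
v1 = v + 0.5 * v_delay                           # noise reaching the speech mic
v2 = v - 0.2 * v_delay                           # reference-mic noise
d = s + v1                                       # noisy speech d(n)

e = nlms_anc(d, v2)
tail = slice(N // 2, None)                       # measure after convergence
err_before = np.mean((d[tail] - s[tail]) ** 2)   # noise power before ANC
err_after = np.mean((e[tail] - s[tail]) ** 2)    # residual noise after NLMS
```

Because the reference noise here is strongly correlated with v1(n), the error output e(n) converges to an estimate of s(n), so err_after should come out well below err_before.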
A relatively new and improved LMS algorithm, proposed in (Górriz et al., 2009) and known as the Constrained Stability Least Mean Square (CS-LMS) filter, is given by

    ω(n + 1) = ω(n) + μ δe(n) δv2*(n) / (ϵ + ‖δv2(n)‖²)

where δe(n) = e(n) − e(n − 1) and δv2(n) = v2(n) − v2(n − 1).

As explained in (Górriz et al., 2009), the CS-LMS and NLMS adaptive filters converge to the optimal Wiener solution, and for a similar step size CS-LMS gives better filtering results than NLMS by improving the EMSE and misadjustment.

3. EXPERIMENTS AND RESULTS

The experiments carried out in this research consist of recognizing speech in a noisy environment while using adaptive filters for noise filtering. The setup used is shown in Figure 2.

Figure 2: Proposed Robust ASR System

The input to the system is noisy speech and the output is the recognized words. The experimental setup is divided into two modules: the adaptive noise cancellation module, which estimates the clean speech from the noisy speech, and the ASR module, which converts the cleaned speech into recognized text.

The VidTIMIT dataset (Sanderson and Paliwal, 2002) is used for training and testing. The dataset consists of 10 pre-recorded English sentences for each speaker; 2 of the 10 sentences are the same for every speaker, while the remaining 8 differ per speaker. There are 43 speakers in the dataset, giving a total of 430 sentences.

Additive White Gaussian Noise, as available in MATLAB, is used to create the noisy speech over the range from +60 dB to −60 dB SNR.

HTK, a Hidden Markov Model (HMM) based ASR toolkit, is used for feature extraction and for training and testing the classifier. A five-state left-right HMM with 3 emitting states is trained using Mel-Frequency Cepstral Coefficient (MFCC) features extracted from the speech signals after noise removal.

Three different weight update equations, i.e. LMS, NLMS and CS-LMS, are used for noise removal. From v(n), a primary noise source, two different noise patterns v1(n) and v2(n) are created which are statistically similar to v(n) but have different parameters: v1(n) is added to the clean speech and v2(n) is passed to the adaptive filter for estimating the noise contained in the noisy speech. For practical reasons the transfer functions of h1 and h2 are chosen as

    H1⁻¹(z) = 1 − 0.3z⁻¹ − 0.1z⁻²
    H2⁻¹(z) = 1 − 0.2z⁻¹

The FIR adaptive filter is modeled using the three LMS algorithms with 5 filter taps and a step size of 0.01. The percentage of correctly recognized words using the three algorithms at different noise levels is shown in Figure 3.

Figure 3: Recognition results (percentage of words correct versus signal-to-noise ratio for the unprocessed noisy speech and the LMS, NLMS and CSLMS filters; step size = 0.01, 5 filter taps)
The results in Fig. 3 show that for the unprocessed noisy speech the percentage of correctly recognized words drops rapidly below a fairly high SNR level, i.e. below 25 dB; at negative SNR levels, the word-correct percentage remains around 21%. With the LMS algorithm, the performance is almost the same as for the noisy speech. This is probably because the determination of the step size μ depends on the maximum eigenvalue of the autocorrelation matrix of the reference input v2(n) to the adaptive filter, and also because of the weak tracking ability of the LMS algorithm. The NLMS algorithm shows fairly good results even at a −10 dB noise level, whereas CSLMS shows the best results, with a much higher accuracy of 60% at an SNR of −20 dB. This is because CSLMS has better tracking ability, owing to its minimal EMSE and misadjustment compared to LMS and NLMS.

4. CONCLUSION

This paper presented robust speech recognition using adaptive noise cancellation. ANC using the LMS, NLMS and CSLMS algorithms was investigated and the word recognition performance compared over a range of SNR levels. This work showed that CSLMS works better than the other algorithms due to its more efficient weight update mechanism. For recognition, an HMM-based classifier with MFCC features was used. The noise type used in this work is Additive White Gaussian Noise, which is the most generic noise type; however, the approach can also be tested on specific noise types such as car, canteen, or fan noise. Various other classifiers, such as LDA, PCA, and hybrid HMM/ANN, can be used, with other features such as PLP, WT and LPC.

REFERENCES:

Choi, E. (2004). Noise robust front-end for ASR using spectral subtraction, spectral flooring and cumulative distribution mapping. 10th Australian International Conference on Speech Science and Technology: 451-456.

Compernolle, D. V. (1992). DSP techniques for speech enhancement. ESCA Workshop on Speech Processing in Adverse Conditions: 21-30.

Crowley, T. and C. Bowern (2010). An Introduction to Historical Linguistics. Oxford University Press.

De Wet, F., B. Cranen, J. de Veth and L. Boves (2001). A comparison of LPC and FFT-based acoustic features for noise robust ASR. INTERSPEECH: 865-868.

Gales, M. J. F. and S. J. Young (1993). Parallel model combination for speech recognition in noise. University of Cambridge, Department of Engineering.

Gales, M. J. F. and S. J. Young (1993a). Cepstral parameter compensation for HMM recognition in noise. Speech Communication, 12: 231-239.

Górriz, J. M., J. Ramírez, S. Cruces-Alvarez, C. G. Puntonet, E. W. Lang and D. Erdogmus (2009). A novel LMS algorithm applied to adaptive noise cancellation. IEEE Signal Processing Letters, 16(1): 34-37.

Hermansky, H. and S. Sharma (1999). Temporal patterns (TRAPS) in ASR of noisy speech. 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1: 289-292.

Kallasjoki, H., C. Magi, P. Alku and M. Kurimo (2009). Noise robust LVCSR feature extraction based on stabilized weighted linear prediction. 13th International Conference on Speech and Computer (SPECOM): 221-225.

Patynen, J. (2009). Feature enhancement in automatic speech recognition. In Studies on Noise Robust Automatic Speech Recognition, K. J. Palomaki, U. Remes and M. Kurimo (Eds.): 35-44.

Raj, B. and R. M. Stern (2005). Missing-feature approaches in speech recognition. IEEE Signal Processing Magazine, 22(5): 101-116.

Rajnoha, J. and P. Pollák (2011). ASR systems in noisy environment: Analysis and solutions for increasing noise robustness. Radioengineering, 20(1): 74-83.

Sanderson, C. and K. K. Paliwal (2002). Polynomial features for robust face authentication. 2002 International Conference on Image Processing, 3: 997-1000.

Smit, P. (2009). Model compensation for noise robust speech recognition. In Studies on Noise Robust Automatic Speech Recognition, K. J. Palomaki, U. Remes and M. Kurimo (Eds.): 45-52.

Zhang, C., J. van de Weijer and J. Cui (2006). Intra- and inter-speaker variations of formant pattern for lateral syllables in Standard Chinese. Forensic Science International, 158(2): 117-124.
