
Applied Acoustics 190 (2022) 108643

Contents lists available at ScienceDirect

Applied Acoustics

journal homepage: www.elsevier.com/locate/apacoust

Spectral warping and data augmentation for low resource language ASR system under mismatched conditions

Mohit Dua a, Virender Kadyan b,⇑, Neha Banthia c, Akshit Bansal a, Tanya Agarwal a

a Department of Computer Engineering, National Institute of Technology, Kurukshetra, India
b Speech and Language Research Centre (SLRC), School of Computer Science, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
c Department of Computer Engineering, Indian Institute of Information Technology, Sonepat, India

a r t i c l e   i n f o

Article history:
Received 10 May 2021
Received in revised form 14 December 2021
Accepted 13 January 2022
Available online 29 January 2022

Keywords:
Children speech recognition
Formant modification
TDNN
TTS
MFCC
FDLP

a b s t r a c t

The performance of an Automatic Speech Recognition (ASR) system deteriorates when it is used on children's speech, due to the large variation and mismatch of acoustic and linguistic variables between the spoken utterances of adults and children. Another important reason for the low efficiency of ASR models is the scarcity of children's speech data for a low resource language like Punjabi. The proposed work in this paper addresses both challenges, i.e. the acoustic and linguistic variation challenge and the data scarcity problem, and thereby improves the performance of a children speech ASR system for the Punjabi language. To handle the first issue of acoustic and linguistic variations, the proposed work uses formant modification as a spectral warping technique to reduce the variation between children's speech and adult speech. Further, to address the second issue of data scarcity, this paper discusses training of ASR models on augmented children speech data. Also, the work combines the well established Mel-Frequency Cepstral Coefficients (MFCC) feature extraction technique with Frequency Domain Linear Prediction (FDLP) to propose an MFCC-FDLP hybrid approach for front end feature extraction. For implementing the data augmentation, Tacotron 2, an end-to-end Text to Speech (TTS) generative model, has been used. The proposed work uses MFCC, FDLP and hybrid MFCC + FDLP techniques for front end feature extraction, Time Delay Neural Network (TDNN) based backend acoustic modeling, and a trigram language model to implement continuous Punjabi language ASR systems. To increase the robustness of the proposed ASR system, we have included a batch of lexically diverse words in our pronunciation model, which achieved a relative improvement of 29.44%.
© 2022 Elsevier Ltd. All rights reserved.

1. Introduction

Speech is considered the primary means of communication between humans. But nowadays communication is not limited to humans; it extends to machines as well. Automatic Speech Recognition (ASR) is a technique used to facilitate interaction between machines and humans. In the modern era, consumers reap the benefits and ease provided by devices that utilize Automatic Speech Recognition. For instance, speech based virtual assistants like Amazon Alexa, Google Assistant and Apple Siri are very popular, offering a wide variety of services like controlling smart home devices and following voice commands to perform different tasks [1,2]. Research on the automation of simple tasks that require human machine interaction has attracted a lot of attention in the last few decades [3,4]. A lot of research, studies and data collection have been done for high resource languages like English, Spanish etc. However, there is a big scope of improvement for low resource and regional languages such as Punjabi [5]. There are many challenges in building ASR systems for regional languages, such as data scarcity, the high cost involved in building transcripts, and acoustic variability.

For effective training of ASR models, a large amount of speech data is required. For low resource languages like Punjabi, ASR performance drops dramatically when the amount of training data is reduced [6]. Collecting such a large amount of data has its own challenges. Data scarcity causes an overfitting problem, since the training data is not sufficient for the ASR system to work properly. Data scarcity can be tackled using data augmentation, which is a very popular method to incorporate speaker/acoustic variability into training speech data to increase the robustness of ASR systems. Further, inter speaker variability such as age, gender, accent, speaking rate and formant frequencies of the speakers also poses difficulties.

⇑ Corresponding author.
E-mail addresses: [email protected] (M. Dua), [email protected] (V. Kadyan).

https://doi.org/10.1016/j.apacoust.2022.108643
0003-682X/© 2022 Elsevier Ltd. All rights reserved.

Data augmentation also handles the speaker variability problem by generating multiple readings of the training utterances from different speakers. There are various existing data augmentation techniques such as Generative Adversarial Networks (GAN) [7], prosody modification [8] and spectrogram augmentation [9]. Data augmentation is widely categorized into in-domain and out-domain techniques. In-domain data augmentation is done by using voice conversion (VC) to alter the acoustic attributes. It includes techniques like GAN [7], Vocal Tract Length Perturbation [10] and Stochastic Feature Mapping (SFM) [11]. Out-domain data augmentation uses unseen utterances to enhance the existing dataset, for example speed/prosody modification [8]. To make the ASR model recognize acoustically and lexically diverse utterances, TTS (end-to-end speech synthesis) is used to include unseen utterances in the training data, resulting in a more robust recognizer.

There is a lot of research on ASR for the Punjabi language using adult speech corpora, exploring feature extraction techniques like MFCC [12], Perceptual Linear Prediction (PLP) [13] and FDLP [14], acoustic modeling methods such as the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) and/or Deep Neural Networks (DNN) [15], and language modeling approaches like tri-phone and mono-phone phoneme models and N-grams. Most of the publicly available ASR systems work well with Punjabi adults' speech, i.e., high-SNR (Signal to Noise Ratio) speech [16]. However, in the case of low-SNR speech or children's speech, the system performance collapses. These mismatched conditions occur due to training on adults' speech and then testing on children's speech [17]. The difference in vocal tract dimensions between adults and children is the root cause of this mismatch [18]. Kathania et al. proved that formant frequencies decrease with the increase in vocal tract length [17]. The range and amount of change in formant frequencies are smaller among older age groups than younger age groups [19]. Hence, issues such as linguistic variations and data scarcity still pose a challenge for a low resource language such as Punjabi.

Motivated by these issues, the work in this paper contributes by using formant modification as a spectral warping technique for handling the linguistic variation problem, and by training ASR models on augmented children speech data for addressing the data scarcity problem. Also, for front end feature extraction, the work proposes a hybrid approach that combines the well established Mel-Frequency Cepstral Coefficients (MFCC) feature extraction technique with Frequency Domain Linear Prediction (FDLP), i.e., the MFCC-FDLP hybrid approach. For implementing the data augmentation, Tacotron 2, an end-to-end Text to Speech (TTS) generative model, has been used. The proposed continuous Punjabi language ASR system is developed using Time Delay Neural Network (TDNN) based acoustic modeling and a trigram based language model. Further, the work includes a batch of lexically diverse words in the pronunciation model, which increases the robustness of the proposed ASR system.

The remainder of the paper is organized as follows: Section 2 discusses the literature survey. Section 3 describes the fundamentals of Tacotron 2 and formant modification. Section 4 discusses the proposed architecture. Section 5 discusses the experimental setup and analyses the results, followed by Section 6 that concludes the proposal.

2. Literature survey

ASR systems require a large amount of training data to work reasonably well. High resource languages like English, Spanish and Chinese have many state of the art speech corpora for effective training of ASR systems. However, such a large amount of data for regional languages like Hindi, Punjabi etc. is not available, and still only a few researchers are working on them [20]. In fact, it is estimated that the speech corpus required to train an ASR system is available for only about 1% of the world's languages [21].

Potamianos et al. [22] performed one of the oldest works in the field of child ASR models. They modeled age-dependent acoustic variability, which reduced WER (Word Error Rate) by 10%. Shahnawazuddin et al. [23] explored prosody modification to map the mismatch between adult speech and children's speech. The experiment resulted in an improvement over the baseline system of approximately 50%. Kathania et al. [17] proposed a study that highlighted the large variation and mismatch in acoustic and linguistic attributes between children's and adults' speech. The paper implemented a linear predictive coding (LPC) based formant modification [24] method to reduce the difference between adults' and children's speech, which in turn improved the performance of children speech ASR systems. The proposed technique improved the system performance over a hybrid DNN-HMM baseline model, vocal tract length normalization (VTLN) and speaking rate adaptation (SRA). The proposed method was also tested on noisy channels.

It is a challenging task to collect a traditional text-to-speech corpus for a low-resource language. It is difficult to synthesize speech from "found" data [25], which is why Cooper proposed using various sources of available "found" data. Oord et al. [26] introduced Wavenet, a deep neural network of time domain waveforms, which is widely used in complete text-to-speech synthesis models. Their paper implemented Wavenet in TTS systems for the English and Mandarin languages. Li et al. [27] proved that the correct ratio of synthetic data to natural data can also improve the results. The authors accomplished their work via Global Style Tokens (GST). The improvements were made by increasing the depth of the recognition network and through hyper parameter tuning. Acoustic data perturbation, semi-supervised training, multilingual processing and speech synthesis are some of the techniques for data augmentation proposed by Ragni et al. [28]. Recently, Rosenberg et al. [29] used the Tacotron speech synthesis architecture to improve the performance of a speech recognition system. The work used speech synthesis to enhance acoustic diversity and lexical diversity by synthesizing training data with different speaker characteristics and by creating new training utterances, respectively. The proposed study was implemented on two corpora from distinct domains for a high resource language, i.e., English.

In the last few years, some researchers have explored different possibilities to improve the performance of Punjabi ASR by improving traditional feature extraction techniques and using advanced backend models. Guglani and Mishra [16] explored various feature extraction techniques like MFCC and PLP and compared their performance on a Punjabi dataset. The paper also experimented with mono-phone and tri-phone models using the N-gram language model and performed a comparative study to find the best alternative. Gerosa et al. proposed several methods for speaker adaptive acoustic modeling to cope with inter speaker variability and make the ASR model more efficient [30]. The paper claimed that the vocal-tract length normalization (VTLN) method with constrained MLLR based speaker normalization (CMLSN) performs better than other methods. Kadyan et al. [31] discussed a comparative study of deep neural network based Punjabi-ASR systems. The authors experimented with robust feature extraction techniques to bring high performance to Punjabi speech recognition systems. Kadyan et al. [32] also reduced acoustic mismatch by working on VTLN, explicit pitch and duration modification. All three explored techniques proved to be effective.

Further, Shen et al. [33] implemented Tacotron 2, a neural network architecture for text to speech synthesis. Their model combines a sequence-to-sequence Tacotron style model that generates mel spectrograms with a modified Wavenet vocoder. The proposed model can be trained directly on data and can significantly improve the performance of ASR models by producing state-of-the-art sound quality that is very close to natural human speech. TTS produces significant results in producing synthetic speech that sounds almost like human speech. However, to train single-speaker TTS models, abundant training data from a professional voice talent is required [34]. Deng et al. [35] proposed in their study that multi-speaker models can outperform single-speaker models when large amounts of single-speaker training data are not available.

The front end feature extraction also plays an important role in the implementation of an ASR system. For many years, MFCC remained the only choice for developing ASR systems. However, in the last two decades many other feature extraction methods such as GFCC [20] and FDLP [14] have shown their presence in the ASR field. The FDLP method was first introduced by Herre et al. [36] as a method for efficient coding of transients in transform coders. In [14], Athineos et al. proposed a novel representation of the temporal envelope in different frequency bands by exploring the dual of the conventional linear prediction method. With this technique of frequency-domain linear prediction (FDLP), the 'poles' of the model describe temporal peaks. In [37], Thomas et al. highlight a feature extraction technique which utilizes short-term spectral envelope and modulation frequency features. These features are derived from the sub-band temporal envelopes of the speech estimated using the FDLP method. Also, recently, researchers have shown improvements in the performance of ASR systems by using hybrid features for front end feature extraction [20].

These recent speech synthesis advancements can also help in improving the performance of a low resource language such as Punjabi. In this paper, a small self-created children speech corpus of the Punjabi language has been used. Rosenberg et al. [29] found that improvements in speech recognition performance are achievable by augmenting training data with synthesized material. Their results show a relative WER gain of 4% using Tacotron 2 on the LIBRISPEECH corpus [38]. Motivated by that work, we have implemented Tacotron 2 to increase the robustness of the ASR system. With ample amounts of training data, overfitting and acoustic variability issues are tackled. Further, most ASR models are trained on adults' speech and do not prove to be robust on children's speech. There is a big difference in acoustic variability between children's speech and adults' speech. Kathania et al. [17] indicated an improvement of 27% in children ASR system performance using formant modification tested on PF-STAR [39]. This inspired us to apply formant modification in our ASR system for children speech. Hence, the proposed work in this paper is to build ASR models on children's speech of the Punjabi language using augmented data as well as formant modification.

3. Preliminaries

This section discusses the fundamentals of Tacotron 2 and formant modification used in the implemented work to improve the performance of the Punjabi ASR system.

3.1. Tacotron 2

Neural networks are not smart to begin with, as a poorly trained neural network is not able to produce the desired results. This happens due to the lack of adequate training data. Hence, to expand the available dataset, minor alterations are made to the existing training data. This process is called data augmentation. Data augmentation enables us to add relevant data, which is related to the way in which neural networks learn: a neural network keeps becoming better as we feed more data to it. TTS (Text to Speech) is used as a data augmentation process for expanding the speech dataset. Through this, we can first train our TTS model and then supply text to it to get the corresponding audio. For data augmentation, several techniques are available like GAN, prosody modification, spectrogram augmentation etc. Tacotron 2, used in the proposed work, is also a TTS technique. The most important advantage of using Tacotron 2 is that it is able to generate natural sounding speech directly from text. We do not need to train it by providing complex acoustic and linguistic features as input. It incorporates the ideas of Tacotron [40] and Wavenet [26]. It works by mapping a sequence of characters to a sequence of features that encode the audio sample. It uses a sequence-to-sequence model that optimizes the text-to-speech process. The features are an 80-dimensional audio spectrogram with frames computed every 12.5 ms. These features capture word pronunciation as well as characteristics of human speech like speed, volume and intonation. It then uses a Wavenet-like structure to convert these features to a 24 kHz waveform [26].

Fig. 1 shows the system architecture of Tacotron 2. It has two parts: (1) a recurrent sequence-to-sequence feature prediction network, which predicts a chain of mel spectrogram frames from an input character sequence; and (2) an improved Wavenet that generates time-domain waveforms based on the mel spectrograms. In other words, a sequence-to-sequence Tacotron style model generates mel spectrograms, and a modified Wavenet vocoder utilizes these spectrograms to generate the time-domain waveform pertaining to the audio sample.

Fig. 1. Tacotron 2 system architecture [40].

The first part of Tacotron 2 is also called the encoder, whose first layer is the embedding layer with 512 dimensional vectors. The output of this first layer is directed to a block of three one-dimensional convolution layers having 512 filters with a length of 5. The next block in the encoder consists of a bidirectional long short-term memory (LSTM). The second part of Tacotron 2 is called the decoder. At each decoding step, "attention" forms the context vector and updates the attention weights. The context vector cv_i is the product of the encoder's outputs (eo) and the attention weights (w). It is mathematically expressed as:

cv_i = \sum_{j=1}^{T} w_{ij} \, eo_j    (1)

The attention weights are calculated using the following formula:

w_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}    (2)

Here, e_{ij} represents the energy. For calculating the energy, we use:

e_{ij} = w^{T} \tanh(A p_{i-1} + B \, eo_j + C \, l_{ij} + d)    (3)

where p_{i-1} represents the previous hidden state of the LSTM network, eo_j is the j-th hidden encoder state, and A, B, C, d are trained parameters. The variable l_{ij} represents the location signs, which are calculated as:

l_i = F * w_{i-1}    (4)

where F denotes the convolution operation and w_{i-1} is the previous attention weight vector.
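To make Eqs. (1)–(4) concrete, the following Python/NumPy fragment is a minimal sketch of one step of location sensitive attention. It only illustrates the computation; the array names, layer sizes and the use of a single 1-D location filter are illustrative assumptions and do not reproduce the exact Tacotron 2 implementation.

    import numpy as np

    def attention_step(enc_out, prev_state, prev_weights, A, B, C, d, w, F):
        # enc_out: (T, enc_dim) encoder states eo_j; prev_state: previous decoder
        # LSTM state p_{i-1}; prev_weights: previous attention weights w_{i-1}
        loc = np.convolve(prev_weights, F, mode="same")          # Eq. (4): l_i = F * w_{i-1}
        hidden = np.tanh((A @ prev_state)[:, None]               # argument of Eq. (3)
                         + B @ enc_out.T
                         + C[:, None] * loc[None, :]
                         + d[:, None])
        energies = w @ hidden                                    # energies e_ij over positions j
        weights = np.exp(energies - energies.max())              # Eq. (2): softmax
        weights /= weights.sum()
        context = weights @ enc_out                              # Eq. (1): context vector cv_i
        return context, weights

    # toy shapes: 40 encoder frames, 256-dim encoder states, 128-dim attention space
    rng = np.random.default_rng(0)
    ctx, att = attention_step(rng.standard_normal((40, 256)), rng.standard_normal(1024),
                              np.ones(40) / 40, rng.standard_normal((128, 1024)),
                              rng.standard_normal((128, 256)), rng.standard_normal(128),
                              rng.standard_normal(128), rng.standard_normal(128),
                              np.ones(31) / 31)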

3.2. Formant modification

The performance of an ASR system degrades if the training and testing conditions are different. One of these mismatch conditions occurs when we train our ASR system with adults' speech and test it on children's speech. The vocal tract sizes of adults and children are different, and children's speech is low-SNR speech. Due to this, the formants and other important features of speech are lost between training and testing, which in turn drops the accuracy of the ASR system.

A formant is the broad spectral maximum that results from an acoustic resonance of the human vocal tract. It is usually defined as a broad peak, or local maximum, in the spectrum. Studies show the change in formant frequencies across people of different age groups [17]. Since the length of the vocal tract is inversely proportional to the formant frequencies, an increase in vocal tract length decreases the formant frequencies.

Formant modification uses warping of the LP (Linear Prediction) spectrum. The resulting LP spectrum is denoted by R_α(f). It is obtained when we apply the warping function w_α(f) on the original LPC (Linear Predictive Coding) spectrum, denoted by R(f). Here, α is the warping factor.

R_\alpha(f) = R(w_\alpha(f))    (5)

An estimate of the speech signal, denoted by \hat{r}(m), is obtained as a linear combination of the M sample values obtained before; this is the classical working of the LPC method [24]:

\hat{r}(m) = \sum_{j=1}^{M} a_j \, r(m - j)    (6)

Then we take its Z-transform:

\hat{R}(z) = \left( \sum_{j=1}^{M} a_j z^{-j} \right) R(z)    (7)

Here, the z^{-j} are unit delay filters and the a_j are the LPC filter coefficients. We use these to calculate the LPC spectrum. The unit delay filter is replaced by an all-pass filter F(z) to warp the LPC spectrum. We use a first order all-pass filter, which is given by

F(z) = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}    (8)

Here, α is the warping factor, which lies in the range of -1 to 1:

-1 < \alpha < 1    (9)

The warped frequency scale matches the psycho-acoustic scale for a proper value of α, which is based on auditory perception. The formants can be shifted systematically by applying the warping function to the LPC coefficients. If α is positive, the formant frequencies shift to lower frequencies. The residual r(m) - \hat{r}(m) and the modified LPC coefficients are then used by a standard LPC synthesizer to synthesize the speech signal, which is called the formant modified signal. This signal is used as input to the ASR system.

4. Proposed approach

This section describes the architecture of the proposed approach. Fig. 2 shows the architecture of the proposed ASR system. The Punjabi children speech corpus is split into two sets, train and test, with 80% in training and the rest in testing. The training speech data along with its transcriptions is used to train Tacotron 2. Three types of augmented datasets are synthesized – original, random and sampled. Formant modification is applied on the adult speech data to reduce acoustic variabilities. This formant modified adult data is combined with the augmented data, the training adult data and the training child data to serve as training data for our ASR system. The MFCC technique is used to extract features that, along with the training speech data, are input to the acoustic model. FDLP is a representation of the temporal envelope in different frequency bands obtained by exploring the dual of conventional linear prediction (LPC) applied in the transform domain [14]. The proposed approach fuses FDLP and MFCC features; the hybrid features generate better results. The acoustic model generates mono-phones which are transformed to tri-phones. They are passed to the decoder. The pronunciation model, also known as the lexicon, is created by linguists. It matches phones to words and outputs the probability of the possible words. The language model uses these words to select the one which generates proper meaning in the context of the sentence. Speaker adaptive training is done to make the model robust against unseen speakers. These models work together with the decoder to perform speech recognition. This generates our ASR system, which is tested on the testing part of the children speech corpus. The following subsections describe these steps in detail.

4.1. Data augmentation

The proposed system uses a TTS model based on Tacotron 2. Fig. 3 shows the architecture of the Tacotron 2 model used in the proposed system implementation. As described earlier, Tacotron 2 is a combination of an encoder-decoder network with an attention mechanism, and a Wavenet based vocoder. It takes as input a sequence of text in the Punjabi language, which is encoded by the encoder. In the first part of the encoder, the character sequence is converted into a word embedding vector. The input text sequence embedding is encoded by 3 convolution layers, each containing 512 filters of shape 5 × 1, followed by a bidirectional LSTM layer of 250 units for each direction. Tacotron 2 uses 'location sensitive attention', which takes the encoder output as input and tries to summarize the full encoded sequence as a fixed length context vector for each decoder output step.

The decoder is an autoregressive recurrent neural network which predicts a mel spectrogram from the encoded input sequence one frame at a time. The output of the attention layer is passed through a small pre-net containing 2 fully connected layers of 256 hidden ReLU (Rectified Linear Unit) units. The pre-net output and the attention context vector are concatenated and passed through a stack of 2 unidirectional LSTM layers with 1024 units. The output of the LSTM layer is projected through a linear transform to predict the target spectrogram frame.

The predicted mel spectrogram is passed through a 5-layer convolutional postnet, each layer composed of 512 filters of shape 5 × 1 with batch normalisation, followed by tanh activation on all but the final layer. The postnet predicts a residual to add to the prediction to improve the overall reconstruction. Finally, the mel spectrogram is transformed into time domain waveforms by the modified Wavenet vocoder. The mel spectrograms are mapped to a fixed-dimensional embedding vector, known as a deep speaker vector (d-vector). These d-vectors are frame-level speaker discriminative features that represent the speaker characteristics. Algorithm 1 gives the pseudo code for the data augmentation process used in the proposed work. The proposed system uses the following three different approaches to generate d-vectors for inference to handle speaker diversity in the synthesized data (a minimal sketch of the three modes is given after the list).

• Original: In this case, the d-vector is derived from the training utterance itself. If the synthesized utterances are identical to the source, that implies perfect synthesis.
• Sampled: Here, we use a d-vector from some other utterance that was used during training. In this case, the speaker representations will have been seen by the synthesizer, but the source utterance and the synthesized utterance will have different speaker characteristics.
• Random: The d-vector is generated as a random 256-dimensional vector, which is then projected onto the unit hypersphere via L2 normalization. Random sampling is effective when d-vectors are evenly distributed.
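A minimal Python sketch of the three d-vector modes is given below. The 256-dimensional size and the L2 projection follow the description above; the dictionary of training d-vectors and the helper name are assumptions made only for illustration.

    import numpy as np

    def pick_dvector(mode, train_dvectors, utt_id, rng=np.random.default_rng(0)):
        # mode: 'original', 'sampled' or 'random'; train_dvectors maps
        # training utterance id -> 256-dim speaker embedding
        if mode == "original":
            return train_dvectors[utt_id]                 # embedding of the source utterance itself
        if mode == "sampled":
            other = rng.choice([k for k in train_dvectors if k != utt_id])
            return train_dvectors[other]                  # seen speaker, different utterance
        dvec = rng.standard_normal(256)                   # 'random' mode
        return dvec / np.linalg.norm(dvec)                # project onto the unit hypersphere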


Algorithm 1: Data augmentation

Input: Audio wav files + metadata
Code:
    // Initialisation
    // Training samples: audio wav files + metadata used to train Tacotron 2
    // Initialise hyper-parameters in the hparams file
    encoder_n_convolutions = 3
    // Encoder
    for i in range(encoder_n_convolutions):
        conv_layer = initialise i-th convolution layer
    // A bidirectional LSTM layer is created
    // Decoder: in each iteration, produce a mel spectrogram frame
    // which is fed back to its input
    while True:
        decoder_inputs = Prenet(decoder_input)
        decoder_outputs = Decode(decoder_inputs)
        gate_output = decoder_outputs[1]
        if sigmoid(gate_output) > gate_threshold:
            break
        decoder_input = decoder_outputs[0]
    // The loop stops when a given threshold for the stop token is reached
    // Both the encoder and the decoder use LSTM layers
    // Postnet layer
    postnet_n_convolutions = 5
    convolution_list = []
    for i in range(postnet_n_convolutions):
        convolution_list.append(convolution layer)

4.2. Formant modification

Formant modification is used to lessen the difference between children's speech and adults' speech. LPC based formant modification produces an improvement of 27% compared to a DNN baseline [17]. Therefore, we have used this process in our proposed approach. It is carried out by warping the LP spectrum. LP analysis is performed on the speech signal to obtain the LPC coefficients and the LP residual. The LPC coefficients are given as input to the warping function to obtain the modified LPC coefficients. The LP residual and the modified LPC coefficients then go through LP synthesis to generate the formant modified speech signal.

Algorithm 2: Formant modification

Input: Speech waveform
Code:
    // Load a speech waveform
    rd = Read(speech_waveform)
    wt = create output file
    for line in rd:
        [sampled_data, sampled_rate] = audioread(line)
        // Fit LPC to short time segments:
        // 'x' is a stretch of signal; fit order 'p' LPC models.
        // Return the successive all-pole coefficients as rows of 'a',
        // the per-frame gains in 'g' and the residual excitation in 'e'
        [a, g, e] = lpcfit(sampled_data, no_LPC_model)
        // Choose a warping factor (alpha) between -1 and 1
        alpha = 0.1
        // warpoles warps an all-pole polynomial by substitution;
        // it is defined by a first-order warp factor alpha.
        // Negative alpha shifts poles up in frequency
        [B, A] = warpoles(a, alpha)
        // lpcsynth resynthesizes from the LPC representation:
        // each row of 'a' is an LPC fit to an h-point frame of data,
        // 'e' is the excitation signal; it returns the resulting resynthesis
        dw = filter(B(1,:), 1, lpcsynth(A, g, e))
        // soundsc scales an audio signal and plays it at the given sample rate
        soundsc(sampled_data, sampling_rate)
        soundsc(dw, sampling_rate)
        write(dw, sr) in output file
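As a rough Python counterpart of Algorithm 2 and Eq. (8), the sketch below fits an LPC model to a single frame, moves each pole along the first-order all-pass warping curve while keeping its radius, and resynthesises the frame from the LP residual. It is only an approximation of the warping step described above: the frame handling, gain matching and the exact pole-warping routine of the referenced MATLAB tools (lpcfit/warpoles/lpcsynth) are simplified, and the LPC order and alpha value are illustrative.

    import numpy as np
    from scipy.linalg import toeplitz
    from scipy.signal import lfilter

    def lpc_autocorr(frame, order=12):
        # LPC by the autocorrelation method; returns A(z) = 1 - sum_j a_j z^-j
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])
        return np.concatenate(([1.0], -a))

    def warp_formants(frame, alpha=0.2, order=12):
        # positive alpha moves formants towards lower frequencies (Section 3.2)
        A = lpc_autocorr(frame, order)
        residual = lfilter(A, [1.0], frame)                          # LP residual
        poles = np.roots(A)
        ang, rad = np.angle(poles), np.abs(poles)
        shift = 2.0 * np.arctan2(alpha * np.sin(ang), 1.0 - alpha * np.cos(ang))
        A_warp = np.real(np.poly(rad * np.exp(1j * (ang - shift))))  # warped all-pole filter
        return lfilter([1.0], A_warp, residual)                      # formant modified frame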


Fig. 2. Proposed ASR system architecture.

4.3. Feature extraction

In the feature extraction step, features are extracted from the formant modified speech signal using a fusion of FDLP (Frequency Domain Linear Prediction) and Mel-frequency cepstral coefficients (MFCC). The speech corpus is fed to the A/D converter to digitize it, and the amount of energy in the high frequencies is boosted in the pre-emphasis phase. Windowing is responsible for slicing the audio waveform into sliding frames. The DFT (Discrete Fourier Transform) and the mel filter bank are responsible for bringing out information in the frequency domain and mapping the measured frequency, respectively. The log of the power spectrum obtained from the mel filter bank is taken, and then, in the cepstrum, the glottal source and the filter are separated. On the other hand, FDLP is also used for feature extraction, where modulation feature extraction is employed for long term modulation features. Later, the efficiency of these features is explored on the proposed formant modification based data augmentation approach. Our proposed method uses the sharpness of the FDLP poles, which takes the location of the sub-band temporal envelope poles into account, and mainly focuses on the amplitude of the sub-band time-frequency envelopes. In FDLP as well as in MFCC, 39 coefficients are generated by expanding the first 13 coefficients. The first 13 coefficients are static features and the others are dynamic features generated by taking the first (Δ) and second order (ΔΔ) derivatives, known as delta and delta-delta features, respectively. The individual 39 FDLP features as well as the 39 MFCC features convey rich information. In each approach, the initial 13th parameter is the energy in each frame, which is used for identifying phones. The combination of MFCC and FDLP brings improvement in the ASR system.

4.4. Acoustic modeling

The baseline system has been developed using TDNN acoustic modeling. A TDNN is a Time Delay Neural Network. It converges fast and is particularly useful when training data is limited. It uses sub-sampling to exclude duplicate weights. It is independent of the relationship between the number of sequence steps and the length of the input. Since the duplicate updates are reduced, the amount of training required drops [41,42].

The nodes and weights in a TDNN are updated only when sub-sampling is used. Some inputs in the hidden layers are not connected, thus providing space between the frames. If an interval between frames is allowed, the model can learn all input features because the TDNN has a long context going up to the upper layers [31].

Acoustic modeling in Kaldi is a pipeline process. Firstly, GMM-HMMs are used to form a context independent acoustic model. The acoustic models are trained using the extracted MFCC features, including the 13 static and the other delta and delta-delta features. Secondly, this model is used to train another GMM-HMM model called the tri1 model. This tri1 model is a stronger, context dependent acoustic model which can be used for training more complex models. The tri1 model is converted to tri3 using the best alignments. This sets the baseline for training TDNN based models using Kaldi. The TDNN produces phone sequences which are given to the decoder for speech recognition.

Triphone based models perform better since articulation depends on the phones before and after too. The acoustic realizations of a phoneme can occur as a result of coarticulation beyond the word boundaries.
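The frame splicing with sub-sampling that gives a TDNN layer its long temporal context can be sketched in Python as follows; the (-2, 0, 2) context offsets, the layer size and the random weights are illustrative assumptions, since in practice the affine parameters are learned.

    import numpy as np

    def tdnn_layer(x, W, b, offsets=(-2, 0, 2)):
        # x: (n_frames, in_dim) features; W: (out_dim, in_dim * len(offsets)); b: (out_dim,)
        n = x.shape[0]
        lo, hi = -min(offsets), max(offsets)
        out = []
        for t in range(lo, n - hi):
            spliced = np.concatenate([x[t + o] for o in offsets])   # sub-sampled context window
            out.append(np.maximum(W @ spliced + b, 0.0))            # shared affine transform + ReLU
        return np.asarray(out)                                      # (n - lo - hi, out_dim)

    # example: 13-dim MFCC frames, 64 hidden units
    rng = np.random.default_rng(0)
    hidden = tdnn_layer(rng.standard_normal((100, 13)), rng.standard_normal((64, 39)), np.zeros(64))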

Fig. 3. Block diagram represents generation of TTS through external data augmentation approach.
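Relating back to the front end of Section 4.3, the 39-dimensional MFCC stream and its fusion with FDLP features can be sketched as below. The librosa calls cover only the MFCC part; the FDLP extractor is left as a named placeholder, since a full sub-band FDLP implementation is outside the scope of a short sketch, and the frame-wise concatenation is an assumed fusion strategy.

    import numpy as np
    import librosa

    def mfcc_39(y, sr):
        # 13 static MFCCs plus delta and delta-delta -> 39 coefficients per frame
        static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return np.vstack([static,
                          librosa.feature.delta(static),
                          librosa.feature.delta(static, order=2)])

    def fdlp_features(y, sr):
        # placeholder for a sub-band FDLP modulation feature extractor
        raise NotImplementedError

    def hybrid_mfcc_fdlp(y, sr):
        m, f = mfcc_39(y, sr), fdlp_features(y, sr)
        n = min(m.shape[1], f.shape[1])                  # align frame counts
        return np.vstack([m[:, :n], f[:, :n]])           # fused feature vectors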

4.5. Speaker adaptive training

Speaker Adaptive Training (SAT) [43] is used to reduce inter speaker variability. Variability in speaker independent acoustic models is attributed to both phonetic variation and variation among the speakers of the training population, the latter being independent of the information content of the speech signal. These two variation sources are decoupled, which helps in making the model suitable for unseen speakers as well.

4.6. Lexicon

Statistical modeling requires a sufficient number of examples to get a good estimate of the relationship between the speech input and the parts of words. The pronunciation lexicon models the sequence of phones of a word. A phone is a basic sub-word unit that makes up a word.

The pronunciation model uses Markov chains. The HMM model aligns phones with the observed audio frames using self-looping. This provides flexibility in handling time-variance in pronunciation. The pronunciation dictionary is written by human experts. The pronunciation of words is typically stored in a lexical tree, a data structure that allows us to share histories between words in the lexicon. Phones are not homogeneous: the amplitudes of the frequencies change from the start to the end. Also, variations in gender, pitch, accent, age etc. change the way a person utters. Therefore, this model gives the likelihood of the words to the decoder but cannot detect the exact word. In our case, we have collected Punjabi speech data from children of the age group 6–13 years with a mix of female and male children. The training data is expanded using data augmentation so that all phonemes of the language are covered. The speech data is digitally recorded with a 16 kHz sampling frequency. To enhance lexical diversity, we have supplied additional words in our pronunciation dictionary and compared its performance.

4.7. Language model and decoder

The language model plays an important role in speech recognition. It assigns a probability estimate to word sequences and defines what the speaker may say, the vocabulary, and the probability over possible sequences, by training on some texts. The trigram model is used in our current system. The idea behind the trigram model is to truncate the word history to the last 3 words, and therefore approximate the history of the word. The decoder receives all the outputs from the acoustic, lexicon and language models. It then uses these outputs to recognize the word spoken.
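The truncation of the word history to the last two preceding words can be illustrated with a toy maximum-likelihood trigram estimate in Python; this unsmoothed sketch over an invented two-sentence corpus is only for illustration and is not the language model toolkit used in the experiments.

    from collections import Counter

    def train_trigram(sentences):
        # P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
        tri, bi = Counter(), Counter()
        for s in sentences:
            words = ["<s>", "<s>"] + s.split() + ["</s>"]
            for i in range(len(words) - 2):
                bi[tuple(words[i:i + 2])] += 1
                tri[tuple(words[i:i + 3])] += 1
        return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

    p = train_trigram(["the boy reads a book", "the boy writes a letter"])
    print(p("the", "boy", "reads"))   # 0.5 on this toy corpus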

5. Experimental setup & result analysis

The Kaldi toolkit is used to implement the modeling classifiers. Table 1 describes the details of the dataset used. The adult data set consists of isolated word, continuous, and long-contextual sentence types spoken by 42 speakers of the age group 17–28 years. The child data set is composed of continuous sentence types spoken by 66 speakers of the age group 5–13 years in a clean environment. Both data sets have been recorded in the same .wav format. The total duration of the adult data set recording is 14 h and 36 min, whereas the duration of the child data set recording is 11 h and 54 min.

Table 1
Database description.

Specification | Adult Speech Corpora | Children Speech Corpora
No. of speakers | 42 | 66
Recording parameters | .wav file using mono-channel | .wav file using mono-channel
Recording environment | Studio, microphone | Dictaphone, microphone, with both open and closed environment
Sentence types | Isolated words, continuous, and long-contextual | Continuous
Age group | 17–28 years | 6–13 years
Total hours | 14 h and 36 min | 11 h and 54 min
Gender | 19 male and 23 female | 32 male and 34 female

It has already been estimated by some earlier proposed research works that using a speech data size between 10 and 40 h to train a baseline Tacotron produces good synthesis [44]. The proposed system uses a K-fold cross validation technique. The children speech corpus is divided into train and test sets with 80% in training, i.e. Train_Child, and the remaining 20% in testing, i.e. Test_Child. Formant modification is applied on the adult speech corpus, i.e. Original_Adult, and the resultant dataset is given the alias Formanted_Adult.

The system performance is calculated using the Word Error Rate (WER) metric, which uses the concepts of "Percentage Correct (PC)" and "Percentage Accuracy (PA)". Percentage Correct (PC) gives the word correction rate and Percentage Accuracy (PA) gives the word accuracy rate. Eqs. (10)–(12) define these metrics, where N = number of words in the test set, D = number of deletions, S = number of substitutions, and I = number of insertions.

PC = \frac{N - D - S}{N} \times 100    (10)

PA = \frac{N - D - S - I}{N} \times 100    (11)

Word Error Rate (WER) = 100\% - PA    (12)

In the proposed approach, a combination of feature vectors is created before classification. To deal with inter-speaker variability, the features are processed in multiple phases. In the first phase, mono-phone (mono) models are produced for the corresponding training samples. In the second phase, tri-phone models are used for the computation of delta features (tri1) and delta-delta features (tri2), which helps in the production of 13-dimensional features across 4 frames. As a result, 117 dimensional vectors are generated. Linear discriminant analysis (LDA) [45] and Maximum likelihood linear transformation (MLLT) [46] estimation (tri3) are applied to reduce the dimensions from 117 to 30. To normalize inter speaker variability, global fMLLR [47] (Feature space Maximum Likelihood Linear Regression) is used so that the reduced dimensions are aligned.

5.1. Performance analysis of baseline system on varying front-end and modeling approaches

The baseline system uses the original adult speech corpus along with the training part of the child speech corpus as training data and a TDNN based acoustic model. The comparative analysis of DNN-based acoustic models and TDNN-based acoustic models is demonstrated in Table 2. In addition to this, the effect of mismatched conditions is tested with different combinations of adult speech and children speech.

It can be observed from the results that, when we train our ASR system on adult speech and test it on children speech, the ASR system performs poorly. Using children's data for training improves the system performance by a huge margin. When we combine adult as well as children's data for training, we see an increase in the accuracy of the system. This is mainly attributed to the availability of a good amount of training data, since using only children's data for training is not sufficient and causes overfitting, thus degrading the performance. Further, TDNN based acoustic models prove to be better on every combination of training data than the DNN based acoustic models. A relative gain of approximately 32% is obtained on using TDNN instead of DNN while training on Original_Adult and Train_Child.

5.2. Performance analysis using data augmentation

In the proposed work, augmented data is added to the adults' and children's data for improving performance. This is done for controlling the speaker diversity in the synthesized data. The proposed system performs sampling over Tacotron_Child to generate the original, random and sampled datasets, where Tacotron_Child is the dataset of synthesized utterances received from the Tacotron 2 model. In the sampled data augmentation case, the d-vector is generated from some utterance seen during training, but the source utterance and the synthesized one have some changes in speaker characteristics. These changes in the acoustic characteristics, along with the increase in dataset size, lead to better performance.

Table 3 describes the performance of our ASR system after data augmentation is applied. It is evident from the results that data generated using the sampling technique gives the best results. WER is reduced to 6.94% from 7.98% using data augmentation through sampled data. This gives a relative improvement of 13% compared to our baseline system.
Table 2
Performance analysis of baseline system.

Training Type | Testing Type | WER (%) using DNN | WER (%) using TDNN
Train_Child | Test_Child | 12.73 | 9.18
Original_Adult | Test_Child | 38.75 | 36.21
Original_Adult + Train_Child | Test_Child | 11.76 | 7.98

Table 3
Performance analysis using data augmentation.

Dataset | Augmentation | WER (%)
Original_Adult + Train_Child + Tacotron_Child | Original | 7.02
Original_Adult + Train_Child + Tacotron_Child | Random | 7.56
Original_Adult + Train_Child + Tacotron_Child | Sampled | 6.94
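For completeness, the metrics of Eqs. (10)–(12) can be computed from an edit-distance alignment of the recognised word sequence against the reference. The Python sketch below counts substitutions, deletions and insertions with a standard dynamic programme; the example word lists are hypothetical.

    def wer_scores(ref, hyp):
        # returns (PC, PA, WER) in percent for reference/hypothesis word lists
        n, m = len(ref), len(hyp)
        dp = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]   # (cost, S, D, I)
        dp[0] = [(j, 0, 0, j) for j in range(m + 1)]             # j insertions
        for i in range(1, n + 1):
            dp[i][0] = (i, 0, i, 0)                              # i deletions
            for j in range(1, m + 1):
                c, s, d, ins = dp[i - 1][j - 1]
                sub = (c, s, d, ins) if ref[i - 1] == hyp[j - 1] else (c + 1, s + 1, d, ins)
                c, s, d, ins = dp[i - 1][j]
                dele = (c + 1, s, d + 1, ins)
                c, s, d, ins = dp[i][j - 1]
                inse = (c + 1, s, d, ins + 1)
                dp[i][j] = min(sub, dele, inse)
        _, S, D, Ins = dp[n][m]
        pc = (n - D - S) / n * 100          # Eq. (10)
        pa = (n - D - S - Ins) / n * 100    # Eq. (11)
        return pc, pa, 100.0 - pa           # Eq. (12)

    print(wer_scores("one two three four".split(), "one three four five".split()))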


5.3. Performance analysis using formant modification

Formant modification has been performed to further enhance the baseline system's performance; the system trained on the Original_Adult and Train_Child corpora is considered as the baseline system. Formant modification is done on the adult speech to mitigate its difference with children speech.

As shown in Table 4, we experimented with different values of the warping factor in formant modification. The application of formant modification improves the efficiency of the ASR model, as WER is reduced from 7.98% to 6.67%. We tweaked the warping factor (in the range between -1 and 1), out of which 0.20 outperforms the other values. It results in the lowest WER, i.e., 6.67%. Hence, we have used 0.20 as the value of the warping factor for further experimentation. A relative gain of 16.4% is observed on using the formant modified adult speech corpus along with the Original_Adult and Train_Child datasets.

Table 4
Performance analysis using formant modification on the MFCC front end approach.

Formant Modification Dataset | WER (%)
Original_Adult + Train_Child | 7.98
Original_Adult + Formanted_Adult F1 (0.15) + Train_Child | 6.78
Original_Adult + Formanted_Adult F2 (0.20) + Train_Child | 6.67
Original_Adult + Formanted_Adult F3 (0.25) + Train_Child | 6.72

5.4. Performance analysis using formant modification with data augmentation

In order to further validate the effectiveness of formant modification, we combine it with data augmentation. Together, these two techniques tackle the issues of data scarcity and acoustic mismatch. Table 5 shows the results after applying augmentation as well as formant modification. We use the three augmented datasets (original, random and sampled) along with the formant modified adult data with a warping factor of 0.2 to test the performance of the system.

It can be seen that using the sampled dataset is the most suitable. WER dropped to 6.31% using formant modification as well as the sampled augmented dataset. Keeping Original_Adult, Train_Child and Sampled data fixed, we observed a relative gain of 9% by adding formant modified adult data while training our system.

Table 5
Performance analysis using formant modification with data augmentation.

Augmentation + Formant Modification using MFCC approach | WER (%)
Original_Adult + Train_Child + Sampled | 6.94
Original_Adult + Formanted_Adult F2 (0.20) + Train_Child + Original | 6.52
Original_Adult + Formanted_Adult F2 (0.20) + Train_Child + Random | 6.79
Original_Adult + Formanted_Adult F2 (0.20) + Train_Child + Sampled | 6.31

To further analyse the system performance, the augmented audios obtained on the original signals are combined and processed with different combinations of front end feature vector approaches. The role of these features is to produce a robust feature vector, which is only possible by analysing the efficiency of each individual feature set, later combined with the MFCC feature vectors. The final WER obtained on the pooled dataset is shown in Fig. 4, where the MFCC combined FDLP features performed better in comparison to the earlier MFCC based system only.

Here, we also expand the system's vocabulary by introducing lexically diverse words, i.e. we supplied additional words in our pronunciation dictionary and compared the performance. This empowers the pronunciation model to match phones to words. Table 6 shows the different result scenarios. It can be observed that after adding 10 k extra words, WER is at its lowest. Introducing lexical diversity made our model perform better, with WER dropping from 5.87% to 5.63%; thus, a relative improvement of 4% is noticed. Overall, in comparison to the initial baseline system, which achieved a WER of 7.98%, the system improved by pooling the original and formanted augmented datasets and using hybrid front end features (FDLP + MFCC) achieves a final WER of 5.63%, a relative improvement of 29.44%.

Table 6
Performance analysis with lexically diverse words (using Adult + Child + Adult Formanted F2 (0.20) + Sampled).

No. of words added | WER (%)
0 | 5.87
5 k | 5.71
10 k | 5.63
20 k | 5.79

Fig. 4. Performance analysis using formant modification on hybrid front end approaches.


5.5. Discussion

Initially, the performance of the proposed continuous ASR system for the Punjabi language was analysed by varying the front-end and modeling approaches, and it can be observed from the results that TDNN based acoustic models outperform DNN with a relative gain of approximately 32% while training on Original_Adult and Train_Child. Secondly, the performance of this ASR system is improved by the application of data augmentation and formant modification, separately. It has been observed that data augmentation reduces WER to 6.94% from 7.98%, giving a relative improvement of 13% compared to our baseline system, and formant modification results in the lowest WER, i.e., 6.67%, and a relative gain of 16.4%. Then, analysis has been carried out using formant modification with data augmentation, which resulted in WER dropping to 6.31% and the relative gain further increasing to 9%. Finally, this improved system has been tested with the MFCC + FDLP hybrid feature set and an expanded vocabulary, which helped in achieving a final WER of 5.63% with a relative improvement of 29.44%.

5.6. Discussion and comparative analysis with earlier proposed techniques

Most of the research work in ASR has been around high resource languages like English, Spanish, Mandarin etc., because of technological advancements in the regions where these languages are spoken. Hence, an ample amount of speech data is available for studies in these languages. However, such state of the art datasets are not available for Indian languages such as Hindi, Punjabi, Dogri etc. Hence, data scarcity remains a big challenge in developing state-of-the-art ASR systems for these languages. Table 7 gives a comparative analysis of the proposed work with some existing state of the art ASR systems implemented for the Punjabi language. The research works proposed in [48-50] use a Punjabi child speech corpus, and the work proposed in [51] uses both Punjabi child and adult speech corpora. It can be clearly observed from the given comparison that the proposed work of this paper, i.e. combining data augmentation with formant modification and hybrid MFCC-FDLP-M features, outperforms existing works.

Table 7
Analysis and comparison of proposed approach with existing works (R.I. = relative improvement).

Approach | Feature extraction | Acoustic modeling | Data-set | Performance rate (WER) | Remarks
Kadyan et al. (2021) [48] | MFCC | DNN | Punjabi Child Speech Corpus | R.I. 50.10% | Children speech corpus is augmented using prosody and Tacotron 2 augmentation techniques
Kaur and Kadyan (2020) [49] | MFCC | Boosted Maximum Mutual Information | Punjabi Child Speech Corpus | R.I. 22–26% | A small corpus has been framed for Punjabi children's speech. The system has been processed using discriminative approaches
Kadyan et al. (2021) [50] | MF-GFCC + pitch + VTLN | DNN-HMM, GMM-HMM | Punjabi Child Speech Corpus | R.I. 20.59% on noisy and 19.39% on clean environment | Sentence level medium size Punjabi children ASR system. Testing has been performed in clean and noisy conditions
Bawa and Kadyan (2021) [51] | GFCC + pitch + VTLN | DNN-HMM | Punjabi Adult Speech Corpus, Punjabi Children Speech Corpus | R.I. 30.94% | Gender based selection with a medium size sentence level Punjabi children ASR system, employed under mismatched and varying environment conditions
Proposed system | MFCC + FDLP | TDNN | Punjabi Adult Speech Corpus, Punjabi Children Speech Corpus | R.I. 29.44% | Punjabi ASR system for children speech using data augmentation and formant modification

6. Conclusion

A novel approach to improve the performance of ASR systems for children, targeting low resource languages, has been proposed. Data augmentation has been done using Tacotron 2 to tackle the issue of data scarcity. Sampled data augmentation on the children speech corpus has proved to give the best results. Formant modification is applied on the adult speech corpus to mitigate the acoustic and linguistic variabilities. The combined dataset includes the training part of the children speech corpus, the sampled augmented children corpus, the original adult speech corpus as well as the formanted adult speech corpus. Feature extraction is done using multiple front end approaches: MFCC, FDLP-S and FDLP-M; later, the best outputs of these approaches are combined as MFCC + FDLP-M to generate robust front end features. We have used TDNN as the acoustic model generating mono-phones, a trigram language model, and supplied 10,000 words to the pronunciation model to make it lexically diverse. Speaker Adaptive Training is also done to make the model more robust for unseen speakers. The proposed hybrid front end feature based children ASR system gives a WER of 5.63% when tested on the testing part of the children speech corpus only, whereas using the training part of the children speech corpus with the adult speech corpus gives a WER of 7.98% using the TDNN acoustic model. Overall, we achieved a relative gain of 29.44% from the baseline model using our proposed approach. Further work can be extended by employing spectrogram augmentation and deep conversion methods to artificially enhance the training data and accordingly increase system efficiency.

CRediT authorship contribution statement

Mohit Dua: Supervision, Software, Validation, Investigation. Virender Kadyan: Writing – original draft, Visualization. Neha Banthia: Conceptualization, Validation. Akshit Bansal: Methodology, Software. Tanya Agarwal: Data curation, Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Lopatovska I, Rink K, Knight I, Raines K, Cosenza K, Williams H, et al. Talk to me: Exploring user interactions with the Amazon Alexa. J Librarianship Inf Sci 2019;51(4):984–97.
[2] Sharma AS, Bhalley R. ASR—A real-time speech recognition on portable devices. In: 2016 2nd International Conference on Advances in Computing, Communication, & Automation (ICACCA) (Fall). IEEE; 2016. p. 1–4.
[3] Janssen CP, Donker SF, Brumby DP, Kun AL. History and future of human-automation interaction. Int J Hum Comput Stud 2019;131:99–107.
[4] Sheridan TB, Parasuraman R. Human-automation interaction. Rev Human Factors Ergon 2015;1:41.
[5] Bachate RP, Sharma A. Automatic Speech Recognition Systems for Regional Languages in India. Int J Recent Technol Eng; p. 585–92.


[6] Moore RK. A comparison of the data requirements of automatic speech recognition systems and human listeners. In: Eighth European Conference on Speech Communication and Technology; 2003.
[7] Antoniou A, Storkey A, Edwards H. Data augmentation generative adversarial networks; 2017. arXiv preprint arXiv:1711.04340.
[8] Kathania H, Singh M, Grósz T, Kurimo M. Data augmentation using prosody and false starts to recognize non-native children's speech; 2020. arXiv preprint arXiv:2008.12914.
[9] Park DS, Chan W, Zhang Y, Chiu CC, Zoph B, Cubuk ED, Le QV. Specaugment: A simple data augmentation method for automatic speech recognition; 2019. arXiv preprint arXiv:1904.08779.
[10] Jaitly N, Hinton GE. Vocal tract length perturbation (VTLP) improves speech recognition. In: Proc. ICML Workshop on Deep Learning for Audio, Speech and Language (vol. 117); 2013.
[11] Cui X, Goel V, Kingsbury B. Data augmentation for deep neural network acoustic modeling. IEEE/ACM Trans Audio Speech Lang Process 2015;23(9):1469–77.
[12] Ittichaichareon C, Suksri S, Yingthawornsuk T. Speech recognition using MFCC. In: International Conference on Computer Graphics, Simulation and Modeling. p. 135–8.
[13] Hermansky H, Tsuga K, Makino S, Wakita H. Perceptually based processing in automatic speech recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE; 1986. p. 1971–4.
[14] Athineos M, Ellis DP. Frequency-domain linear prediction for temporal features. In: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721). IEEE; 2003. p. 261–6.
[15] Zhang Z, Geiger J, Pohjalainen J, Mousa A-D, Jin W, Schuller B. Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Trans Intell Syst Technol 2018;9(5):1–28.
[16] Guglani J, Mishra AN. Continuous Punjabi speech recognition model based on Kaldi ASR toolkit. Int J Speech Technol 2018;21(2):211–6.
[17] Kathania HK, Kadiri SR, Alku P, Kurimo M. Study of formant modification for children ASR. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2020. p. 7429–33.
[18] Sunil Y, Prasanna SRM, Sinha R. Children's speech recognition under mismatched condition: A review. IETE J Educ 2016;57(2):96–108.
[19] Huber JE, Stathopoulos ET, Curione GM, Ash TA, Johnson K. Formants of children, women, and men: The effects of vocal intensity variation. J Acoust Soc Am 1999;106(3):1532–42.
[20] Dua M, Aggarwal RK, Biswas M. GFCC based discriminatively trained noise robust continuous ASR system for Hindi language. J Ambient Intell Hum Comput 2019;10(6):2301–14.
[21] Adda G, Stüker S, Adda-Decker M, Ambouroue O, Besacier L, Blachon D, et al. Breaking the unwritten language barrier: The BULB project. Proc Comput Sci 2016;81:8–14.
[22] Potamianos A, Narayanan S, Lee S. Automatic speech recognition for children. In: Fifth European Conference on Speech Communication and Technology; 1997.
[23] Shahnawazuddin S, Adiga N, Kathania HK. Effect of prosody modification on children's ASR. IEEE Signal Process Lett 2017;24(11):1749–53.
[24] O'Shaughnessy D. Linear predictive coding. IEEE Potentials 1988;7(1):29–32.
[25] Cooper EL. Text-to-speech synthesis using found data for low-resource languages (Doctoral dissertation). Columbia University; 2019.
[26] Oord AVD, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kavukcuoglu K. Wavenet: A generative model for raw audio; 2016. arXiv preprint arXiv:1609.03499.
[27] Li J, Gadde R, Ginsburg B, Lavrukhin V. Training neural speech recognition systems with synthetic speech augmentation; 2018. arXiv preprint arXiv:1811.00707.
[28] Ragni A, Knill KM, Rath SP, Gales MJ. Data augmentation for low resource languages. In: INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association. p. 810–4.
[29] Rosenberg A, Zhang Y, Ramabhadran B, Jia Y, Moreno P, Wu Y, et al. Speech recognition with augmented synthesized speech. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE; 2019. p. 996–1002.
[30] Gerosa M, Giuliani D, Brugnara F. Acoustic variability and automatic recognition of children's speech. Speech Commun 2007;49(10-11):847–60.
[31] Kadyan V, Mantri A, Aggarwal RK, Singh A. A comparative study of deep neural network based Punjabi-ASR system. Int J Speech Technol 2019;22(1):111–9.
[32] Kadyan V, Shanawazuddin S, Singh A. Developing children's speech recognition system for low resource Punjabi language. Appl Acoust 2021;178:108002. https://doi.org/10.1016/j.apacoust.2021.108002.
[33] Shen J, Pang R, Weiss RJ, Schuster M, Jaitly N, Yang Z, et al. Natural TTS synthesis by conditioning Wavenet on mel spectrogram predictions. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2018. p. 4779–83.
[34] Jia Y, Zhang Y, Weiss RJ, Wang Q, Shen J, Ren F, Wu Y. Transfer learning from speaker verification to multispeaker text-to-speech synthesis; 2018. arXiv preprint arXiv:1806.04558.
[35] Deng Y, He L, Soong F. Modeling multi-speaker latent space to improve neural TTS: Quick enrolling new speaker and enhancing premium voice; 2018. arXiv preprint arXiv:1812.05253.
[36] Herre J, Johnston JD. Enhancing the performance of perceptual audio coders by using temporal noise shaping (TNS). In: Audio Engineering Society Convention 101. Audio Engineering Society; 1996.
[37] Thomas S, Ganapathy S, Hermansky H. Phoneme recognition using spectral envelope and modulation frequency features. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; 2009. p. 4453–6.
[38] Panayotov V, Chen G, Povey D, Khudanpur S. Librispeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2015. p. 5206–10.
[39] Russell M. The PF-STAR British English children's speech corpus. The Speech Ark Limited; 2006.
[40] Wang Y, Skerry-Ryan RJ, Stanton D, Wu Y, Weiss RJ, Jaitly N, Saurous RA. Tacotron: Towards end-to-end speech synthesis; 2017. arXiv preprint arXiv:1703.10135.
[41] Park H, Lee D, Lim M, Kang Y, Oh J, Kim JH. A fast-converged acoustic modeling for Korean speech recognition: A preliminary study on time delay neural network; 2018. arXiv preprint arXiv:1807.05855.
[42] Waibel A, Hanazawa T, Hinton G, Shikano K, Lang KJ. Phoneme recognition using time-delay neural networks. IEEE Trans Acoust Speech Signal Process 1989;37(3):328–39.
[43] Anastasakos T, McDonough J, Schwartz R, Makhoul J. A compact model for speaker-adaptive training. In: Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP'96), vol. 2. IEEE; 1996. p. 1137–40.
[44] Chung YA, Wang Y, Hsu WN, Zhang Y, Skerry-Ryan RJ. Semi-supervised training for improving data efficiency in end-to-end speech synthesis. In: ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2019. p. 6940–4.
[45] Haeb-Umbach R, Ney H. Linear discriminant analysis for improved large vocabulary continuous speech recognition. In: Proc. ICASSP, vol. 1; 1992. p. 13–6.
[46] Gales MJF. Maximum likelihood linear transformations for HMM-based speech recognition. Comput Speech Lang 1998;12(2):75–98.
[47] Parthasarathi SHK, Hoffmeister B, Matsoukas S, Mandal A, Strom N, Garimella S. fMLLR based feature-space speaker adaptation of DNN acoustic models. In: Sixteenth Annual Conference of the International Speech Communication Association; 2015.
[48] Kadyan V, Kathania H, Govil P, Kurimo M. Synthesis speech based data augmentation for low resource children ASR. In: International Conference on Speech and Computer. Cham: Springer; 2021. p. 317–26.
[49] Kaur H, Kadyan V. Feature space discriminatively trained Punjabi children speech recognition system using Kaldi toolkit. In: Proceedings of the International Conference on Innovative Computing & Communications (ICICC); 2020.
[50] Kadyan V, Bawa P, Hasija T. In domain training data augmentation on noise robust Punjabi children speech recognition. J Ambient Intell Hum Comput 2021:1–17.
[51] Bawa P, Kadyan V. Noise robust in-domain children speech enhancement for automatic Punjabi recognition system under mismatched conditions. Appl Acoust 2021;175:107810. https://doi.org/10.1016/j.apacoust.2020.107810.
