Proceedings of the International Conference on Industrial Engineering and Operations Management,
7th Bangladesh Conference on Industrial Engineering and Operations Management
December 21-22, 2024
Publisher: IEOM Society International, USA DOI: 10.46254/BA07.20240161
Published: December 21, 2024
Enhancing Bangla Local Speech-to-Text Conversion Using
Fine-Tuning Wav2vec 2.0 with OpenSLR and Self-
Compiled Datasets Through Transfer Learning
Sk Muktadir Hossain, Md Rahat Rihan and Ahmed Imtiaz
Department of Computer Science and Engineering
American International University-Bangladesh (AIUB)
Dhaka, Bangladesh
[email protected], [email protected], [email protected]

Pritam Khan Boni, Dipta Justin Gomes
Lecturer, Department of Computer Science
American International University-Bangladesh (AIUB)
Dhaka, Bangladesh
[email protected], [email protected]

Abstract
This paper presents an improved method for converting standard and local (dialectal) Bangla speech to text. The wav2vec 2.0 model has been fine-tuned using additional, self-compiled datasets collected alongside OpenSLR data. Our findings show gains in transcription accuracy of as much as eleven percent, which is notable given the low-resource setting, demonstrating the merits of transfer learning and fine-tuning. The research aims to expand the knowledge base concerning the use of novel deep learning algorithms for small languages in the
field of speech technology. The evaluation metrics included Word Error Rate (WER) and Character Error Rate
(CER), with the fine-tuned model achieving an overall WER of 11.27% and CER of 6.03%. Comparative analysis
with previous work shows a significant improvement from baseline models, highlighting the efficacy of the
wav2vec 2.0 model in leveraging large and diverse datasets. The experimental setup was supported by a cluster
computing environment with NVIDIA CUDA-compatible GPUs, underscoring the computational resources
required for effective Automatic Speech Recognition (ASR) model training. The results demonstrate substantial
advancements in ASR performance for Bengali, with the fine-tuned model outperforming previous benchmarks
and showcasing the benefits of self-supervised learning approaches.
Keywords
Bangla Speech Recognition, wav2vec 2.0, Transfer Learning, Speech Technology, Automatic Speech
Recognition (ASR).
1. Introduction
With the emergence of Artificial Intelligence, various facets of human life have improved, creating scope for further exploration and research. Within this landscape, speech recognition, covering speech-to-text (STT) and voice recognition, is a well-established field of research addressing important problems in daily life. STT is an essential domain of research with a huge impact on real life, including aiding hearing-impaired people in following conversations, detecting child abuse (Vásquez-Correa and Álvarez Muniain 2023), transcribing speeches into writing, and powering voice-controlled interfaces. Although there has been active development of STT systems for large languages like English and Mandarin (Zhang, Haddow and Sennrich 2022; Li et al. 2023), a lot of work remains to be done for many Low Resource Languages (LRLs) (Sinha et al. 2024), including Bangla (Akther and Debnath 2022). More than 290 million people
speak Bangla worldwide (Sadhu et al. 2021), especially in Bangladesh and West Bengal of India, but current state-of-the-art Automatic Speech Recognition (ASR) systems for Bangla lack the maturity available for widely spoken languages. Specifically, the latest innovations within deep learning, where transformers make up a significant part of ASR models, have demonstrated considerable potential for real-world applications (Vásquez-Correa and Álvarez Muniain 2023). Among these models, one of the most successful is wav2vec 2.0 (Park et al. 2008; Sinha et al. 2024). Wav2vec is trained using self-supervision and, as reported by its developers at Facebook's AI lab, has recently shown the highest accuracy in various speech recognition tasks (Schneider et al. 2019). Different parts of the country speak various local languages, but in nearly all circumstances Bangla is the speakers' first or second language. The Chittagonian language is particularly important, as it is spoken in a key tourist destination of Bangladesh. While these languages have considerable cultural and regional significance, they remain under-researched. In this study, we try to improve on the methods that were used for other languages spoken in Bangladesh, with a focus on Chittagong. These enhancements should make it easier to build tools for natural language processing, text translation, and speech recognition. This research thus also fills a critical gap in efforts to increase access and communication for speakers of local languages, both in the development of tools and resources for technological interfaces and in the field of travel and tourism. The main contributions of this paper are
listed below:
• Dialect-Specific Dataset Creation: Development of a dedicated Chittagong language dataset to support future
research in speech recognition and natural language processing for low-resource languages.
• Improved ASR for Chittagong Dialect: Enhancement of a state-of-the-art Automatic Speech Recognition (ASR) system, particularly Wav2vec 2.0, to better handle the dialectal peculiarities of the Chittagonian language and improve transcription accuracy.
• Framework for Low-Resource Languages: Proposal of a comprehensive framework for managing and
interpreting the Chittagong dialect within the broader context of Bangla, offering a model for other regional
languages in Bangladesh.
• Technological Empowerment of Regional Languages: Contribution towards promoting regional languages through speech technology, increasing accessibility and technological engagement for speakers of local languages, especially in the context of tourism and regional communication.
1.1 Objectives
• Develop a dialect-specific dataset for Chittagonian speech.
• Improve ASR performance for Bangla and its dialects using Wav2vec 2.0.
• Provide a framework for low-resource language ASR development.
• Promote regional languages through advanced speech technology.
2. Literature Review
Speech-to-text systems have evolved from primitive rule-based methods to more advanced machine-learning models. Early models made extensive use of phonetic dictionaries and handcrafted rules, which were time-consuming to build and not very reliable (Bakiri 1991). The introduction of statistical models like Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) brought considerable development and allowed for the building of more precise and large-scale speech recognition systems (Baker et al. 2009; Jelinek 1997).
In the recent past, deep learning has ushered in the next era of ASR with the introduction of neural network-based architectures (Graves, Mohamed, and Hinton 2013; Hinton et al. 2012). Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been widely employed to capture the temporal dependencies of speech signals (Abadi et al. 2016; Hannun et al. 2014). One of the most famous examples is DeepSpeech, which employs RNNs with Connectionist Temporal Classification (CTC) for end-to-end speech recognition (Amodei et al. 2016). Transformer-based models have taken the field to a new level by enhancing parallelization and improving the modelling of long-range dependencies (Vaswani et al. 2017; Dong, Xu, and Xu 2018). The wav2vec 2.0 model, which uses a transformer architecture for self-supervised learning, has shown remarkable improvements in ASR systems (Schneider et al. 2019). However, the majority of research and development has primarily targeted high-resource languages rather than low-resource languages such as Bangla (Gupta et al. 2023).
Sinha et al. (2024) evaluated speech modification methods for Wav2Vec2 models to study how children's speech, and modifications to it, affect recognition, using the PF-Star and CMU Kids datasets. Their study found that the Wav2Vec2 ASR model underperforms for children aged below ten years, and that modifying the speech with other methods performs best in several scenarios. The researchers examined various age groups and highlighted their findings accordingly. In conclusion, the authors also indicated
that PF-Star gives better results than the CMU Kids dataset. In the research by Vásquez-Correa and Álvarez Muniain (2023), the authors examined how Law Enforcement Agencies (LEAs) investigating online child exploitation use audio material to detect keywords associated with child abuse. The researchers proposed models based on Wav2Vec2 and Whisper to enhance the accuracy of keyword detection. The models perform well under various scenarios, yielding promising word error rates of 11% and 25%; the accuracies depend heavily on the language of the speech, and the two models achieve true-positive rates of 82% and 98%, respectively. The findings also suggest that a federated learning approach is far better than centralized learning with neural networks.
2.1 Current Bangla STT System
A limited amount of work on Bangla STT has been done in comparison to other high-resource languages. Initial studies mainly employed HMMs and GMMs, as exemplified by Roy et al. (R., D., and G. 2002), who created a Bangla ASR system with these models. Recent studies have incorporated deep learning approaches. Hossain et al. (2013) attempted to use RNNs for Bangla speech recognition with moderate success, although they found the data shortage a significant issue. The availability of datasets such as the Bengali Common Voice corpus by Mozilla (Alam et al. 2022) and the OpenSLR Bengali dataset (Alam et al. 2022) made further advanced research possible. However, these datasets alone are not enough for training very accurate models, which makes it necessary to gather and incorporate more data.
The paper by Samiul et al. details the Bengali Common Voice Corpus v9.0, comprising 231,120 samples collected from 19,817 contributors and totaling 399 hours of speech recordings. Of these, 56 hours (14%) have been validated. Each audio clip comes with a sentence annotation and additional metadata, including upvotes, downvotes, age, gender, and accent. The paper compares the Common Voice Bengali Corpus with other Bengali speech datasets; it stands out for its larger size (399 hours) and greater feature diversity compared to datasets like OpenSLR. In feature analysis, Geneva and SpeechVGG features show that Common Voice exhibits more diverse characteristics, with significant differences in sound levels, voiced segments, and pitch compared to OpenSLR. In terms of benchmarking, the dataset is evaluated in ASR experiments using the Kaldi (HMM-GMM) model, which achieved a WER of 24.4. A comparison with a Wav2vec2 model trained on the OpenSLR dataset showed a WER of 39.291 and a CER of 13.856.
3. Methodology
3.1 Data Collection
The publicly available OpenSLR dataset, which contains a substantial amount of Bangla speech data, was selected for this research. Additionally, our study utilised several self-compiled Bangla speech collections totalling 684,510 audio files: 240,918 files of Standard Bangla, 247,565 files of the Chittagonian local dialect, and 196,027 files of the Sylheti local dialect, each stored as FLAC audio files.
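For illustration, a simple manifest pairing each audio file with its transcript could be assembled as in the Python sketch below; the directory layout and the convention of a same-named .txt transcript per FLAC file are assumptions, not details given in the paper.

# Sketch for assembling a training manifest from the collected FLAC files.
# Directory names and the per-file .txt transcript convention are hypothetical.
import csv
from pathlib import Path

DIALECT_DIRS = {  # hypothetical local paths
    "standard": Path("data/standard_bangla"),
    "chittagonian": Path("data/chittagong_local"),
    "sylheti": Path("data/sylhet_local"),
}

def build_manifest(out_path: str = "manifest.csv") -> None:
    """Write (audio_path, transcript, dialect) rows for every transcribed FLAC file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_path", "transcript", "dialect"])
        for dialect, root in DIALECT_DIRS.items():
            for flac in sorted(root.rglob("*.flac")):
                txt = flac.with_suffix(".txt")
                if txt.exists():
                    writer.writerow([str(flac), txt.read_text(encoding="utf-8").strip(), dialect])

if __name__ == "__main__":
    build_manifest()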
3.2 Model training
1) OpenSLR Dataset: A larger dataset available from OpenSLR is the Bengali Common Voice dataset, which provides hundreds of hours of Bangla speech data from volunteers. This results in a broad spectrum of accents, dialects, and speaking styles that can be accounted for when using this dataset to train STT models (Alam et al. 2022). To further enrich the OpenSLR dataset, more audio has been sourced from the following areas:
• Television Broadcasts: We obtained 77K segments from news programs, discussion shows, and interviews.
• Online Videos: Videos were sourced from YouTube and similar platforms, featuring educational lessons, lectures, and public speeches.
• Public Speeches: Recordings of political speeches, cultural events, and public announcements were also collected.
2) Manual Transcription: Audio data were logged within the same recording session, and all collected audio data were manually transcribed to ensure high-quality labels. For training, we used Hugging Face's implementation of the wav2vec 2.0 model, specifically facebook/wav2vec2-large-xlsr-53 (F. AI, 2021), which was pre-trained on multilingual voice data. 56 epochs were run, resulting in approximately 50 hours of training utilising an NVIDIA RTX 4090 GPU.
3) Training Plateau: Around epoch 56, a Word Error Rate (WER) of 0.11 was recorded, with further training exhibiting diminishing returns.
4) Hyperparameter Details (used in the fine-tuning sketch after this list):
• Learning Rate: 1×10^-5
• Batch Size: 32
• Epochs: 56
• Dropout Rate: 0.1
• Gradient Clipping Threshold: 1.0
• Optimizer: AdamW with weight decay regularization.
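For illustration, the configuration above could be expressed with the Hugging Face Transformers API as in the following Python sketch. The checkpoint name comes from the paper, but vocab.json, the dataset preparation, the CTC data collator, and the weight-decay value are assumptions or placeholders rather than details reported in this work.

# Minimal fine-tuning sketch (Hugging Face Transformers). Assumed input:
# a Bangla character vocabulary in vocab.json; dataset preparation and the
# CTC padding data collator are omitted here.
from transformers import (TrainingArguments, Wav2Vec2CTCTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC, Wav2Vec2Processor)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0, do_normalize=True,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",   # checkpoint named in the paper
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
    hidden_dropout=0.1,                  # dropout rate reported in the paper
    attention_dropout=0.1,
)
model.freeze_feature_encoder()           # common practice when fine-tuning wav2vec 2.0

args = TrainingArguments(
    output_dir="wav2vec2-bangla",
    per_device_train_batch_size=32,      # batch size reported in the paper
    num_train_epochs=56,                 # epochs reported in the paper
    learning_rate=1e-5,                  # learning rate reported in the paper
    max_grad_norm=1.0,                   # gradient clipping threshold reported in the paper
    weight_decay=0.01,                   # AdamW weight decay (value assumed)
    fp16=True,
)

# A transformers.Trainer would then combine the model, the arguments above, the
# prepared train/validation datasets, and a CTC padding data collator:
# Trainer(model=model, args=args, data_collator=..., train_dataset=...,
#         eval_dataset=..., tokenizer=processor.feature_extractor).train()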
3.3 Data Preprocessing
Pre-processing steps are crucial to preparing the dataset for training and include:
1) Normalization: As common pre-processing steps, text normalisation included the conversion of all text to
lowercase, the removal of punctuation, and the standardization of misspellings (Aliero et al. 2023).
2) Noise Reduction: Noise reduction algorithms were applied to the audio files to improve speech quality.
3) Segmentation: Long audio files were clipped into shorter, realistic segments, most lasting between five and thirty seconds (a small sketch of these steps follows this list).
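A minimal Python sketch of these pre-processing steps is given below; the punctuation set, the 16 kHz target sampling rate, and the use of librosa are assumptions rather than details specified in the paper.

# Minimal sketch of the pre-processing steps above (assumed parameter choices).
import re
import librosa

PUNCT = re.compile(r"[।॥,.!?;:\"'()\-]")  # Bangla danda plus common punctuation (assumed set)

def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (step 1)."""
    text = text.lower()
    text = PUNCT.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def segment_audio(path: str, sr: int = 16000, max_sec: float = 30.0):
    """Load an audio file and yield chunks no longer than max_sec seconds (step 3).
    Noise reduction (step 2) could be applied to each chunk, e.g. with the
    noisereduce package (assumed tooling)."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    step = int(max_sec * sr)
    for start in range(0, len(audio), step):
        yield audio[start:start + step]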
3.4 Model Selection and Fine-Tuning
The wav2vec 2.0 model was chosen because it has shown strong performance across various ASR tasks. The model employs a transformer-based structure and self-supervised training to learn speech representations directly from the raw waveform (Schneider et al. 2019). The model used in this research is illustrated in Figure 1.
3.5 Fine-Tuning Process
The pre-trained wav2vec 2.0 model was fine-tuned on the combined dataset (OpenSLR plus the self-compiled collections). The process involved the following:
3.6 Learning Rate Scheduling
To increase the robustness of the training process and guarantee convergence, adaptive learning rate scheduling
was used (Xu et al. 2019).
1) Gradient Clipping: Gradient clipping was used to cap the magnitude of the gradients and avoid the large updates that can cause fluctuations in the training process (Zhang et al. 2019).
2) Data Augmentation: Signal processing techniques such as time stretching and pitch scaling were applied, and background noise was added, to increase the amount of training data and enhance model resilience (Chung and Mak 2021); an illustrative sketch follows.
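An illustrative Python sketch of such augmentation is shown below; the stretch and pitch ranges and the noise level are assumed values, and librosa is used as one possible implementation.

# Illustrative data augmentation sketch (time stretching, pitch scaling,
# additive background noise); parameter ranges are assumptions.
import numpy as np
import librosa

def augment(audio: np.ndarray, sr: int = 16000, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    # Time stretching: randomly speed up or slow down by up to ~10%.
    audio = librosa.effects.time_stretch(audio, rate=float(rng.uniform(0.9, 1.1)))
    # Pitch scaling: shift by up to +/- 2 semitones.
    audio = librosa.effects.pitch_shift(audio, sr=sr, n_steps=float(rng.uniform(-2.0, 2.0)))
    # Background noise: add low-level Gaussian noise.
    audio = audio + 0.005 * rng.standard_normal(len(audio))
    return audio.astype(np.float32)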
3.7 Reliability and Validity
To ensure the internal and external validity of the research, the following steps have been carried out:
Dataset Validation: After data cleaning, the audio samples and their corresponding transcriptions were divided randomly into three sets with the aim of reducing overfitting and cross-checking the results: a training set, a validation set, and a test set.
Cross-Validation: To test the model across different subsets of the data, k-fold cross-validation was employed to assess the reliability of the developed model (Yang et al. 2019); a small splitting sketch is given at the end of this subsection.
Inter-Rater Reliability: A portion of the transcripts was coded by more than two coders so that inter-rater agreement could be calculated and coding fidelity maintained (Kim 2018).
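The dataset partitioning and k-fold validation described above could be set up as in the following sketch; the 80/10/10 split and k = 5 are assumptions, since the paper does not report the exact proportions or the number of folds.

# Sketch of dataset partitioning and k-fold cross-validation (assumed ratios).
from sklearn.model_selection import KFold, train_test_split

def make_splits(items, seed: int = 42):
    """Random train/validation/test split (assumed 80/10/10)."""
    train, rest = train_test_split(items, test_size=0.2, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, val, test

def kfold_indices(items, k: int = 5, seed: int = 42):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    yield from kf.split(items)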
4. Experimental Setup
4.1 Training Environment
The experiments were performed on a cluster computing setup with access to several NVIDIA CUDA-compatible graphics processing units. The GPUs accounted for a large proportion of the computational effort required to train the models.
Figure 1. Bengali Language Model Design
4.2 Evaluation metrics
Word Error Rate (WER) and Character Error Rate (CER) were utilized as the two primary measures for comparing the effectiveness of different models. WER measures the proportion of word-level errors made by the model, while CER measures the proportion of character-level errors. To evaluate the model's overall performance, we employed:
The formula for Word Error Rate (WER) (Park et al. 2008) is given below, where insertions are extra words in the predicted transcript that are not present in the reference, deletions are words present in the reference but missing from the predicted transcript, and substitutions are incorrectly recognized words in the predicted transcript:
WER = (Insertions + Deletions + Substitutions) / (Number of words in the reference)    (1)
Similarly, the formula for Character Error Rate (CER) is:
CER = (Insertions + Deletions + Substitutions) / (Number of characters in the reference)    (2)
WER ranges from 0 to 1 (or 0% to 100%), with 0 indicating a perfect transcription. The overall performance of
WER and CER is presented in Table 1.
Table 1. WER and CER for overall performance

All Language            WER %    CER %
Overall Performance     11.27    6.03
The overall transcription performance in terms of Word Error Rate (WER) and Character Error Rate (CER) is summarized in Table 1. A WER of 11.27% and a CER of 6.03% indicate strong model accuracy at the word and character levels. Overall, these results demonstrate the model's effectiveness at avoiding transcription errors across all the language varieties used.
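As an illustration, these metrics can be computed with the jiwer package, one possible tool; the paper does not state which implementation was used.

# Illustrative WER/CER computation with the jiwer package (assumed tool choice).
import jiwer

reference = "আমি ভাত খাই"      # ground-truth transcript ("I eat rice")
hypothesis = "আমি ভাত খায়"     # hypothetical model output with one wrong word

wer = jiwer.wer(reference, hypothesis)  # (S + D + I) / number of reference words, Eq. (1)
cer = jiwer.cer(reference, hypothesis)  # same ratio at the character level, Eq. (2)
print(f"WER = {wer:.2%}, CER = {cer:.2%}")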
5. Results
5.1 Baseline Performance
To determine the performance level before fine-tuning on the target language, the experiment used the pre-trained wav2vec 2.0 model without fine-tuning. On the evaluation set, this model recorded a WER of 25.24%. A comparison of the fine-tuned model with the non-fine-tuned wav2vec 2.0 is provided in Table 2.
Table 2. Analysis of Bangla local language works

Research Study            Model Used          Dataset Used                 WER (%)  CER (%)  Key Results/Notes
Wav2vec 2.0 model         Wav2vec 2.0         OpenSLR, self-compiled       25.24    13.29    Establishes the performance limitations of the model
(without fine-tuning)     (not fine-tuned)    cross-validated datasets                       before domain-specific adaptation, emphasizing the
                                              (684,510 files)                                necessity of fine-tuning to improve accuracy on
                                                                                             Bangla language datasets.
Proposed model            Wav2vec 2.0         OpenSLR, self-compiled       11.27    6.03     Significant improvement with larger datasets and
                          (fine-tuned)        cross-validated datasets                       advanced self-supervised learning.
                                              (684,510 files)
Table 2 provides a comparison of the WER and CER metrics of the baseline model and the proposed model fine-tuned using wav2vec 2.0. It highlights the significant improvements achieved by the proposed approach, especially in the WER and CER percentages.
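For completeness, a minimal inference sketch is shown below; the local checkpoint directory wav2vec2-bangla and the example audio file are placeholders, not artifacts released with this work.

# Minimal inference sketch; "wav2vec2-bangla" and "example.flac" are assumed names.
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("wav2vec2-bangla")
model = Wav2Vec2ForCTC.from_pretrained("wav2vec2-bangla").eval()

audio, _ = librosa.load("example.flac", sr=16000, mono=True)  # resample to 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits      # (batch, time, vocab)
pred_ids = torch.argmax(logits, dim=-1)             # greedy CTC decoding
print(processor.batch_decode(pred_ids)[0])          # predicted transcript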
5.2 Subsequent Contingent Analysis and Graphical Representation
The performance of the fine-tuned model was tested on subsets of the dataset covering different accents and dialects to assess its versatility across domains. The results are summarized as follows:
Figure 2. Bengali Language Model Performance
Figure 2 visualizes the performance comparison of the proposed model against baseline systems in terms of WER
and CER. It shows a clear reduction in error rates for the proposed model, emphasizing its efficacy across standard
Bangla and dialect-specific datasets.
Table 3. WER and CER for overall performance

All Language            WER %    CER %
Overall Performance     11.27    6.03
The system-level performance across all the languages is shown in Table 3. The Word Error Rate (WER) is 11.27% and the Character Error Rate (CER) is 6.03%. The efficiency of the proposed approach in recognizing input (minimizing recognition errors) is clearly evident from these results.
Dialect-Specific Performance:
Table 4. WER and CER for different dialects

Language          WER %    CER %
Standard Bangla    7.43     3.25
Chittagonian      12.00     7.54
Sylheti           14.38     7.29
The performance metrics were plotted to visually compare the baseline and fine-tuned model performance. Table 4 summarizes the WER and CER metrics for the different dialects (Standard Bangla, Chittagonian, and Sylheti). It shows how the proposed model performs across dialects, with Standard Bangla showing the lowest error rates and Sylheti presenting greater challenges.
6. Discussion
The promise of transfer learning for low-resource languages like Bangla is demonstrated by the notable decrease
in WER and CER. Our method makes use of the wav2vec 2.0 model’s advantages and emphasizes the value of
diverse, high-quality datasets for building resilient ASR systems. The increase in accuracy shows that optimizing
a strong model can have significant advantages even with little data.
6.1 Difficulties
Although our results are encouraging, there were a number of difficulties and restrictions. Data Quality: It took a lot
of work and time to ensure that the gathered data had high-quality transcriptions. Bangla has a wide variety of
accents and dialects, which makes it difficult to achieve consistent correctness across all variants. Computing
Capabilities: Large models like wav2vec 2.0 require substantial computational resources for training and fine-
tuning, which may not be available to all researchers.
6.2 Areas for Further Investigation
Potential future research endeavours could investigate various avenues to enhance Bangla speech-to-text (STT)
systems: Expanding the training dataset in terms of both size and diversity has the potential to improve the
performance of the model, especially for accents and dialects that are not well-represented. Implementing
advanced data augmentation techniques has the potential to enhance the robustness of the model to different noise
conditions and speaking styles. By incorporating language models trained on extensive Bangla text corpora, it is
possible to rectify common transcription errors and enhance overall accuracy.
This study emphasises the practicality and efficiency of refining the wav2vec 2.0 model for converting Bangla
speech into written text. Through the strategic utilization of both existing and recently acquired speech data, we
successfully attained notable enhancements in the precision of transcriptions. The significance of employing
sophisticated deep learning methods and extensive datasets to construct resilient ASR systems for languages with
limited resources is emphasised by our research.
7. Conclusion
This research highlights the successful application of the wav2vec 2.0 model in developing an effective Bangla
local speech-to-text system. The improved accuracy achieved through fine-tuning and the use of additional data
demonstrates the potential for advancements in ASR for low-resource languages. Continued efforts in this
direction could bridge the gap and bring more languages to the forefront of speech technology.
In this study, the use of the Wav2vec 2.0 model and transfer learning from pre-trained models have been shown to boost Bangla speech-to-text (STT) accuracy. By training the model on more than 680,000 audio files and optimising performance, we decreased the word error rate (WER) to 11.27% and the character error rate (CER) to 6.03%. That is a considerable step forward over earlier models, which suffered from small datasets and sparse representation of dialects. Challenges faced throughout this project, owing to variances between spoken and written words, made us realise how crucial a broad dataset is for languages like Bangla, where dialects vary among regions (such as Chittagonian and Sylheti). To validate this performance increase, we developed new dialect-targeted datasets for both dialects.
Nevertheless, future research must address the limits imposed by data quality and the computational requirements of training very large models. In future studies, one strategy to make Bangla STT more resilient is to increase the dataset size further and employ better augmentation techniques. In addition, transcription accuracy could be further enhanced by adding language models learnt from large-scale Bangla text corpora. While we have made significant progress in the field of low-resource language processing, open difficulties are still evident, and more scalable, accessible models that can truly cater to regional dialects are needed to fill this gap. This research paper is a stepping stone for Bangla voice recognition, and additional research could lead to applications in technology, accessibility, and communication.
Acknowledgements
The authors dedicate this work in memoriam to Pritam Khan Boni, Lecturer, Department of Computer Science, American International University-Bangladesh (AIUB). We want to honor his vision and efforts in promoting local languages through the use of technology; this research is inspired by him. His passion for this area of research kept us motivated during this journey, and his invaluable feedback paved the path for this work. We would like to thank Dipta Justin Gomes, Lecturer, Department of Computer Science, AIUB, for his assistance and encouragement in completing this study. We are also thankful to our peers and colleagues for their help and valuable discussions.
References
Zhang, B., Haddow, B., & Sennrich, R. "Revisiting end-to-end speech-to-text translation from scratch."
Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning
Research, vol. 162, PMLR, 17–23 July 2022, pp. 26193–26205. Available:
https://proceedings.mlr.press/v162/zhang22i.html.
Li, T., Hu, C., Cong, J., Zhu, X., Li, J., Tian, Q., Wang, Y., & Xie, L. "Diclet-tts: Diffusion model-based cross-
lingual emotion transfer for text-to-speech — a study between English and Mandarin." IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 31, 2023, pp. 3418–3430.
Akther, A., & Debnath, R. "Automated speech-to-text conversion systems in Bangla language: A systematic
literature review." Khulna University Studies, 2022, pp. 566–583.
Sadhu, S., He, D., Huang, C.-W., Mallidi, S. H., Wu, M., Rastrow, A., Stolcke, A., Droppo, J., & Maas, R.
"Wav2vec-c: A self-supervised model for speech representation learning." 2021. Available:
https://arxiv.org/abs/2103.08393.
Schneider, S., Baevski, A., Collobert, R., & Auli, M. "wav2vec: Unsupervised pre-training for speech
recognition." Interspeech, 2019. Available: https://doi.org/10.21437/Interspeech.20192826.
Bakiri, G. "Converting English text to speech: A machine learning approach." Ph.D. Thesis, Oregon State
University, 1991.
Baker, J. K., Makhoul, J., Schwartz, R., & Cole, R. Readings in Speech Recognition. Morgan Kaufmann, 2009.
Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, 1997.
Graves, A., Mohamed, A. R., & Hinton, G. "Speech recognition with deep recurrent neural networks." IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
Available: https://doi.org/10.1109/icassp.2013.6638947.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P.,
Sainath, T. N., et al. "Deep neural networks for acoustic modeling in speech recognition." IEEE Signal
Processing Magazine, vol. 29, no. 6, 2012, pp. 82–97. Available:
https://doi.org/10.1109/msp.2012.2205597.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., et al. "Tensorflow: Large-scale machine learning on heterogeneous distributed systems." arXiv
preprint, 2016. Available: https://doi.org/10.48550/arxiv.1603.04467.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S.,
Coates, A., & Ng, A. Y. "Deepspeech: Scaling up end-to-end speech recognition." arXiv preprint, 2014.
Available: https://doi.org/10.48550/arxiv.1412.5567.
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng,
Q., Chen, G., et al. "Deep speech 2: End-to-end speech recognition in English and Mandarin." arXiv
preprint, 2016. Available: https://doi.org/10.48550/arxiv.1512.02595.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I.
"Attention is all you need." Advances in Neural Information Processing Systems (NeurIPS), 2017.
Available: https://doi.org/10.48550/arxiv.1706.03762.
Dong, L., Xu, S., & Xu, B. "Speech-transformer: A no-recurrence sequence-to-sequence model for speech
recognition." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2018, pp. 5884–5888. Available: https://doi.org/10.1109/icassp.2018.8461910.
R. K., D. D., & G., A. M. "Development of the speech recognition system using artificial neural network."
Proceedings of the 5th International Conference on Computer and Information Technology (ICCIT02),
2002, pp. 118–122.
Hossain, M., Rahman, M., Prodhan, U. K., & Khan, M. "Implementation of back-propagation neural network for
isolated Bangla speech recognition." arXiv preprint, 2013. Available: https://arxiv.org/abs/1308.3785.
Alam, S., Sushmit, A., Abdullah, Z., Nakkhatra, S., Ansary, M. N., Hossen, S. M., Mehnaz, S. M., Reasat, T., &
Humayun, A. I. "Bengali common voice speech dataset for automatic speech recognition." 2022.
Available: https://arxiv.org/abs/2206.14053.
F. AI. "Wav2vec 2.0 large xlsr-53." 2021. Available: https://huggingface.co/facebook/wav2vec2-large-xlsr-53.
Accessed: 2024-09-19.
Aliero, A., Bashir, S., Aliyu, H., Tafida, A., Kangiwa, B., & Dankolo, N. "Systematic review on text normalization
techniques and its approach to non-standard words." International Journal of Computer Applications,
vol. 185, September 2023, pp. 975–8887.
Xu, Z., Dai, A. M., Kemp, J., & Metz, L. "Learning an adaptive learning rate schedule." arXiv preprint, 2019.
Available: https://arxiv.org/abs/1909.09712.
Zhang, J., Tianxing, H., Sra, S., & Jadbabaie, A. "Analysis of gradient clipping and adaptive scaling with a relaxed
smoothness condition." May 2019.
Chung, R., & Mak, B. "On-the-fly data augmentation for text-to-speech style transfer." 2021 IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 634–641.
Yang, L., Li, Y., Wang, J., & Tang, Z. "Post text processing of Chinese speech recognition based on bidirectional
LSTM networks and CRF." Electronics, vol. 8, October 2019, p. 1248.
Kim, S. "Exploring media literacy: Enhancing English oral proficiency and autonomy using media technology."
Studies in English Education, vol. 23, July 2018.
Park, Y., Patwardhan, S., Visweswariah, K., & Gates, S. "An empirical analysis of word error rate and keyword
error rate." September 2008, pp. 2070–2073.
Sinha, Abhijit, Mittul Singh, Sudarsana Reddy Kadiri, Mikko Kurimo, and Hemant Kumar Kathania. 2024.
“Effect of Speech Modification on Wav2Vec2 Models for Children Speech Recognition.” Pp. 1–5 in
2024 International Conference on Signal Processing and Communications (SPCOM). Bangalore, India:
IEEE.
Vásquez-Correa, Juan Camilo, and Aitor Álvarez Muniain. 2023. “Novel Speech Recognition Systems Applied
to Forensics within Child Exploitation: Wav2vec2.0 vs. Whisper.” Sensors 23(4):1843. doi:
10.3390/s23041843.
Gupta, S., Motepalli, K. S. S., Kumar, R., Narasinga, V., Mirishkar, S. G., & Vuppala, A. K. "Enhancing Language
Identification in Indian Context Through Exploiting Learned Features with Wav2Vec2.0." Proceedings
of the International Conference on Speech and Computer. Springer, 2023, pp. 503–512.
Biographies
Sk Muktadir Hossain is a passionate Data Scientist, Data Analyst, and Machine Learning enthusiast committed
to leveraging data and computational technologies for impactful solutions. His research focuses on machine
learning, data analytics, and computer vision, emphasizing data-driven frameworks to optimize decision-making
and enhance visual intelligence. Adept at predictive modelling and advanced image analysis, Muktadir addresses
critical challenges in automation, healthcare, and smart cities. A lifelong learner, he is driven by the
transformative potential of data and computer vision to reshape industries and society.
MD Rahat Rihan is a Machine Learning and NLP enthusiast dedicated to harnessing data and computational
technologies to drive impactful solutions. His research centers on machine learning, data analytics, and
computer vision, with a strong focus on developing data-driven frameworks to improve decision-making and
advance visual intelligence.
Ahmed Imtiaz is a passionate researcher in Natural Language Processing (NLP), Computer Vision and Pattern
Recognition (CVPR), and Data Analysis. He works fundamentally on creating new solutions by bringing
advanced technologies from these fields together. Specializing in text manipulation, vision understanding and
analytics, Ahmed hopes to make a significant impact solving the real-world problems spanning across a
multitude of fields.
Pritam Khan Boni (late) was a dedicated and accomplished researcher specializing in algorithms, data mining,
artificial intelligence, and pattern recognition. He was associated with the American International University-
Bangladesh (AIUB), where he made significant contributions to the scientific community. His notable works
include research on heart attack risk prediction and mobile robot path planning, reflecting his innovative
approach to solving critical problems. His academic contributions are recognized with 56 citations, an h-index
of 4, and an i10-index of 2, underscoring the impact of his work. Despite his untimely passing, his legacy lives
on, inspiring current and future researchers to explore and innovate in these domains.
Dipta Justin Gomes is a dedicated academic and researcher specializing in Machine Learning, Computer Vision,
Algorithms, Image Processing, and Natural Language Processing. He is currently pursuing a Doctor of Philosophy
(Ph.D.) in Computer Science and Engineering at Bangladesh University of Engineering and Technology (BUET),
focusing on non-planar graphs and advanced graph algorithms. Mr. Gomes earned his Master of Science (M.Sc.)
in Computer Science from the University of Ulster, Belfast, Northern Ireland, in January 2023. Prior to this, he
completed another M.Sc. in Computer Science, specializing in Intelligent Systems, at American International
University-Bangladesh (AIUB), where he graduated summa cum laude for achieving a CGPA above 3.95. He also
holds a Bachelor of Science in Computer Information Systems from AIUB, completed in December 2017. His
research encompasses deep learning algorithms, convolutional neural networks, scheduling algorithms, graceful
labeling of graphs, machine learning for predictive analytics, image processing optimizations, data mining
algorithms, underwater image processing, and natural language processing. He has presented his work at various
international conferences, including the International Conference on Computing Advancements (ICCA) and the
International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST). In his professional
career, Mr. Gomes has served as a Lecturer in the Department of Computer Science at AIUB since September
2019. Before this role, he was a Lecturer in the ICT Department at Notre Dame College, Dhaka, from February
2018 to August 2019. Throughout his academic journey, Mr. Gomes has shown excellence, receiving Dean's List
Honors during his undergraduate studies and the Daily Star Award for achieving more than six A's in his GCSE
O Levels in 2013. He also secured the 2nd Runner-Up position at the National Hackathon 2016, organized by the
ICT Ministry of Bangladesh. His fields of interest include Graph Theory, Computation Learning, Computer
Vision, and Algorithms. Mr. Gomes continues to contribute to these areas through his research, teaching, and
active participation in academic conferences and workshops.