Proceedings of the International Conference on Industrial Engineering and Operations Management,
7th Bangladesh Conference on Industrial Engineering and Operations Management
December 21-22, 2024
Publisher: IEOM Society International, USA DOI: 10.46254/BA07.20240161
Published: December 21, 2024
Enhancing Bangla Local Speech-to-Text Conversion Using
Fine-Tuning Wav2vec 2.0 with OpenSLR and Self-
Compiled Datasets Through Transfer Learning
Sk Muktadir Hossain, Md Rahat Rihan and Ahmed Imtiaz
Department of Computer Science and Engineering
American International University-Bangladesh (AIUB)
Dhaka, Bangladesh
[email protected], [email protected], [email protected]

Pritam Khan Boni, Dipta Justin Gomes
Lecturer, Department of Computer Science
American International University-Bangladesh (AIUB)
Dhaka, Bangladesh
[email protected], [email protected]

Abstract
This paper presents an improved method for converting standard and local (dialectal) Bangla speech to text. The wav2vec 2.0 model has been fine-tuned using additional, self-compiled datasets collected alongside OpenSLR data. Our findings show gains in transcription accuracy of as much as eleven percent, which is notable given the low-resource setting, demonstrating the merits of transfer learning and fine-tuning. The research aims to expand the knowledge base concerning the use of novel deep learning algorithms for small languages in the
field of speech technology. The evaluation metrics included Word Error Rate (WER) and Character Error Rate
(CER), with the fine-tuned model achieving an overall WER of 11.27% and CER of 6.03%. Comparative analysis
with previous work shows a significant improvement from baseline models, highlighting the efficacy of the
wav2vec 2.0 model in leveraging large and diverse datasets. The experimental setup was supported by a cluster
computing environment with NVIDIA CUDA-compatible GPUs, underscoring the computational resources
required for effective Automatic Speech Recognition (ASR) model training. The results demonstrate substantial
advancements in ASR performance for Bengali, with the fine-tuned model outperforming previous benchmarks
and showcasing the benefits of self-supervised learning approaches.
Keywords
Bangla Speech Recognition, wav2vec 2.0, Transfer Learning, Speech Technology, Automatic Speech
Recognition (ASR).
1. Introduction
With the emergence of Artificial Intelligence, various facets of human life have improved, creating scope for further exploration and research. Within this landscape, speech recognition, covering speech-to-text (STT) and voice recognition, is a well-established field of research addressing important problems in daily life. STT is an essential domain of research with a huge impact on real life, including aiding hearing-impaired people in following conversations, detecting child abuse (Vásquez-Correa and Álvarez Muniain 2023), transcribing speeches into writing, and powering voice-controlled interfaces. Although there has been active development of STT systems for large languages like English and Mandarin (Zhang, Haddow and Sennrich 2022; Li et al. 2023), a lot of work remains to be done for many Low Resource Languages (LRLs) (Sinha et al. 2024), including Bangla (Akther and Debnath 2022). More than 290 million people
speak Bangla worldwide (Sadhu et al. 2021), especially in Bangladesh and West Bengal of India, but current state-of-the-art Automatic Speech Recognition (ASR) systems for Bangla lack the maturity available for widely spoken languages. Specifically, the latest innovations within deep learning, where transformers make up a significant part of ASR models, have demonstrated considerable potential for real-world applications (Vásquez-Correa and Álvarez Muniain 2023). Among these models, one of the most successful is wav2vec 2.0 (Park et al. 2008; Sinha et al. 2024). Wav2vec is trained using self-supervision and, as reported by its developers at Facebook's AI lab, has recently shown the highest accuracy in various speech recognition tasks (Schneider et al. 2019). Different parts of the country speak various local languages, but in nearly all circumstances Bangla is the speakers' first or second language. The Chittagonian language is particularly important, as it is spoken in a key tourist destination of Bangladesh. While these languages have considerable cultural and regional significance, they remain under-researched. In this study, we try to improve on the methods that were used for other languages spoken in Bangladesh, with a focus on Chittagong. These enhancements should make it easier to build tools for natural language processing, text translation, and speech recognition. This research thus also fills a critical gap in efforts to increase access and communication for speakers of local languages, both in the development of tools and resources for technological interfaces and in the field of travel and tourism. The main contributions of this paper are
listed below:
• Dialect-Specific Dataset Creation: Development of a dedicated Chittagong language dataset to support future
research in speech recognition and natural language processing for low-resource languages.
• Improved ASR for Chittagong Dialect: Enhancement of a state-of-the-art Automatic Speech Recognition (ASR) system, particularly Wav2vec 2.0, to better handle the dialectal peculiarities of the Chittagonian language and improve transcription accuracy.
• Framework for Low-Resource Languages: Proposal of a comprehensive framework for managing and
interpreting the Chittagong dialect within the broader context of Bangla, offering a model for other regional
languages in Bangladesh.
• Technological Empowerment of Regional Languages: Contribution towards promoting regional languages through speech technology, increasing accessibility and technological engagement for speakers of local languages, especially in the context of tourism and regional communication.
1.1 Objectives
• Develop a dialect-specific dataset for Chittagonian speech.
• Improve ASR performance for Bangla and its dialects using Wav2vec 2.0.
• Provide a framework for low-resource language ASR development.
• Promote regional languages through advanced speech technology.
2. Literature Review
Speech-to-text systems have evolved from primitive rule-based methods to more advanced machine-learning models. Early models made extensive use of phonetic dictionaries and handcrafted rules, which were time-consuming to build and not very reliable (Bakiri 1991). The introduction of statistical models like Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) brought considerable development and allowed for the building of more precise and large-scale speech recognition systems (Baker et al. 2009; Jelinek 1997).
In the recent past, deep learning has ushered in the next era of ASR with the introduction of neural network-based architectures (Graves, Mohamed, and Hinton 2013; Hinton et al. 2012). Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been widely employed to capture the temporal dependencies of speech signals (Abadi et al. 2016; Hannun et al. 2014). One of the most famous examples is DeepSpeech, which employs RNNs with Connectionist Temporal Classification (CTC) for end-to-end speech recognition (Amodei et al. 2016). Transformer-based models have taken the field to a new level by enhancing parallelization and improving the modelling of long-range dependencies (Vaswani et al. 2017; Dong, Xu, and Xu 2018). The wav2vec 2.0 model, which uses a transformer architecture for self-supervised learning, has shown remarkable improvements in ASR systems (Schneider et al. 2019). However, the majority of research and development has primarily targeted high-resource languages rather than low-resource languages such as Bangla (Gupta et al. 2023).
Sinha et al. (2024) evaluated speech modification methods for Wav2Vec2 models to study how children's speech, and modifications to it, affect recognition, using the PF-Star and CMU Kids datasets. Their study found that the Wav2Vec2 ASR model underperforms for children aged below ten years, and that modifying the speech with other methods performs best in several scenarios. The researchers examined various age groups and highlighted their findings accordingly. In conclusion, the authors also indicated
that PF-Star gives better results than the CMU Kids dataset. In the research by Vásquez-Correa and Álvarez Muniain (2023), the authors examined how Law Enforcement Agencies (LEAs) investigating online child exploitation use audio material to detect keywords associated with child abuse. The researchers proposed models based on Wav2Vec2 and Whisper to enhance the accuracy of keyword detection. The models perform well under various scenarios, yielding promising word error rates of 11% and 25%; the accuracies depend heavily on the language of the speech, and the two models achieve true-positive rates of 82% and 98%, respectively. The findings also suggest that a federated learning approach is far better than centralized learning with neural networks.
2.1 Current Bangla STT System
A limited amount of work on Bangla STT has been done in comparison to other high-resource languages. Initial studies mainly employed HMMs and GMMs, as exemplified by Roy et al. (R., D., and G. 2002), who created a Bangla ASR system with these models. Recent studies have incorporated deep learning approaches. Hossain et al. (2013) attempted to use RNNs for Bangla speech recognition with moderate success, although they found the data shortage a significant issue. The availability of datasets such as the Bengali Common Voice corpus by Mozilla (Alam et al. 2022) and the OpenSLR Bengali dataset (Alam et al. 2022) made further advanced research possible. However, these datasets alone are not enough for training very accurate models, which makes it necessary to gather and incorporate more data.
The paper by Samiul et al. details the Bengali Common Voice Corpus v9.0, comprising 231,120 samples collected from 19,817 contributors and totaling 399 hours of speech recordings. Of these, 56 hours (14%) have been validated. Each audio clip comes with a sentence annotation and additional metadata, including upvotes, downvotes, age, gender, and accent. The paper compares the Common Voice Bengali Corpus with other Bengali speech datasets; it stands out for its larger size (399 hours) and greater feature diversity compared to datasets like OpenSLR. In feature analysis, Geneva and SpeechVGG features show that Common Voice exhibits more diverse characteristics, with significant differences in sound levels, voiced segments, and pitch compared to OpenSLR. In terms of benchmarking, the dataset is evaluated in ASR experiments using the Kaldi (HMM-GMM) model, which achieved a WER of 24.4. A comparison with a Wav2vec2 model trained on the OpenSLR dataset showed a WER of 39.291 and a CER of 13.856.
3. Methodology
3.1 Data Collection
The publicly available OpenSLR dataset, which contains a substantial amount of Bangla speech data, was selected for this research. Additionally, our study utilised several self-compiled Bangla speech collections totalling 684,510 audio files: 240,918 files of Standard Bangla, 247,565 files of the Chittagonian local dialect, and 196,027 files of the Sylheti local dialect, each stored as FLAC audio files.
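For illustration, a simple manifest pairing each audio file with its transcript could be assembled as in the Python sketch below; the directory layout and the convention of a same-named .txt transcript per FLAC file are assumptions, not details given in the paper.

# Sketch for assembling a training manifest from the collected FLAC files.
# Directory names and the per-file .txt transcript convention are hypothetical.
import csv
from pathlib import Path

DIALECT_DIRS = {  # hypothetical local paths
    "standard": Path("data/standard_bangla"),
    "chittagonian": Path("data/chittagong_local"),
    "sylheti": Path("data/sylhet_local"),
}

def build_manifest(out_path: str = "manifest.csv") -> None:
    """Write (audio_path, transcript, dialect) rows for every transcribed FLAC file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_path", "transcript", "dialect"])
        for dialect, root in DIALECT_DIRS.items():
            for flac in sorted(root.rglob("*.flac")):
                txt = flac.with_suffix(".txt")
                if txt.exists():
                    writer.writerow([str(flac), txt.read_text(encoding="utf-8").strip(), dialect])

if __name__ == "__main__":
    build_manifest()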
3.2 Model training
1) OpenSLR Dataset: A larger dataset available from OpenSLR is the Bengali Common Voice dataset, which provides hundreds of hours of Bangla speech data from volunteers. This results in a broad spectrum of accents, dialects, and speaking styles that can be accounted for when using this dataset to train STT models (Alam et al. 2022). To further enrich the OpenSLR dataset, more audio has been sourced from the following areas:
• Television Broadcasts: We obtained 77K segments from news programs, discussion shows, and interviews.
• Online Videos: Videos were sourced from YouTube and similar platforms, featuring educational lessons, lectures, and public speeches.
• Public Speeches: Recordings of political speeches, cultural events, and public announcements were also collected.
2) Manual Transcription: Audio data were logged within the same recording session, and all collected audio data were manually transcribed to ensure high-quality labels. For training, we used Hugging Face's implementation of the wav2vec 2.0 model, specifically facebook/wav2vec2-large-xlsr-53 (F. AI, 2021), which was pre-trained on multilingual voice data. 56 epochs were run, resulting in approximately 50 hours of training utilising an NVIDIA RTX 4090 GPU.
3) Training Plateau: Around epoch 56, a Word Error Rate (WER) of 0.11 was recorded, with further training exhibiting diminishing returns.
4) Hyperparameter Details (used in the fine-tuning sketch after this list):
• Learning Rate: 1×10^-5
• Batch Size: 32
• Epochs: 56
• Dropout Rate: 0.1
• Gradient Clipping Threshold: 1.0
• Optimizer: AdamW with weight decay regularization.
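For illustration, the configuration above could be expressed with the Hugging Face Transformers API as in the following Python sketch. The checkpoint name comes from the paper, but vocab.json, the dataset preparation, the CTC data collator, and the weight-decay value are assumptions or placeholders rather than details reported in this work.

# Minimal fine-tuning sketch (Hugging Face Transformers). Assumed input:
# a Bangla character vocabulary in vocab.json; dataset preparation and the
# CTC padding data collator are omitted here.
from transformers import (TrainingArguments, Wav2Vec2CTCTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC, Wav2Vec2Processor)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0, do_normalize=True,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",   # checkpoint named in the paper
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
    hidden_dropout=0.1,                  # dropout rate reported in the paper
    attention_dropout=0.1,
)
model.freeze_feature_encoder()           # common practice when fine-tuning wav2vec 2.0

args = TrainingArguments(
    output_dir="wav2vec2-bangla",
    per_device_train_batch_size=32,      # batch size reported in the paper
    num_train_epochs=56,                 # epochs reported in the paper
    learning_rate=1e-5,                  # learning rate reported in the paper
    max_grad_norm=1.0,                   # gradient clipping threshold reported in the paper
    weight_decay=0.01,                   # AdamW weight decay (value assumed)
    fp16=True,
)

# A transformers.Trainer would then combine the model, the arguments above, the
# prepared train/validation datasets, and a CTC padding data collator:
# Trainer(model=model, args=args, data_collator=..., train_dataset=...,
#         eval_dataset=..., tokenizer=processor.feature_extractor).train()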
3.3 Data Preprocessing
Pre-processing steps are crucial to preparing the dataset for training and include:
1) Normalization: As common pre-processing steps, text normalisation included the conversion of all text to
lowercase, the removal of punctuation, and the standardization of misspellings (Aliero et al. 2023).
2) Noise Reduction: Noise reduction algorithms were applied to the audio files to improve speech quality.
3) Segmentation: Long audio files were clipped into shorter, realistic segments, most lasting between five and thirty seconds (a small sketch of these steps follows this list).
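A minimal Python sketch of these pre-processing steps is given below; the punctuation set, the 16 kHz target sampling rate, and the use of librosa are assumptions rather than details specified in the paper.

# Minimal sketch of the pre-processing steps above (assumed parameter choices).
import re
import librosa

PUNCT = re.compile(r"[।॥,.!?;:\"'()\-]")  # Bangla danda plus common punctuation (assumed set)

def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (step 1)."""
    text = text.lower()
    text = PUNCT.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def segment_audio(path: str, sr: int = 16000, max_sec: float = 30.0):
    """Load an audio file and yield chunks no longer than max_sec seconds (step 3).
    Noise reduction (step 2) could be applied to each chunk, e.g. with the
    noisereduce package (assumed tooling)."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    step = int(max_sec * sr)
    for start in range(0, len(audio), step):
        yield audio[start:start + step]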
3.4 Model Selection and Fine-Tuning
The wav2vec 2.0 model was chosen because it has shown strong performance across various ASR tasks. The model employs a transformer-based structure and self-supervised training to learn speech representations directly from the raw waveform (Schneider et al. 2019). The model used in this research is illustrated in Figure 1.
3.5 Fine-Tuning Process
The pre-trained wav2vec 2.0 model was fine-tuned on the combined dataset (OpenSLR plus the self-compiled collections). The process involved the following:
3.6 Learning Rate Scheduling
To increase the robustness of the training process and guarantee convergence, adaptive learning rate scheduling
was used (Xu et al. 2019).
1) Gradient Clipping: Gradient clipping was used to cap the magnitude of the gradients and avoid the large updates that can cause fluctuations in the training process (Zhang et al. 2019).
2) Data Augmentation: Signal processing techniques such as time stretching and pitch scaling were applied, and background noise was added, to increase the amount of training data and enhance model resilience (Chung and Mak 2021); an illustrative sketch follows.
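An illustrative Python sketch of such augmentation is shown below; the stretch and pitch ranges and the noise level are assumed values, and librosa is used as one possible implementation.

# Illustrative data augmentation sketch (time stretching, pitch scaling,
# additive background noise); parameter ranges are assumptions.
import numpy as np
import librosa

def augment(audio: np.ndarray, sr: int = 16000, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    # Time stretching: randomly speed up or slow down by up to ~10%.
    audio = librosa.effects.time_stretch(audio, rate=float(rng.uniform(0.9, 1.1)))
    # Pitch scaling: shift by up to +/- 2 semitones.
    audio = librosa.effects.pitch_shift(audio, sr=sr, n_steps=float(rng.uniform(-2.0, 2.0)))
    # Background noise: add low-level Gaussian noise.
    audio = audio + 0.005 * rng.standard_normal(len(audio))
    return audio.astype(np.float32)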
3.7 Reliability and Validity
To ensure the internal and external validity of the research, the following steps have been carried out:
Dataset Validation: After data cleaning, the audio samples and their corresponding transcriptions were divided randomly into three sets with the aim of reducing overfitting and cross-checking the results: a training set, a validation set, and a test set.
Cross-Validation: To test the model across different subsets of the data, k-fold cross-validation was employed to assess the reliability of the developed model (Yang et al. 2019); a small splitting sketch is given at the end of this subsection.
Inter-Rater Reliability: A portion of the transcripts was coded by more than two coders so that inter-rater agreement could be calculated and coding fidelity maintained (Kim 2018).
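The dataset partitioning and k-fold validation described above could be set up as in the following sketch; the 80/10/10 split and k = 5 are assumptions, since the paper does not report the exact proportions or the number of folds.

# Sketch of dataset partitioning and k-fold cross-validation (assumed ratios).
from sklearn.model_selection import KFold, train_test_split

def make_splits(items, seed: int = 42):
    """Random train/validation/test split (assumed 80/10/10)."""
    train, rest = train_test_split(items, test_size=0.2, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, val, test

def kfold_indices(items, k: int = 5, seed: int = 42):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    yield from kf.split(items)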
4. Experimental Setup
4.1 Training Environment
The experiments were performed on a cluster computing setup with access to several NVIDIA CUDA-compatible graphics processing units. The GPUs accounted for a large proportion of the computational effort required to train the models.
Figure 1. Bengali Language Model Design
4.2 Evaluation metrics
Word Error Rate (WER) and Character Error Rate (CER) were utilized as the two primary measures for comparing the effectiveness of different models. WER measures the proportion of word-level errors made by the model, while CER measures the proportion of character-level errors. To evaluate the model's overall performance, we employed:
The formula for Word Error Rate (WER) (Park et al. 2008) is given below, where insertions are extra words in the predicted transcript that are not present in the reference, deletions are words present in the reference but missing from the predicted transcript, and substitutions are incorrectly recognized words in the predicted transcript:
WER = (Insertions + Deletions + Substitutions) / (Number of words in the reference)    (1)
Similarly, the formula for Character Error Rate (CER) is:
CER = (Insertions + Deletions + Substitutions) / (Number of characters in the reference)    (2)
WER ranges from 0 to 1 (or 0% to 100%), with 0 indicating a perfect transcription. The overall performance of
WER and CER is presented in Table 1.
Table 1. WER and CER for overall performance

All Language            WER %    CER %
Overall Performance     11.27    6.03
The overall transcription performance in terms of Word Error Rate (WER) and Character Error Rate (CER) is summarized in Table 1. A WER of 11.27% and a CER of 6.03% indicate strong model accuracy at the word and character levels. Overall, these results demonstrate the model's effectiveness at avoiding transcription errors across all the language varieties used.
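As an illustration, these metrics can be computed with the jiwer package, one possible tool; the paper does not state which implementation was used.

# Illustrative WER/CER computation with the jiwer package (assumed tool choice).
import jiwer

reference = "আমি ভাত খাই"      # ground-truth transcript ("I eat rice")
hypothesis = "আমি ভাত খায়"     # hypothetical model output with one wrong word

wer = jiwer.wer(reference, hypothesis)  # (S + D + I) / number of reference words, Eq. (1)
cer = jiwer.cer(reference, hypothesis)  # same ratio at the character level, Eq. (2)
print(f"WER = {wer:.2%}, CER = {cer:.2%}")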
5. Results
5.1 Baseline Performance
To determine the performance level before fine-tuning on the target language, the experiment used the pre-trained wav2vec 2.0 model without fine-tuning. On the evaluation set, this model recorded a WER of 25.24%. A comparison of the fine-tuned model with the non-fine-tuned wav2vec 2.0 is provided in Table 2.
Table 2. Analysis of Bangla local language works

Research Study            Model Used          Dataset Used                 WER (%)  CER (%)  Key Results/Notes
Wav2vec 2.0 model         Wav2vec 2.0         OpenSLR, self-compiled       25.24    13.29    Establishes the performance limitations of the model
(without fine-tuning)     (not fine-tuned)    cross-validated datasets                       before domain-specific adaptation, emphasizing the
                                              (684,510 files)                                necessity of fine-tuning to improve accuracy on
                                                                                             Bangla language datasets.
Proposed model            Wav2vec 2.0         OpenSLR, self-compiled       11.27    6.03     Significant improvement with larger datasets and
                          (fine-tuned)        cross-validated datasets                       advanced self-supervised learning.
                                              (684,510 files)
Table 2 provides a comparison of the WER and CER metrics of the baseline model and the proposed model fine-tuned using wav2vec 2.0. It highlights the significant improvements achieved by the proposed approach, especially in the WER and CER percentages.
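For completeness, a minimal inference sketch is shown below; the local checkpoint directory wav2vec2-bangla and the example audio file are placeholders, not artifacts released with this work.

# Minimal inference sketch; "wav2vec2-bangla" and "example.flac" are assumed names.
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("wav2vec2-bangla")
model = Wav2Vec2ForCTC.from_pretrained("wav2vec2-bangla").eval()

audio, _ = librosa.load("example.flac", sr=16000, mono=True)  # resample to 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits      # (batch, time, vocab)
pred_ids = torch.argmax(logits, dim=-1)             # greedy CTC decoding
print(processor.batch_decode(pred_ids)[0])          # predicted transcript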
5.2 Subsequent Contingent Analysis and Graphical Representation
The performance of the fine-tuned model was tested on subsets of the dataset covering different accents and dialects to assess its versatility across domains. The results are summarized as follows:
Figure 2. Bengali Language Model Performance
Figure 2 visualizes the performance comparison of the proposed model against baseline systems in terms of WER
and CER. It shows a clear reduction in error rates for the proposed model, emphasizing its efficacy across standard
Bangla and dialect-specific datasets.
Table 3. WER and CER for overall performance

All Language            WER %    CER %
Overall Performance     11.27    6.03
The system-level performance across all the languages is shown in Table 3. The Word Error Rate (WER) is 11.27% and the Character Error Rate (CER) is 6.03%. The efficiency of the proposed approach in recognizing input (minimizing recognition errors) is clearly evident from these results.
Dialect-Specific Performance:
Table 4. WER and CER for different dialects

Language          WER %    CER %
Standard Bangla    7.43     3.25
Chittagonian      12.00     7.54
Sylheti           14.38     7.29
The performance metrics were plotted to visually compare the baseline and fine-tuned model performance. Table 4 summarizes the WER and CER metrics for the different dialects (Standard Bangla, Chittagonian, and Sylheti). It shows how the proposed model performs across dialects, with Standard Bangla showing the lowest error rates and Sylheti presenting greater challenges.
6. Discussion
The promise of transfer learning for low-resource languages like Bangla is demonstrated by the notable decrease
in WER and CER. Our method makes use of the wav2vec 2.0 model’s advantages and emphasizes the value of
diverse, high-quality datasets for building resilient ASR systems. The increase in accuracy shows that optimizing
a strong model can have significant advantages even with little data.
6.1 Difficulties
Although our results are encouraging, there were a number of difficulties and restrictions. Data Quality: It took a lot
of work and time to ensure that the gathered data had high-quality transcriptions. Bangla has a wide variety of
accents and dialects, which makes it difficult to achieve consistent correctness across all variants. Computing
Capabilities: Large models like wav2vec 2.0 require substantial computational resources for training and fine-
tuning, which may not be available to all researchers.
6.2 Areas for Further Investigation
Potential future research endeavours could investigate various avenues to enhance Bangla speech-to-text (STT)
systems: Expanding the training dataset in terms of both size and diversity has the potential to improve the
performance of the model, especially for accents and dialects that are not well-represented. Implementing
advanced data augmentation techniques has the potential to enhance the robustness of the model to different noise
conditions and speaking styles. By incorporating language models trained on extensive Bangla text corpora, it is
possible to rectify common transcription errors and enhance overall accuracy.
This study emphasises the practicality and efficiency of refining the wav2vec 2.0 model for converting Bangla
speech into written text. Through the strategic utilization of both existing and recently acquired speech data, we
successfully attained notable enhancements in the precision of transcriptions. The significance of employing
sophisticated deep learning methods and extensive datasets to construct resilient ASR systems for languages with
limited resources is emphasised by our research.
7. Conclusion
This research highlights the successful application of the wav2vec 2.0 model in developing an effective Bangla
local speech-to-text system. The improved accuracy achieved through fine-tuning and the use of additional data
demonstrates the potential for advancements in ASR for low-resource languages. Continued efforts in this
direction could bridge the gap and bring more languages to the forefront of speech technology.
In this study, the use of the Wav2vec 2.0 model and transfer learning from pre-trained models have been shown to boost Bangla speech-to-text (STT) accuracy. By training the model on more than 680,000 audio files and optimising performance, we decreased the word error rate (WER) to 11.27% and the character error rate (CER) to 6.03%. That is a considerable step forward over earlier models, which suffered from small datasets and sparse representation of dialects. Challenges faced throughout this project, owing to variances between spoken and written words, made us realise how crucial a broad dataset is for languages like Bangla, where dialects vary among regions (such as Chittagonian and Sylheti). To validate this performance increase, we developed new dialect-targeted datasets for both dialects.
Nevertheless, future research must address the limits imposed by data quality and the computational requirements of training very large models. In future studies, one strategy to make Bangla STT more resilient is to increase the dataset size further and employ better augmentation techniques. In addition, transcription accuracy could be further enhanced by adding language models learnt from large-scale Bangla text corpora. While we have made significant progress in the field of low-resource language processing, open difficulties are still evident, and more scalable, accessible models that can truly cater to regional dialects are needed to fill this gap. This research paper is a stepping stone for Bangla voice recognition, and additional research could lead to applications in technology, accessibility, and communication.
Acknowledgements
The authors dedicate this work in memoriam to Pritam Khan Boni, Lecturer, Department of Computer Science, American International University-Bangladesh (AIUB). We want to honor his vision and efforts in promoting local languages through the use of technology; this research is inspired by him. His passion for this area of research kept us motivated during this journey, and his invaluable feedback paved the path for this work. We would like to thank Dipta Justin Gomes, Lecturer, Department of Computer Science, AIUB, for his assistance and encouragement in completing this study. We are also thankful to our peers and colleagues for their help and valuable discussions.
References
Zhang, B., Haddow, B., & Sennrich, R. "Revisiting end-to-end speech-to-text translation from scratch."
Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning
Research, vol. 162, PMLR, 17–23 July 2022, pp. 26193–26205. Available:
https://proceedings.mlr.press/v162/zhang22i.html.
Li, T., Hu, C., Cong, J., Zhu, X., Li, J., Tian, Q., Wang, Y., & Xie, L. "Diclet-tts: Diffusion model-based cross-
lingual emotion transfer for text-to-speech — a study between English and Mandarin." IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 31, 2023, pp. 3418–3430.
Akther, A., & Debnath, R. "Automated speech-to-text conversion systems in Bangla language: A systematic
literature review." Khulna University Studies, 2022, pp. 566–583.
Sadhu, S., He, D., Huang, C.-W., Mallidi, S. H., Wu, M., Rastrow, A., Stolcke, A., Droppo, J., & Maas, R.
"Wav2vec-c: A self-supervised model for speech representation learning." 2021. Available:
https://arxiv.org/abs/2103.08393.
Schneider, S., Baevski, A., Collobert, R., & Auli, M. "wav2vec: Unsupervised pre-training for speech
recognition." Interspeech, 2019. Available: https://doi.org/10.21437/Interspeech.20192826.
Bakiri, G. "Converting English text to speech: A machine learning approach." Ph.D. Thesis, Oregon State
University, 1991.
Baker, J. K., Makhoul, J., Schwartz, R., & Cole, R. Readings in Speech Recognition. Morgan Kaufmann, 2009.
Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, 1997.
Graves, A., Mohamed, A. R., & Hinton, G. "Speech recognition with deep recurrent neural networks." IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
Available: https://doi.org/10.1109/icassp.2013.6638947.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P.,
Sainath, T. N., et al. "Deep neural networks for acoustic modeling in speech recognition." IEEE Signal
Processing Magazine, vol. 29, no. 6, 2012, pp. 82–97. Available:
https://doi.org/10.1109/msp.2012.2205597.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., et al. "Tensorflow: Large-scale machine learning on heterogeneous distributed systems." arXiv
preprint, 2016. Available: https://doi.org/10.48550/arxiv.1603.04467.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S.,
Coates, A., & Ng, A. Y. "Deepspeech: Scaling up end-to-end speech recognition." arXiv preprint, 2014.
Available: https://doi.org/10.48550/arxiv.1412.5567.
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng,
Q., Chen, G., et al. "Deep speech 2: End-to-end speech recognition in English and Mandarin." arXiv
preprint, 2016. Available: https://doi.org/10.48550/arxiv.1512.02595.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I.
"Attention is all you need." Advances in Neural Information Processing Systems (NeurIPS), 2017.
Available: https://doi.org/10.48550/arxiv.1706.03762.
Dong, L., Xu, S., & Xu, B. "Speech-transformer: A no-recurrence sequence-to-sequence model for speech
recognition." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2018, pp. 5884–5888. Available: https://doi.org/10.1109/icassp.2018.8461910.
R. K., D. D., & G., A. M. "Development of the speech recognition system using artificial neural network."
Proceedings of the 5th International Conference on Computer and Information Technology (ICCIT02),
2002, pp. 118–122.
Hossain, M., Rahman, M., Prodhan, U. K., & Khan, M. "Implementation of back-propagation neural network for
isolated Bangla speech recognition." arXiv preprint, 2013. Available: https://arxiv.org/abs/1308.3785.
Alam, S., Sushmit, A., Abdullah, Z., Nakkhatra, S., Ansary, M. N., Hossen, S. M., Mehnaz, S. M., Reasat, T., &
Humayun, A. I. "Bengali common voice speech dataset for automatic speech recognition." 2022.
Available: https://arxiv.org/abs/2206.14053.
F. AI. "Wav2vec 2.0 large xlsr-53." 2021. Available: https://huggingface.co/facebook/wav2vec2-large-xlsr-53.
Accessed: 2024-09-19.
Aliero, A., Bashir, S., Aliyu, H., Tafida, A., Kangiwa, B., & Dankolo, N. "Systematic review on text normalization
techniques and its approach to non-standard words." International Journal of Computer Applications,
vol. 185, September 2023, pp. 975–8887.
Xu, Z., Dai, A. M., Kemp, J., & Metz, L. "Learning an adaptive learning rate schedule." arXiv preprint, 2019.
Available: https://arxiv.org/abs/1909.09712.
Zhang, J., Tianxing, H., Sra, S., & Jadbabaie, A. "Analysis of gradient clipping and adaptive scaling with a relaxed
smoothness condition." May 2019.
Chung, R., & Mak, B. "On-the-fly data augmentation for text-to-speech style transfer." 2021 IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 634–641.
Yang, L., Li, Y., Wang, J., & Tang, Z. "Post text processing of Chinese speech recognition based on bidirectional
LSTM networks and CRF." Electronics, vol. 8, October 2019, p. 1248.
Kim, S. "Exploring media literacy: Enhancing English oral proficiency and autonomy using media technology."
Studies in English Education, vol. 23, July 2018.
Park, Y., Patwardhan, S., Visweswariah, K., & Gates, S. "An empirical analysis of word error rate and keyword
error rate." September 2008, pp. 2070–2073.
Sinha, Abhijit, Mittul Singh, Sudarsana Reddy Kadiri, Mikko Kurimo, and Hemant Kumar Kathania. 2024.
“Effect of Speech Modification on Wav2Vec2 Models for Children Speech Recognition.” Pp. 1–5 in
2024 International Conference on Signal Processing and Communications (SPCOM). Bangalore, India:
IEEE.
Vásquez-Correa, Juan Camilo, and Aitor Álvarez Muniain. 2023. “Novel Speech Recognition Systems Applied
to Forensics within Child Exploitation: Wav2vec2.0 vs. Whisper.” Sensors 23(4):1843. doi:
10.3390/s23041843.
Gupta, S., Motepalli, K. S. S., Kumar, R., Narasinga, V., Mirishkar, S. G., & Vuppala, A. K. "Enhancing Language
Identification in Indian Context Through Exploiting Learned Features with Wav2Vec2.0." Proceedings
of the International Conference on Speech and Computer. Springer, 2023, pp. 503–512.
Biographies
Sk Muktadir Hossain is a passionate Data Scientist, Data Analyst, and Machine Learning enthusiast committed
to leveraging data and computational technologies for impactful solutions. His research focuses on machine
learning, data analytics, and computer vision, emphasizing data-driven frameworks to optimize decision-making
and enhance visual intelligence. Adept at predictive modelling and advanced image analysis, Muktadir addresses
critical challenges in automation, healthcare, and smart cities. A lifelong learner, he is driven by the
transformative potential of data and computer vision to reshape industries and society.
MD Rahat Rihan is a Machine Learning and NLP enthusiast dedicated to harnessing data and computational
technologies to drive impactful solutions. His research centers on machine learning, data analytics, and
computer vision, with a strong focus on developing data-driven frameworks to improve decision-making and
advance visual intelligence.
Ahmed Imtiaz is a passionate researcher in Natural Language Processing (NLP), Computer Vision and Pattern
Recognition (CVPR), and Data Analysis. He works fundamentally on creating new solutions by bringing
advanced technologies from these fields together. Specializing in text manipulation, vision understanding and
analytics, Ahmed hopes to make a significant impact solving the real-world problems spanning across a
multitude of fields.
Pritam Khan Boni (late) was a dedicated and accomplished researcher specializing in algorithms, data mining,
artificial intelligence, and pattern recognition. He was associated with the American International University-
Bangladesh (AIUB), where he made significant contributions to the scientific community. His notable works
include research on heart attack risk prediction and mobile robot path planning, reflecting his innovative
approach to solving critical problems. His academic contributions are recognized with 56 citations, an h-index
of 4, and an i10-index of 2, underscoring the impact of his work. Despite his untimely passing, his legacy lives
on, inspiring current and future researchers to explore and innovate in these domains.
Dipta Justin Gomes is a dedicated academic and researcher specializing in Machine Learning, Computer Vision,
Algorithms, Image Processing, and Natural Language Processing. He is currently pursuing a Doctor of Philosophy
(Ph.D.) in Computer Science and Engineering at Bangladesh University of Engineering and Technology (BUET),
focusing on non-planar graphs and advanced graph algorithms. Mr. Gomes earned his Master of Science (M.Sc.)
in Computer Science from the University of Ulster, Belfast, Northern Ireland, in January 2023. Prior to this, he
completed another M.Sc. in Computer Science, specializing in Intelligent Systems, at American International
University-Bangladesh (AIUB), where he graduated summa cum laude for achieving a CGPA above 3.95. He also
holds a Bachelor of Science in Computer Information Systems from AIUB, completed in December 2017. His
research encompasses deep learning algorithms, convolutional neural networks, scheduling algorithms, graceful
labeling of graphs, machine learning for predictive analytics, image processing optimizations, data mining
algorithms, underwater image processing, and natural language processing. He has presented his work at various
international conferences, including the International Conference on Computing Advancements (ICCA) and the
International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST). In his professional
career, Mr. Gomes has served as a Lecturer in the Department of Computer Science at AIUB since September
2019. Before this role, he was a Lecturer in the ICT Department at Notre Dame College, Dhaka, from February
2018 to August 2019. Throughout his academic journey, Mr. Gomes has shown excellence, receiving Dean's List
Honors during his undergraduate studies and the Daily Star Award for achieving more than six A's in his GCSE
O Levels in 2013. He also secured the 2nd Runner-Up position at the National Hackathon 2016, organized by the
ICT Ministry of Bangladesh. His fields of interest include Graph Theory, Computation Learning, Computer
Vision, and Algorithms. Mr. Gomes continues to contribute to these areas through his research, teaching, and
active participation in academic conferences and workshops.