
JOURNAL OF ELECTRICAL AND COMPUTER ENGINEERING RESEARCH

VOL. 4, NO. 1, 2024

Design of a Voice Recognition System Using Artificial Neural Network
Mayowa O. Daniel1,*, Ibukunoluwa A. Olajide2
1 Department of Computer Engineering, The Federal University of Technology, Akure, Nigeria
Email: [email protected]
2 Department of Electrical and Electronics Engineering, The Federal University of Technology, Akure, Nigeria
Email: [email protected]
* Corresponding author

Abstract: Voice recognition systems have gained significant prevalence in our everyday lives, encompassing a wide range of applications, from virtual assistants on smartphones to voice-controlled home automation systems. This research paper presents a comprehensive design and implementation of a voice recognition security system employing artificial neural networks. The system's training involved a dataset consisting of 900 audio samples collected from 10 distinct speakers, enabling the resulting model to accurately classify the speaker of a given audio sample. For the implementation of the voice recognition system, Python serves as the primary programming language. The system leverages the Keras library, which offers a high-level interface for constructing and training neural networks, with efficient computation facilitated by the TensorFlow back-end. Additionally, the Flask framework, a Python-based web framework, was utilized to create a user interface in the form of a web application for the voice recognition system. To effectively train the artificial neural network, the audio data undergoes preprocessing, involving the extraction of relevant features from the audio samples. Subsequently, during the preprocessing phase, the audio data is labelled, and the neural network is trained on this labelled dataset to learn the classification of different speakers. The trained model was rigorously tested on a set of previously unseen audio samples, yielding an impressive classification accuracy exceeding 96%. The finalized model will be integrated into the web application, enabling users to upload audio files and receive accurate predictions regarding the speaker's identity. This paper demonstrates the efficacy of artificial neural networks in the context of voice recognition systems, while also providing a practical framework for constructing such systems using readily available tools and libraries.

Keywords: ANN, Keras Library, Neural Networks, Security, Voice Recognition.

I. INTRODUCTION

Regarding interpersonal communication, speech stands out as the predominant, inherent, and highly efficient means. The process of automatic speech recognition (ASR) entails transforming an acoustic signal, captured via a microphone or telephone, into a sequence of words through the utilization of a specific algorithm that can be implemented as a system program [1]. This technique also encompasses the interpretation of spoken words by analyzing voice recordings to ascertain their intended meaning. Contemporary residential and commercial security measures commonly encompass diverse protective methods, such as passwords, user IDs, and PINs. Nonetheless, these systems lack robustness due to the susceptibility of PIN codes to hacking and the potential theft and replication of ID cards [2]. Consequently, the emergence of a novel technology called biometrics is expected to engender enhanced trust in security systems. Biometrics comprises various techniques employed for identifying individuals based on their distinctive physical and behavioural traits. Examples of such identifying features include fingerprints, voice patterns, facial characteristics, retinal and iris scans, signatures, hand geometry, and wrist veins [2].

Biometric technology operates by employing a user's unique physical attribute as the password or feature parameter. Human characteristics such as voice, face, and fingerprints are commonly utilized as feature parameters. The inherent absence of identical twins ensures user confidence and safety when employing voice recognition systems [2]. According to a survey conducted by Unisys, 32% of respondents favored voice recognition, 27% preferred fingerprints, 20% opted for facial scans, 12% favored hand geometries, and 10% expressed a preference for iris scans [3]. Given that the human voice embodies the most pervasive and instinctive form of human communication, voice recognition stands as a leading biometric technology. Consequently, voice recognition systems hold potential benefits for securing doors, vaults, confidential laboratories, and other restricted areas.

II. STATEMENT OF PROBLEM

Ensuring security is a paramount concern for both individuals and organizations, and with the continuous advancement of technology, various approaches have been employed to safeguard lives and assets. Traditional methods of protection, such as PINs and passwords, have long been relied upon as the foundation for securing properties. However, these conventional security measures are vulnerable to hacking and unauthorized access due to their inherent limitations and lack of robustness [4]. Furthermore, commonly used biometric technologies themselves are not immune to shortcomings. For instance, in the case of fingerprint recognition, the security of the system can be compromised if a user's finger is forcibly removed [5]. Similarly, facial recognition systems can be deceived by utilizing a user's photograph to bypass security measures [6]. Moreover, these methods often prove inadequate and incompatible for individuals with physical disabilities, posing additional challenges [7].


III. LITERATURE REVIEW

A. Speech Recognition Systems
Drenthen (2012) introduced a fundamental speech recognition system that encompasses several distinct stages, namely pre-processing, feature extraction, clustering, and classification [8]. In the pre-processing module, which operates on the input speech signals, enhancing the signal-to-noise ratio is crucial. Subsequently, in the second step, relevant features of the signal are extracted through an appropriate technique for feature extraction. The third step involves determining the centroid by employing the k-means algorithm on the feature vectors. Lastly, a pattern-matching technique is utilized to recognize the speech signal, with the matching score dependent on both the chosen algorithm and the size of the training database [9].

B. Evolution of Speech Recognition Systems
The history of speech recognition technology dates back to significant milestones. In 1784, a scholar in Vienna created the first Acoustic Mechanical Speech Machine. Later, in 1879, Thomas Edison invented the first dictation machine. Advancing further, Bell Laboratories developed a speech recognition system in 1952 capable of accurately recognizing spoken digits, albeit limited to the inventor's voice. Notably, in 1970, a scholar introduced the Harpy System, which exhibited impressive capabilities, recognizing over 1000 words, different pronunciations, and certain phrases. The 1980s witnessed further advancements with the introduction of the Hidden Markov Model (HMM) in speech recognition. This mathematical approach revolutionized the analysis of sound waves and paved the way for numerous breakthroughs. IBM Tangora, in 1986, harnessed the power of HMM and successfully predicted upcoming phonemes in speech. In 2006, the National Security Agency (NSA) adopted speech recognition systems to segment keywords in recorded speech, highlighting its relevance in security applications.
The subsequent surge in speech recognition technology occurred when leading IT companies such as Facebook, Google, Amazon, Microsoft, and Apple ventured into offering this functionality across various devices. Services like Google Home, Amazon Echo, and Apple Siri exemplify their commitment to developing voice assistants that provide accurate responses and replies. These top tech companies strive to enhance the accuracy and efficiency of voice assistants, driving the continuous evolution of speech recognition technology.

IV. METHODOLOGY

The purpose of the speaker recognition system is to identify individuals based on their unique voice patterns. The methodology involves extracting various features from the audio signal, such as prosodic and spectral features. These features are then used to develop a speaker voice model using machine learning algorithms. The speaker recognition system compares the input audio to a speaker database in order to identify the specific speaker. Fig. 1 provides a visual representation of the step-by-step process to achieve the desired outcome. The implementation stages include data acquisition, pre-processing, feature extraction, training and evaluation, and classification of the voice input.

Fig. 1. The block diagram of the Voice Recognition System

A. Data Acquisition
The data collection process involved obtaining audio recordings from ten different users using a mobile phone as the recording device. A total of 1000 audio samples were collected, consisting of spoken digits ranging from 0 to 9. The dataset was in waveform (.wav) format. To facilitate model training and evaluation, the data was split into training and testing subsets. The testing subset comprised 100 audio samples, while the training dataset contained 900 audio samples. Within the training dataset, a further split was made to create a validation set, with a ratio of 60:40 for training and validation respectively. This split ensured an appropriate distribution of data for model development and evaluation. Fig. 2 illustrates the allocation of the pre-processed data into the respective subsets, maintaining the appropriate proportions. Each of these groups (training, testing, and validation) was utilized during the modelling stage for training, testing, and validation purposes respectively.


Fig. 2. Splitting of the dataset

B. Feature Extraction
Various techniques can be employed to extract unique features from the audio dataset, including Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), Discrete Wavelet Transform (DWT), Perceptual Linear Prediction (PLP), and more. In the case of the dataset at hand, the technique chosen for feature extraction is Mel Frequency Cepstral Coefficients (MFCC). When extracting features from audio data, a range of characteristics can be considered, such as zero-crossing rate, energy, spectral roll-off, spectral flux, spectral entropy, chroma features, pitch, MFCC, spectral bandwidth, spectral centroid, and so on. These features provide valuable information about the properties and patterns within the audio signals. In the specific case of the dataset being analyzed, the MFCC feature extraction technique is employed, which focuses on capturing the Mel-frequency spectrum and cepstral coefficients of the audio signals.
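To make the feature-extraction step concrete, the sketch below computes MFCCs for a single recording using the librosa library. The use of librosa, the 40-coefficient setting, and the averaging over time frames are illustrative assumptions rather than the authors' exact configuration.

# Illustrative MFCC extraction for one .wav file (librosa assumed; parameters are hypothetical).
import numpy as np
import librosa

signal, sr = librosa.load("speaker01_digit3.wav", sr=16000)   # hypothetical file name and sample rate
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)       # 40 coefficients per frame (assumed)
mfcc_vector = np.mean(mfcc, axis=1)                           # average over frames for a fixed-length vector
print(mfcc_vector.shape)                                      # (40,)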

Fig. 3. Data correlation of the Audio feature using Heatmap A

C. Pre-Processing
Pre-processing plays a crucial role in the analysis of audio data, particularly when working with extracted features. Its main objective is to improve the quality of audio signals, reduce noise, and extract relevant information for subsequent analysis. In this particular scenario, pre-processing involves working with extracted features obtained from voice data samples stored in a comma-separated value (.csv) file. During pre-processing, the following techniques were employed to enhance audio signal quality and remove unwanted noise: filtering, normalization, denoising, and resampling.
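A minimal sketch of this kind of signal-level pre-processing is given below; the target sample rate, the silence-trimming threshold, and the peak normalization are assumptions chosen for illustration, not the authors' reported settings.

# Sketch of simple audio pre-processing: resampling, silence trimming, and peak normalization.
# All parameter values are illustrative assumptions.
import numpy as np
import librosa

signal, sr = librosa.load("sample.wav", sr=16000)     # load and resample to a common rate
signal, _ = librosa.effects.trim(signal, top_db=25)   # trim leading and trailing silence
peak = np.max(np.abs(signal))
if peak > 0:
    signal = signal / peak                            # peak-normalize to the range [-1, 1]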
Fig. 4. Data correlation of the Audio feature using Heatmap B

D. Exploratory Data Analysis and Correlation
Exploratory data analysis (EDA) is an approach used to understand the information and patterns within a dataset, without necessarily focusing on formal modeling or hypothesis testing. It involves examining the data to gain insights, identify relationships, and uncover potential correlations among the extracted features. In the context of the current analysis, EDA was conducted to assess the correlation between the extracted features from the dataset. To visualize and analyze this correlation, a heatmap was employed. The heatmap, presented in Fig. 3 and Fig. 4, provides a graphical representation of the relationships between different features and allows for a quick assessment of their strength and direction. By examining the heatmap, patterns and dependencies between the features can be identified. High correlations between features indicate a strong relationship, while low correlations suggest a weaker or no relationship.

Fig. 5. Data correlation range

A heatmap is a visualization tool that employs a color-coded matrix to represent data. In the context of this analysis, the heatmap is used to depict the relationship between various features extracted from the audio data. Fig. 5 shows the range of the correlation coefficient, which runs from -1 to 1 and measures the strength and direction of the relationship between two features. A correlation coefficient of 1 indicates a perfect positive correlation, typically occurring when a feature is compared to itself. It signifies that as one variable increases, the other variable also increases at the same rate. On the other hand, a correlation coefficient of -1 indicates a complete negative correlation, where an increase in one variable corresponds to a decrease in the other variable at the same rate. If the correlation coefficient is 0, it implies the absence of any correlation between the two features. When observing the heatmap, cells with correlation coefficients closer to 1 indicate a stronger positive influence between those features in the model training process. This implies that variations in one feature are strongly associated with variations in the other, making them important for modeling and analysis purposes.
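The kind of heatmap shown in Fig. 3 and Fig. 4 can be reproduced in outline with a few lines of pandas and seaborn, as in the hedged sketch below; the feature file name and the plotting choices are assumptions, since the paper does not state which plotting library was used.

# Sketch of the feature-correlation heatmap (pandas and seaborn assumed for illustration).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

features = pd.read_csv("features.csv").drop(columns=["speaker"])   # hypothetical feature table
corr = features.corr()                                             # pairwise correlations in [-1, 1]
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between extracted audio features")
plt.show()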


E. Modelling
The voice recognition model was constructed using a deep neural network architecture with the Keras library. Keras is a popular high-level deep learning framework that provides a user-friendly interface to build and train neural networks. The model architecture consisted of multiple layers of densely connected nodes, also known as fully connected layers. These layers allow information to flow in both forward and backward directions, enabling the network to learn complex patterns and relationships in the data. Each node in these layers is connected to every node in the previous and subsequent layers. The activation function used in the densely connected layers was the rectified linear unit (ReLU). ReLU is a widely used activation function that introduces non-linearity to the model. It transforms negative input values to zero and leaves positive values unchanged, enabling the network to learn more complex representations.
For the final classification step, a softmax activation function was employed. Softmax is commonly used in multi-class classification problems as it provides a probability distribution over the classes. It assigns probabilities to each class, indicating the likelihood of the input belonging to each class. By utilizing this deep neural network architecture with ReLU activation functions in the densely connected layers and a softmax activation function for classification, the model can effectively learn and classify voice patterns for speaker recognition tasks.

Fig. 6. The ANN Architecture

The artificial neural network (ANN) depicted in Fig. 6 consists of hidden layers that perform weighted computations on the received input and pass the results to the subsequent layers until reaching the output layer. Each layer in the neural network contains nodes, also known as neurons, which perform computations on the input data using weights and activation functions. The hidden layers of the neural network conduct a set of weighted computations on the input data they receive. These computations involve multiplying the input values by corresponding weights and summing them to obtain an intermediate result. This result is then passed through an activation function, which introduces non-linearity to the network and enables it to learn complex patterns.
Dropout layers are incorporated in the neural network architecture to address the issue of over-fitting during training. Over-fitting occurs when the model becomes too specialized to the training data and fails to generalize well to unseen data. Dropout layers randomly deactivate a fraction of the neurons during training, forcing the network to learn more robust and generalizable representations by preventing co-adaptation among neurons. The output layer of the neural network employs the softmax function. The softmax function takes the raw outputs from the preceding layers and converts them into probabilities for each class. These probabilities represent the likelihood of the input belonging to each class. The input is then classified based on the class with the highest probability, providing the final output of the network.
By utilizing this neural network architecture with hidden layers, dropout layers for regularization, and the softmax function in the output layer, the model can effectively learn and classify data into different classes, taking raw input and producing probabilistic predictions.
An artificial neural network (ANN) was used because of its effectiveness in processing and classifying complex audio data [10]. The benefits of using an ANN include:
1. Pattern Recognition: ANNs are well-suited for recognizing and learning patterns in data, making them ideal for voice recognition tasks where subtle variations in speech need to be detected and classified [10].
2. Non-linearity: The use of hidden layers allows the network to capture non-linear relationships in the data, which is crucial for accurately modelling the complex nature of human speech [10].
3. Generalization: By utilizing dropout layers, the network is less prone to over-fitting, meaning it can better generalize to unseen data and perform reliably in real-world scenarios.
However, the major drawback of an ANN is that it tends to overtrain [11], which requires tuning parameters such as the learning rate and dropout rate to achieve optimal performance.
Applications of artificial neural networks in voice recognition security systems include securing vaults, server rooms, smart homes, and smart hospitals [12].
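A minimal Keras sketch of the architecture described in this subsection (densely connected ReLU layers, dropout layers for regularization, and a softmax output over the ten speakers) is given below. The layer widths, dropout rate, and input size are illustrative assumptions, since the paper does not report the exact values.

# Sketch of the described architecture: Dense + ReLU hidden layers, Dropout, softmax output.
# Layer widths, dropout rate, and feature-vector length are assumed for illustration.
from tensorflow import keras
from tensorflow.keras import layers

num_features = 40    # length of each extracted feature vector (assumed)
num_speakers = 10    # ten speakers in the dataset

model = keras.Sequential([
    layers.Input(shape=(num_features,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),                                   # randomly deactivate neurons to limit over-fitting
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(num_speakers, activation="softmax"),      # class probabilities for each speaker
])
model.summary()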


F. Training
The training process of the neural network model was conducted using the stochastic gradient descent (SGD) optimizer. SGD is a widely used optimization algorithm in deep learning. It updates the model's weights iteratively based on the gradients computed on small batches of training data. The specific configuration of the SGD optimizer used in this training process includes a learning rate of 0.01, which determines the step size for weight updates. A higher learning rate allows for larger weight updates, potentially leading to faster convergence, but it can also risk overshooting the optimal solution. Conversely, a lower learning rate reduces the risk of overshooting but may slow down the convergence process. A momentum value of 0.9 was employed in the SGD optimizer. Momentum helps accelerate the training process by accumulating the gradients from previous steps and adding a fraction of them to the current gradient update. This allows the optimizer to navigate through flat areas or shallow local minima more efficiently.
The sparse categorical cross-entropy loss function was chosen for this training process. This loss function is commonly used when dealing with multi-class classification problems. It computes the cross-entropy loss between the predicted probabilities and the true class labels. The "sparse" aspect indicates that the true class labels are provided as integers instead of one-hot encoded vectors. The training process was executed over 30 epochs. An epoch represents a complete iteration over the entire training dataset. Training over multiple epochs allows the model to gradually refine its weights and improve its performance. A batch size of 20 was utilized during training. The batch size determines the number of samples processed before updating the model's weights. A smaller batch size introduces more frequent weight updates but can result in noisy gradients. Conversely, a larger batch size reduces the frequency of weight updates but provides a more stable estimation of the gradients.
Fig. 7 presents the training progress, such as the training loss and accuracy, over the 30 epochs. This visual representation aids in understanding the model's convergence and performance throughout the training process.

Fig. 7. Training of the Model
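The training configuration reported above (SGD with a learning rate of 0.01 and momentum of 0.9, sparse categorical cross-entropy, 30 epochs, and a batch size of 20) can be expressed in Keras roughly as follows; the variable names are carried over from the earlier sketches and are assumptions rather than the authors' code.

# Sketch of the reported training setup: SGD(learning_rate=0.01, momentum=0.9),
# sparse categorical cross-entropy, 30 epochs, batch size 20.
from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",   # integer class labels rather than one-hot vectors
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=30,
                    batch_size=20)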

G. Testing
The purpose of the testing process was to assess the model's generalization capabilities and its accuracy in predicting outcomes using unseen data. This evaluation involved utilizing a fresh set of voice recordings, which underwent the same preprocessing steps as the training dataset. Subsequently, the model's performance was evaluated based on its ability to accurately classify the new recordings into their respective classes.
Fig. 8 illustrates the difference in epoch loss between the training data and the test data. Towards the later portion of the graph, the test and train lines closely converge, indicating that the model is effectively generalizing to the data used in the testing process. This convergence suggests that the model is successfully adapting to new, unseen data and can make accurate predictions. This alignment between the test and train lines underscores the model's robustness and its potential for real-world applications.

Fig. 8. Graph showing the epoch losses between the train data and test data

H. Web Application
The web application was created utilizing Flask, a Python-based lightweight web framework. The application incorporated a web form that enabled users to upload an audio recording of their voice and submit it for prediction. The uploaded audio recording underwent preprocessing using the same techniques employed during the training phase. Subsequently, the preprocessed audio was passed into the saved voice recognition model for prediction. The predicted speaker name was then displayed on the web page.
To construct the web interface, HTML, CSS, and JavaScript were utilized for designing and implementing the web page. The Python Flask API was employed to run the server, enabling communication between the web application and the back-end voice recognition model. This combination of technologies facilitated the seamless integration of user interaction, data processing, and prediction generation in the web application. Figs. 9 and 10 give the feature extraction and prediction code flow incorporated into the web page.

Fig. 9. Feature extraction code on the web

Fig. 10. Prediction code
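The upload-and-predict flow behind Fig. 9 and Fig. 10 can be outlined with a short Flask route such as the one below. The route name, file names, helper function, and label mapping are assumptions for illustration, not the authors' exact code.

# Hedged sketch of the web prediction flow: receive an uploaded recording, extract features, predict the speaker.
import numpy as np
import librosa
from flask import Flask, request, render_template
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("voice_model.h5")          # previously saved Keras model (assumed file name)
speaker_names = ["speaker_0", "speaker_1", "speaker_2"]    # placeholder mapping from class index to name

def extract_features(file_obj):
    # Hypothetical helper: applies the same MFCC pipeline assumed for training.
    signal, sr = librosa.load(file_obj, sr=16000)
    return np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40), axis=1)

@app.route("/predict", methods=["POST"])
def predict():
    audio_file = request.files["audio"]                     # recording uploaded through the web form
    features = extract_features(audio_file)
    probabilities = model.predict(features[np.newaxis, :])  # softmax probabilities over speakers
    speaker = speaker_names[int(np.argmax(probabilities))]
    return render_template("result.html", speaker=speaker)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)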


V. RESULT AND DISCUSSION

The performance of the voice recognition system was assessed using a dedicated test dataset comprising 100 audio files. This dataset contained 10 audio files for each of the 10 speakers present in the training dataset. The evaluation of the system on this test dataset yielded an accuracy of 96%, as depicted in Fig. 11. Out of the 100 audio files, only 3 were misclassified by the system. Fig. 12 provides a visualization of the classification report generated from the evaluation of the test data on the model. The classification report offers detailed insights into the performance of the model for each class in the test dataset. It typically includes metrics such as precision, recall, F1-score, and support, which collectively provide a comprehensive assessment of the model's performance on each class.
The reported accuracy of 96% and the low number of misclassified audio files indicate that the voice recognition system achieved a high level of accuracy and effectiveness in accurately identifying and classifying speakers in the test dataset. These results suggest that the model trained on the training dataset generalized well to unseen data and demonstrates its potential for real-world voice recognition applications.

Fig. 11. Accuracy of the model

Fig. 12. Classification Report

A. Training Analysis
Fig. 13 presents the confusion matrix corresponding to the classification results depicted in Fig. 12, specifically highlighting the classification outcomes for each speaker. In the confusion matrix, the diagonal elements signify the number of correctly classified audio files for each speaker, while the off-diagonal elements represent the number of misclassified audio files. The matrix reveals that the majority of misclassifications occurred between speakers who possessed similar accents or shared voice characteristics. This observation aligns with the challenges commonly encountered in voice recognition tasks, where distinguishing between speakers with similar traits can be more intricate. The occurrence of such misclassifications is not unexpected, given the inherent complexities associated with voice recognition systems. The confusion matrix provides valuable insights into the specific patterns of misclassifications and the relationships between speakers. By analyzing these patterns, further refinements or adjustments can be made to improve the model's performance, such as incorporating additional training data from speakers with similar accents or employing advanced techniques for accent or voice characteristic normalization.

Fig. 13. Confusion Matrix Analysis of the Classified Output
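The accuracy figure, classification report, and confusion matrix discussed in this section can be produced with scikit-learn as in the brief sketch below; scikit-learn is an assumption, since the paper does not name the evaluation library, and the variable names follow the earlier sketches.

# Sketch of the evaluation step: overall accuracy, per-class report, and confusion matrix.
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = np.argmax(model.predict(X_test), axis=1)    # predicted speaker index for each test file

print("Accuracy:", accuracy_score(y_test, y_pred))   # the paper reports 96% on the 100-file test set
print(classification_report(y_test, y_pred))         # precision, recall, F1-score, and support per speaker
print(confusion_matrix(y_test, y_pred))              # rows: true speakers, columns: predicted speakers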


B. Web Application Result
The web application utilizes a pre-trained speaker recognition model to make predictions based on audio files. The model is developed using the Keras package, while the application itself is implemented using the Python Flask API. The web application consists of two main components: the front-end and the back-end. The front-end is responsible for creating the user interface, allowing users to upload an audio file for prediction. It is developed using HTML, CSS, and JavaScript, providing a visually appealing and interactive interface for users. On the other hand, the back-end is built using the Flask API. When a user uploads an audio file through the front-end, the back-end receives the file, extracts the relevant audio features, and utilizes the pre-trained model to identify the speaker. The back-end processes the audio data and produces the prediction results. Fig. 14 displays the output details on the terminal, indicating whether the code has run successfully or encountered an error. In the case of a successful run, the terminal output includes the IP address, indicating that the program executed without issues. To access the web application, the IP address from the terminal output needs to be copied and pasted into a web browser. This action loads the web page, as illustrated in Fig. 15, providing the user interface where audio input can be provided.
To test the program, an audio file is required to be uploaded through the web page. After uploading the audio file, the user clicks the "predict" button. The back-end processes the audio data using the pre-trained model and generates a visual output on the screen, displaying the predicted speaker's name, as depicted in Fig. 16. These steps ensure a seamless user experience, from uploading the audio file to obtaining the predicted speaker's name through the web application's front-end and back-end integration.

Fig. 14. Starting the API for the web server

Fig. 15. Webpage View

Fig. 16. Web page showing the predicted speaker

VI. CONCLUSIONS

In conclusion, this paper has successfully presented an implemented system for predicting the speaker of a speech sample. Through a comprehensive review of relevant literature and a thorough comparison of various machine learning methods, it was determined that an artificial neural network would be the most suitable approach for realizing the system. The results obtained from the implemented system have demonstrated the potential of artificial neural networks in developing robust and reliable voice recognition systems. The system's accuracy and performance showcase its viability for deployment in diverse domains, including security, authentication, and communication. By achieving its goals and objectives, this paper has contributed to the advancement of voice recognition technology. The successful implementation of the system highlights the effectiveness of artificial neural networks in accurately identifying speakers and opens up new possibilities for their application in real-world scenarios. The findings of this work encourage further exploration and refinement of artificial neural network-based voice recognition systems. Future research can focus on enhancing the system's capabilities, such as handling diverse accents, improving accuracy, and addressing challenges related to speaker verification and identification.
Overall, this paper highlights the potential and significance of artificial neural networks in the development of robust and reliable voice recognition systems. The presented work contributes to the existing body of knowledge in the field and paves the way for future advancements in voice recognition technology.

REFERENCES
[1] T. Gulzar, A. Singh, D. K. Rajoriya and N. Farooq, "A Systematic Analysis of Automatic Speech Recognition: An Overview," International Journal of Current Engineering and Technology, vol. 4, no. 3, pp. 1664-1675, 2014.
[2] H.N. Mohd. Shah, M.Z. Ab Rashid, M.F. Abdollah, M.N. Kamarudin, C.K. Lin and Z. Kamis, "Biometric Voice Recognition in Security System," Indian Journal of Science and Technology, vol. 7, no. 2, pp. 104-112, 2014.
[3] A. Olubukola, A. Adeoluwa, O. Abraham, B. Oyetunde and O. Ayorinde, "Voice Recognition Door Access Control System," IOSR Journal of Computer Engineering (IOSR-JCE), vol. 21, no. 5, pp. 1-12, 2019.
[4] Cypress Data Defense, "6 Password Security Risks and How to Avoid Them," June 2020. [Online]. Available: https://theconversation.com/passwords-security-vulnerability-constraints-93164.
[5] University of York, "Researchers expose vulnerabilities of password managers," 16 March 2020. [Online]. Available: https://www.york.ac.uk/news-and-events/news/2020/research/expose-vulnerabilities-password-managers/.
[6] P. Neil, "PIN Authentication Passkeys – Say Goodbye to Passwords," 25 April 2023. [Online]. Available: https://vaultvision.com/blog/pin-authentication-passkeys.
[7] L.R. Rabiner and B.H. Juang, "Speech recognition: Statistical methods," in K. Brown (Ed.), Encyclopedia of Language & Linguistics, pp. 1-18, Amsterdam: Elsevier, 2006.
[8] H.F. Pai and H.C. Wang, "A two-dimensional cepstrum approach for the recognition of mandarin syllable initials," Pattern Recognition, vol. 26, no. 4, pp. 569-577, 1993.
[9] S. Furui, "History and development of speech recognition," in Speech Technology, pp. 1-18, New York: Springer, 2010.
[10] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, 2015.
[11] M. Malik, M.K. Malik, K. Mehmood and I. Makhdoom, "Automatic speech recognition: a survey," Multimedia Tools and Applications, vol. 80, pp. 9411-9457, 2021.
[12] A. Ismail, S. Abdlerazek and I.M. El-Henawy, "Development of Smart Healthcare System Based on Speech Recognition Using Support Vector Machine and Dynamic Time Warping," Sustainability, vol. 12, p. 2403, 2020.
