Design of A Voice Recognition System Using Artificial Intelligence
Abstract: Voice recognition systems have gained significant prevalence in our everyday lives, encompassing a wide range of applications, from virtual assistants on smartphones to voice-controlled home automation systems. This research paper presents a comprehensive design and implementation of a voice recognition security system employing artificial neural networks. The system was trained on a dataset of 900 audio samples collected from 10 distinct speakers, enabling the resulting model to classify the speaker of a given audio sample. Python serves as the primary programming language for the implementation. The system leverages the Keras library, which offers a high-level interface for constructing and training neural networks, with efficient computation facilitated by the TensorFlow back-end. Additionally, the Flask framework, a Python-based web framework, was utilized to create a user interface in the form of a web application for the voice recognition system. To effectively train the artificial neural network, the audio data undergoes preprocessing, involving the extraction of relevant features from the audio samples. During this phase the audio data is also labelled, and the neural network is trained on the labelled dataset to learn the classification of different speakers. The trained model was rigorously tested on a set of previously unseen audio samples, yielding a classification accuracy exceeding 96%. The finalized model was integrated into the web application, enabling users to upload audio files and receive predictions of the speaker's identity. This paper demonstrates the efficacy of artificial neural networks in the context of voice recognition systems, while also providing a practical framework for constructing such systems using readily available tools and libraries.

I. INTRODUCTION

… user IDs, and PINs. Nonetheless, these systems lack robustness due to the susceptibility of PIN codes to hacking and the potential theft and replication of ID cards [2]. Consequently, the emergence of a novel technology called biometrics is expected to engender enhanced trust in security systems. Biometrics comprises various techniques employed for identifying individuals based on their distinctive physical and behavioural traits. Examples of such identifying features include fingerprints, voice patterns, facial characteristics, retinal and iris scans, signatures, hand geometry, and wrist veins [2].

Biometric technology operates by employing a user's unique physical attribute as the password or feature parameter. Human characteristics such as voice, face, and fingerprints are commonly utilized as feature parameters. Because no two individuals share identical voice characteristics, users can employ voice recognition systems with confidence and safety [2]. According to a survey conducted by Unisys, 32% of respondents favored voice recognition, 27% preferred fingerprints, 20% opted for facial scans, 12% favored hand geometries, and 10% expressed a preference for iris scans [3]. Given that the human voice embodies the most pervasive and instinctive form of human communication, voice recognition stands as a leading biometric technology. Consequently, voice recognition systems hold potential benefits for securing doors, vaults, confidential laboratories, and other restricted areas.
… utilizing a user's photograph to bypass security measures [6]. Moreover, these methods often prove inadequate and incompatible for individuals with physical disabilities, posing additional challenges [7].

III. LITERATURE REVIEW

A. Speech Recognition Systems

Drenthen (2012) introduced a fundamental speech recognition system that encompasses several distinct stages, namely pre-processing, feature extraction, clustering, and classification [8]. The pre-processing module operates on the input speech signal, where enhancing the signal-to-noise ratio is crucial. In the second step, relevant features of the signal are extracted using an appropriate feature extraction technique. The third step determines cluster centroids by applying the k-means algorithm to the feature vectors. Lastly, a pattern-matching technique is used to recognize the speech signal, with the matching score dependent on both the chosen algorithm and the size of the training database [9].

Speaker recognition begins by extracting distinctive features from the audio signal, such as prosodic and spectral features. These features are then used to develop a speaker voice model using machine learning algorithms. The speaker recognition system compares the input audio to a speaker database in order to identify the specific speaker. Fig. 1 provides a visual representation of the step-by-step process to achieve the desired outcome. The implementation stages include data acquisition, pre-processing, feature extraction, training and evaluation, and classification of the voice input.
B. Feature Extraction

Various techniques can be employed to extract unique features from the audio dataset, including Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), Discrete Wavelet Transform (DWT), Perceptual Linear Prediction (PLP), and more. When extracting features from audio data, a range of characteristics can be considered, such as zero-crossing rate, energy, spectral roll-off, spectral flux, spectral entropy, chroma features, pitch, MFCC, spectral bandwidth, and spectral centroid. These features provide valuable information about the properties and patterns within the audio signals. For the dataset at hand, MFCC was chosen as the feature extraction technique, as it captures the Mel-frequency spectrum and cepstral coefficients of the audio signals.
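The paper does not reproduce its extraction code, so the following is only a minimal sketch of how MFCCs and a few of the other characteristics listed above might be computed. It assumes the librosa library, a placeholder file name, and an illustrative coefficient count of 40; none of these specifics come from the paper.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    """Load one recording and return a fixed-length feature vector."""
    y, sr = librosa.load(path, sr=sr)  # decode and resample the audio
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    zcr = librosa.feature.zero_crossing_rate(y)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    # Average each feature over time so every clip yields one vector.
    return np.hstack([
        mfcc.mean(axis=1),
        zcr.mean(axis=1),
        centroid.mean(axis=1),
        rolloff.mean(axis=1),
    ])

features = extract_features("speaker01_sample.wav")  # hypothetical file
print(features.shape)  # (43,): 40 MFCCs plus 3 spectral descriptors
```

Averaging over time is one common way to obtain a fixed-length vector per clip; frame-level features with padding would be an equally valid design choice.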
C. Pre-Processing

Pre-processing plays a crucial role in the analysis of audio data, particularly when working with extracted features. Its main objective is to improve the quality of audio signals, reduce noise, and extract relevant information for subsequent analysis. In this work, pre-processing operates on features extracted from voice data samples stored in a comma-separated value (.csv) file. During pre-processing, the following techniques were employed to enhance audio signal quality and remove unwanted noise: filtering, normalization, denoising, and resampling.
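The exact filters are not specified in the paper, so the sketch below stands in for the named steps with common librosa equivalents: resampling on load, peak normalization, pre-emphasis as a simple high-pass filtering step, and silence trimming as a lightweight denoising step. The target sample rate is an assumption.

```python
import librosa

def preprocess(path, target_sr=16000):
    """Resample, normalize, and lightly clean one recording."""
    y, sr = librosa.load(path, sr=target_sr)   # resampling happens on load
    y = librosa.util.normalize(y)              # peak-normalize the amplitude
    y = librosa.effects.preemphasis(y)         # simple high-pass filtering step
    y, _ = librosa.effects.trim(y, top_db=25)  # drop leading/trailing silence
    return y, target_sr
```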
Fig. 3. Data correlation of the audio features using heatmap A

Fig. 4. Data correlation of the audio features using heatmap B
D. Data Correlation

A correlation coefficient of 1 indicates a complete positive correlation: as one variable increases, the other variable also increases at the same rate. On the other hand, a correlation coefficient of -1 indicates a complete negative correlation, where an increase in one variable corresponds to a decrease in the other variable at the same rate. If the correlation coefficient is 0, it implies the absence of any correlation between the two features. In the heatmap, cells with correlation coefficients closer to 1 indicate a stronger positive relationship between those features in the model training process. This implies that variations in one feature are strongly associated with variations in the other, making them important for modeling and analysis purposes.
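Heatmaps like those in Figs. 3 and 4 are commonly produced with pandas and seaborn; a minimal sketch follows, assuming the extracted features live in a hypothetical features.csv with a label column (neither name is given in the paper).

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# "features.csv" is a placeholder for the extracted-feature file described above.
df = pd.read_csv("features.csv")
corr = df.drop(columns=["label"]).corr()  # pairwise Pearson coefficients in [-1, 1]

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of extracted audio features")
plt.tight_layout()
plt.show()
```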
E. Modelling

The voice recognition model was constructed as a deep neural network using the Keras library. Keras is a popular high-level deep learning framework that provides a user-friendly interface to build and train neural networks. The model architecture consisted of multiple layers of densely connected nodes, also known as fully connected layers, in which each node is connected to every node in the previous and subsequent layers. During training, activations flow forward through these layers while error gradients propagate backward, enabling the network to learn complex patterns and relationships in the data. The activation function used in the densely connected layers was the rectified linear unit (ReLU). ReLU is a widely used activation function that introduces non-linearity to the model: it transforms negative input values to zero and leaves positive values unchanged, enabling the network to learn more complex representations.

For the final classification step, a softmax activation function was employed. Softmax is commonly used in multi-class classification problems as it provides a probability distribution over the classes, assigning to each class the likelihood that the input belongs to it. By utilizing this deep neural network architecture with ReLU activation functions in the densely connected layers and a softmax activation function for classification, the model can effectively learn and classify voice patterns for speaker recognition tasks.

Fig. 6. The ANN architecture

The artificial neural network (ANN) depicted in Fig. 6 consists of hidden layers that perform weighted computations on the received input and pass the results to the subsequent layers until reaching the output layer. Each layer in the neural network contains nodes, also known as neurons, which perform computations on the input data using weights and activation functions. These computations involve multiplying the input values by corresponding weights and summing them to obtain an intermediate result. This result is then passed through an activation function, which introduces non-linearity to the network and enables it to learn complex patterns.

Dropout layers are incorporated in the neural network architecture to address the issue of over-fitting during training. Over-fitting occurs when the model becomes too specialized to the training data and fails to generalize well to unseen data. Dropout layers randomly deactivate a fraction of the neurons during training, forcing the network to learn more robust and generalizable representations by preventing co-adaptation among neurons. The output layer of the neural network employs the softmax function, which takes the raw outputs from the preceding layers and converts them into probabilities for each class. The input is then assigned to the class with the highest probability, providing the final output of the network. By combining hidden layers, dropout layers for regularization, and a softmax output layer, the model can effectively learn to map raw input to probabilistic class predictions.

An artificial neural network (ANN) was used because of its effectiveness in processing and classifying complex audio data [10]. The benefits of using ANNs include:

1. Pattern Recognition: ANNs are well-suited for recognizing and learning patterns in data, making them ideal for voice recognition tasks where subtle variations in speech need to be detected and classified [10].

2. Non-linearity: The use of hidden layers allows the network to capture non-linear relationships in the data, which is crucial for accurately modelling the complex nature of human speech [10].

3. Generalization: By utilizing dropout layers, the network is less prone to over-fitting, meaning it can better generalize to unseen data and perform reliably in real-world scenarios.

However, a major drawback of ANNs is their tendency to over-train [11], which requires tuning parameters such as the learning rate and dropout rate to achieve optimal performance. In voice recognition security systems, ANNs are applied to secure vaults and server rooms and to enable smart homes and smart hospitals [12]. A sketch of how this architecture might be expressed in Keras is shown below.
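The paper confirms dense ReLU hidden layers, dropout regularization, and a softmax output over the 10 speakers; the layer widths, dropout rate, and feature-vector length in this sketch are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 43   # assumed length of the extracted feature vector
NUM_SPEAKERS = 10   # ten speakers in the dataset

model = keras.Sequential([
    layers.Input(shape=(NUM_FEATURES,)),
    layers.Dense(256, activation="relu"),   # fully connected hidden layer
    layers.Dropout(0.3),                    # regularization against over-fitting
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_SPEAKERS, activation="softmax"),  # class probabilities
])
model.summary()
```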
F. Training

The training process of the neural network model was conducted using the stochastic gradient descent (SGD) optimizer. SGD is a widely used optimization algorithm in deep learning that updates the model's weights iteratively based on the gradients computed on small batches of training data. The specific configuration of the SGD optimizer used in this training process includes a learning rate of 0.01, which determines the step size for weight updates. A higher learning rate allows for larger weight updates, potentially leading to faster convergence, but it can also risk overshooting the optimal solution. Conversely, a lower learning rate reduces the risk of overshooting but may slow down the convergence process. A momentum value of 0.9 was employed in the SGD optimizer. Momentum helps accelerate the training process by accumulating the gradients from previous steps and adding a fraction of them to the current gradient update, allowing the optimizer to navigate through flat areas or shallow local minima more efficiently. The corresponding compile-and-fit configuration is sketched below.
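The learning rate of 0.01 and momentum of 0.9 come directly from the paper; the epoch count, batch size, loss function, and the X_train/y_train and X_test/y_test array names are assumptions for illustration.

```python
from tensorflow import keras

# Learning rate and momentum match the values reported in the paper.
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(
    optimizer=optimizer,
    loss="sparse_categorical_crossentropy",  # integer speaker labels assumed
    metrics=["accuracy"],
)
history = model.fit(
    X_train, y_train,                      # assumed feature/label arrays
    validation_data=(X_test, y_test),
    epochs=100,                            # assumed epoch count
    batch_size=32,                         # small batches, as described above
)
```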
G. Testing

The purpose of the testing process was to assess the model's generalization capabilities and its accuracy in predicting outcomes on unseen data. This evaluation used a fresh set of voice recordings, which underwent the same preprocessing steps as the training dataset. The model's performance was then evaluated on its ability to accurately classify the new recordings into their respective classes.

Fig. 9. Feature extraction code on the web
Fig. 8 illustrates the epoch loss on the training data and the test data. Towards the later portion of the graph, the test and train curves closely converge, indicating that the model generalizes effectively to the data used in testing. This convergence suggests that the model adapts successfully to new, unseen data and can make accurate predictions, underscoring its robustness and its potential for real-world applications.
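A plot like Fig. 8 is typically produced from the History object that Keras returns from model.fit(); the following is a hedged sketch of how such a figure might be generated, reusing the history variable assumed in the training sketch above.

```python
import matplotlib.pyplot as plt

# "history" is the object returned by model.fit() in the training sketch.
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="test loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Epoch loss on training and test data")
plt.show()
```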
V. RESULT AND DISCUSSION

The performance of the voice recognition system was assessed using a dedicated test dataset comprising 100 audio files, containing 10 audio files for each of the 10 speakers present in the training dataset. The evaluation of the system on this test dataset yielded an accuracy of 96%, as depicted in Fig. 11. Out of the 100 audio files, only 3 were misclassified by the system. Fig. 12 provides a visualization
of the classification report generated from the evaluation of
the test data on the model. The classification report offers
detailed insights into the performance of the model for each
class in the test dataset. It typically includes metrics such as
precision, recall, F1-score, and support, which collectively
provide a comprehensive assessment of the model's
performance on each class.
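Per-class reports of this kind are commonly generated with scikit-learn; the following is a minimal sketch under the assumption that X_test and y_test hold the held-out features and integer speaker labels used elsewhere in this paper.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# X_test/y_test: held-out features and labels for the 100 test files (assumed names).
y_prob = model.predict(X_test)
y_pred = np.argmax(y_prob, axis=1)   # pick the highest-probability speaker

print(classification_report(y_test, y_pred))  # precision, recall, F1-score, support
print(confusion_matrix(y_test, y_pred))       # misclassification patterns per speaker
```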
The reported accuracy of 96% and the low number of misclassified audio files indicate that the voice recognition system achieved a high level of accuracy and effectiveness in identifying and classifying the speakers in the test dataset. These results suggest that the model trained on the training dataset generalized well to unseen data and demonstrate its potential for real-world voice recognition applications.

… recognition systems. The confusion matrix provides valuable insights into the specific patterns of misclassifications and the relationships between speakers. By analyzing these patterns, further refinements or adjustments can be made to improve the model's performance, such as incorporating additional training data from speakers with similar accents or employing advanced techniques for accent or voice characteristic normalization.
… front-end and back-end integration.

Fig. 14. Starting the API for the web server

Fig. 15. Webpage view

Fig. 16. Web page showing the predicted speaker
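The paper states that Flask serves a web interface (Figs. 14-16) through which users upload an audio file and receive the predicted speaker. The paper's actual routes and helpers are not reproduced here, so this is only a minimal sketch: the endpoint name, model file name, and the reuse of the extract_features helper from the feature-extraction sketch are all assumptions.

```python
from flask import Flask, request, jsonify
import numpy as np
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("speaker_model.h5")  # assumed saved-model name

@app.route("/predict", methods=["POST"])
def predict():
    # Apply the same pipeline used in training; extract_features is the
    # helper sketched in the Feature Extraction section (an assumption).
    f = request.files["audio"]
    f.save("upload.wav")
    features = extract_features("upload.wav").reshape(1, -1)
    probs = model.predict(features)[0]
    return jsonify({"speaker": int(np.argmax(probs)),
                    "confidence": float(np.max(probs))})

if __name__ == "__main__":
    app.run(debug=True)  # starts the API for the web server, as in Fig. 14
```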
VI. CONCLUSIONS

In conclusion, this paper has presented an implemented system for predicting the speaker of a speech sample. Through a comprehensive review of relevant literature and a thorough comparison of various machine learning methods, it was determined that an artificial neural network would be the most suitable approach for realizing the system. The results obtained from the implemented system demonstrate the potential of artificial neural networks in developing robust and reliable voice recognition systems. The system's accuracy and performance showcase its viability for deployment in diverse domains, including security, authentication, and communication. By achieving its goals and objectives, this paper has contributed to the advancement of voice recognition technology. The successful implementation of the system highlights the effectiveness of artificial neural networks in accurately identifying speakers and opens up new possibilities for their application in real-world scenarios. The findings of this work encourage further exploration and refinement of artificial neural network-based voice recognition systems. Future research can focus on enhancing the system's capabilities, such as handling diverse accents, improving accuracy, and addressing challenges related to speaker verification and identification.

Overall, this paper highlights the potential and significance of artificial neural networks in the development of robust and reliable voice recognition systems. The presented work contributes to the existing body of knowledge in the field and paves the way for future advancements in voice recognition technology.

REFERENCES

[1] T. Gulzar, A. Singh, D.K. Rajoriya and N. Farooq, "A Systematic Analysis of Automatic Speech Recognition: An Overview," International Journal of Current Engineering and Technology, vol. 4, no. 3, pp. 1664-1675, 2014.
[2] H.N. Mohd. Shah, M.Z. Ab Rashid, M.F. Abdollah, M.N. Kamarudin, C.K. Lin and Z. Kamis, "Biometric Voice Recognition in Security System," Indian Journal of Science and Technology, vol. 7, no. 2, pp. 104-112, 2014.
[3] A. Olubukola, A. Adeoluwa, O. Abraham, B. Oyetunde and O. Ayorinde, "Voice Recognition Door Access Control System," IOSR Journal of Computer Engineering (IOSR-JCE), vol. 21, no. 5, pp. 1-12, 2019.
[4] Cypress Data Defense, "6 Password Security Risks and How to Avoid Them," June 2020. [Online]. Available: https://theconversation.com/passwords-security-vulnerability-constraints-93164.
[5] University of York, "Researchers expose vulnerabilities of password managers," 16 March 2020. [Online]. Available: https://www.york.ac.uk/news-and-events/news/2020/research/expose-vulnerabilities-password-managers/.
[6] P. Neil, "PIN Authentication Passkeys – Say Goodbye to Passwords," 25 April 2023. [Online]. Available: https://vaultvision.com/blog/pin-authentication-passkeys.
[7] L.R. Rabiner and B.H. Juang, "Speech recognition: Statistical methods," in K. Brown (Ed.), Encyclopedia of Language & Linguistics, pp. 1-18, Amsterdam: Elsevier, 2006.
[8] H.F. Pai and H.C. Wang, "A two-dimensional cepstrum approach for the recognition of mandarin syllable initials," Pattern Recognition, vol. 26, no. 4, pp. 569-577, 1993.
[9] S. Furui, "History and development of speech recognition," in Speech Technology, pp. 1-18, New York: Springer, 2010.
[10] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, 2015.
[11] M. Malik, M.K. Malik, K. Mehmood and I. Makhdoom, "Automatic speech recognition: a survey," Multimedia Tools and Applications, vol. 80, pp. 9411-9457, 2021.
[12] A. Ismail, S. Abdlerazek and I.M. El-Henawy, "Development of Smart Healthcare System Based on Speech Recognition Using Support Vector Machine and Dynamic Time Warping," Sustainability, vol. 12, p. 2403, 2020.