
International Journal of Computer Science Trends and Technology (IJCST) – Volume 10 Issue 3, May-Jun 2022

RESEARCH ARTICLE OPEN ACCESS

Automatic Speaker Recognition: A Survey


Rizwan K Rahim [1], Tharikh Bin Siyad [2], Muhammed Ameen M.A [3], Muhammed Salim K.T [4], Selin M [5]
[1]-[5] Computer Science and Engineering, APJ Abdul Kalam Technological University – India

ABSTRACT
Speaker recognition is the task of identifying persons from their voices. Recently, deep learning has dramatically revolutionized speaker recognition. This paper reviews several major subtasks of speaker recognition, including speaker verification, identification, and robust speaker recognition, with a focus on deep learning-based methods. An Automatic Speaker Recognition system is a biometric system that identifies and verifies people using the voice as a discriminatory feature. Automatic Speaker Recognition (ASR) using an Autoencoder is discussed here. This paper presents the deep learning methodologies for ASR, followed by different feature extraction techniques. Then the Autoencoder technology, its working and architecture, and how ASR works using deep learning are discussed. Finally, robust speaker recognition is surveyed from the perspectives of domain adaptation and speech enhancement, the two major approaches to dealing with domain mismatch and noise problems.
Keywords- Deep Learning, Automatic Speaker Recognition, Auto-encoder, Feature Extraction, MFCC.

I. INTRODUCTION

An Automatic Speaker Recognition (ASR) system is a non-invasive biometric system because it uses the voice as a discriminatory feature; it also shows great versatility during evaluation, since the process only requires that the user speaks, which constitutes a natural act of human behavior [3]. It is known that a speaker's voice contains personal traits of the speaker, given the speaker's unique pronunciation organs and speaking manner, e.g. the unique vocal tract shape, larynx size, accent, and rhythm. Therefore, it is possible to identify a speaker from his/her voice automatically via a computer system. This technology is termed automatic speaker recognition, which is the core topic of this paper. Speaker recognition is a fundamental task of speech processing and finds wide application in real-world scenarios. For example, it is used for the voice-based authentication of personal smart devices, such as cellular phones, vehicles, and laptops. It guarantees the transaction security of bank trading and remote payment. It has been widely applied to forensics for investigating whether a suspect is guilty or not, and to surveillance and automatic identity tagging. It is important in audio-based information retrieval for broadcast news, meeting recordings, and telephone calls. It can also serve as a frontend of automatic speech recognition (ASR) for improving the transcription performance of multi-speaker conversations [4].

Here, the reader gets a comprehensive overview of deep learning-based speaker recognition methods in terms of the vital subtasks and research topics, including speaker identification, speaker diarization, and robust speaker recognition. From this study, we hope to provide a useful resource for the speaker recognition community. The main contributions of this article are to summarize deep learning-based feature extraction techniques for speaker verification and identification; to give an overview of deep learning-based speaker diarization, with an emphasis on recent supervised, end-to-end, and online diarization; and to survey robust speaker recognition from the perspectives of domain adaptation and speech enhancement, the two major approaches to dealing with domain mismatch and noise problems.

Many studies have proposed techniques to improve the accuracy of ASR in noisy and reverberant conditions. One approach is to enhance the noisy features by applying noise removal techniques; others designed discriminative, handcrafted features that are more robust against noise and reverberation. Many works also propose adapting the acoustic models to noisy conditions. For deep learning frameworks, various architectures have been investigated to find better systems, such as recurrent neural networks (RNN) and convolutional neural networks (CNN).




But here, we deal with Automatic Speaker Recognition using the auto-encoder technology [5]. Among feature extraction techniques, Mel Frequency Cepstral Coefficients (MFCC) predominate. Even though MFCC is the most cited and used, there are some robust feature extraction techniques that can work more accurately and efficiently.

1.1 Overview and scope

This summary outlines three major research branches of speaker recognition: speaker verification, speaker identification, and robust speaker recognition. Robust speaker recognition deals with the challenges of noise and domain mismatch. The topics of the overview are organized as in Fig. 1.1 and are characterized briefly as follows.

Speaker verification aims at verifying whether an utterance is pronounced by a hypothesized speaker based on his/her pre-recorded utterances [6]. Speaker verification algorithms can be classified into stage-wise and end-to-end ones. A stage-wise speaker verification system usually consists of a front-end for the extraction of speaker features and a back-end for the similarity calculation between speaker features. The front-end transforms an utterance in the time domain or time-frequency domain into a high-dimensional feature vector, and it accounts for much of the recent advance in deep learning-based speaker recognition.

The back-end first computes a similarity score between enrollment and test speaker features and then compares the score with a threshold:

f(x_e, x_t; w) ≥ ξ ⇒ accept H0, otherwise accept H1 ………(1)

where f(·) indicates a function for calculating the similarity, w stands for the parameters of the back-end, x_e and x_t are the enrollment and test speaker features respectively, ξ is the threshold, H0 represents the hypothesis of x_e and x_t belonging to the same speaker, and H1 is the opposite hypothesis of H0. One of the major responsibilities of the back-end is to compensate for channel variability and reduce interference, e.g. language mismatch. Most back-ends thus aim at alleviating interference, which belongs to the problem of robust speaker recognition.
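As a minimal illustration of Eq. (1), the sketch below scores a pair of speaker embeddings and thresholds the result. The choice of cosine similarity for f(·), the toy embedding values, and the threshold ξ = 0.7 are illustrative assumptions, not settings from this paper.

```python
import numpy as np

def cosine_similarity(x_e, x_t):
    """f(x_e, x_t): cosine similarity between enrollment and test features."""
    return float(np.dot(x_e, x_t) / (np.linalg.norm(x_e) * np.linalg.norm(x_t)))

def verify(x_e, x_t, xi=0.7):
    """Accept H0 (same speaker) when the score reaches the threshold xi."""
    score = cosine_similarity(x_e, x_t)
    return "H0: same speaker" if score >= xi else "H1: different speakers"

# Toy enrollment/test embeddings standing in for front-end speaker features.
x_e = np.array([0.9, 0.1, 0.4])
x_t = np.array([0.8, 0.2, 0.5])
print(verify(x_e, x_t))
```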

Fig 1.1 Overview of deep learning-based speaker recognition

II. DEEP LEARNING

Deep learning, also called a Deep Neural Network, comprises many layers with various neurons in each layer. These layers can vary from a few to thousands, and each layer may further comprise thousands of neurons (processing units). The simplest function in a neuron is to multiply each input value by its allocated weight and sum up the results [8].
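A small sketch of this weighted-sum operation follows; the weights, bias, and ReLU activation are assumptions made for the example, not values from the paper.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Multiply each input by its allocated weight, sum, and apply ReLU."""
    z = np.dot(inputs, weights) + bias
    return max(0.0, z)  # ReLU activation

inputs = np.array([0.5, -1.2, 3.0])   # toy input values
weights = np.array([0.8, 0.1, -0.4])  # allocated weights
print(neuron(inputs, weights, bias=0.2))
```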


Deep learning approaches enable us to solve many problems. In the future, it is foreseeable that theories will emerge to explain deep learning's performance. Meanwhile, its capacity for unsupervised learning will be enhanced, since there are millions of pieces of data in the world and it is not practical to add labels to all of them. It is also predicted that neural network structures will become more complex so that they can extract more semantically significant features [7]. What is more, deep learning will be combined with reinforcement learning, and these combined benefits can be used to accomplish more tasks.

Deep learning models usually use hierarchical structures to connect their layers [9]. The output of a lower layer can be considered as the input of a higher layer via simple linear or nonlinear computations. These models can transform low-level features of the data into high-level abstract features. Owing to this characteristic, deep learning models can be more powerful than shallow machine learning models in feature representation.

There are many kinds of deep learning technologies that we can use for ASR programs. Even though the most cited and used one is the Convolutional Neural Network (CNN), it has several drawbacks. So we use the auto-encoder as the deep learning technology in this work.

III. METHODOLOGY

A systematic workflow of the proposed automatic speaker recognition system is shown in Fig. 3.1. Given an input speech signal, voice activity detection is performed to identify speech presence or speech absence in the given speech signal [10]. An auto-encoder is used to denoise the noisy input and enhance the quality and intelligibility of distorted speech signals. Then audio feature vectors are extracted and used to train the models using a Gaussian mixture model. Lastly, the network recognizes the speaker by testing the sample against the trained model.

Fig 3.1 Systematic workflow of the proposed system

3.1. VOICE ACTIVITY DETECTION

Voice Activity Detection is a strategy used in speech processing to recognize speech presence or speech absence in audio. This procedure processes the speech signals to rule out the silence fraction; otherwise, the training might be biased [11]. The Long Term Spectral Divergence (LTSD) algorithm [22] was used, concurrently with a noise compression script from SoX, to perform this task. The LTSD algorithm breaks an utterance into overlapping frames and gives each frame a score for the probability that it contains voice activity. The probabilities are then used to extract all the durations with voice activity.
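To make the frame-based idea concrete, here is a minimal VAD sketch. It substitutes a simple per-frame energy threshold for the full LTSD statistic, and the frame length, hop size, and threshold are illustrative assumptions (25 ms / 10 ms frames at 16 kHz), not the paper's settings.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a waveform into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def energy_vad(x, threshold=1e-3):
    """Flag frames whose mean energy exceeds a threshold as voice activity."""
    frames = frame_signal(x)
    energy = np.mean(frames ** 2, axis=1)
    return energy > threshold  # boolean mask, one entry per frame

# Toy signal: near-silence followed by a louder "speech" burst.
rng = np.random.default_rng(0)
signal = np.concatenate([0.001 * rng.standard_normal(8000),
                         0.1 * rng.standard_normal(8000)])
print(energy_vad(signal).astype(int))
```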




3.2. AUTO-ENCODER

Speech enhancement (SE) aims to improve the quality and intelligibility of degraded speech signals, which may be corrupted by background noise, interference, and recording equipment [12]. SE strategies are generally used as pre-processing in various audio-related applications, such as speech communication, automatic speech recognition (ASR), speaker recognition, hearing assistance, and cochlear implants. Denoising Autoencoders (DAE) have been widely explored in the field of speech signal processing: prior work documented the usefulness of DAE for dereverberation and distant-talking speech recognition, and [10] investigated the performance of DAE for unsupervised domain adaptation in speech emotion recognition.

The auto-encoder allows our speaker verification system to quickly adapt to the release of any new models of smart speakers. As a next step, we plan to extend the exploration beyond smart speakers to other fields in the industry where labeled speakers are scarce but unlabeled data is abundant.

An autoencoder-based semi-supervised curriculum learning scheme is proposed to automatically accumulate unlabeled data and iteratively update the corpus during training. This training technique allows us to (1) progressively grow the training corpus by using unlabeled data and rectifying previous labels at run-time; and (2) improve robustness when generalizing to varied conditions, such as out-of-domain and text-independent speaker verification tasks. This approach rapidly adapts the speaker verification system to an unseen new domain in which no labeled data is available [13]. It is also discovered that a denoising autoencoder can considerably enhance clustering accuracy when it is trained on a carefully selected subset of speakers.
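To make the DAE idea concrete, below is a minimal denoising autoencoder sketch in PyTorch (an assumed framework; the paper does not specify an implementation). The layer sizes, noise level, and training loop are illustrative, and random vectors stand in for the spectral speech frames a real SE front-end would process.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Encoder compresses the noisy input; decoder reconstructs the clean signal."""
    def __init__(self, dim=257, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Toy training loop: learn to map noisy frames back to clean frames.
model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

clean = torch.randn(32, 257)                   # stand-in for clean spectral frames
noisy = clean + 0.1 * torch.randn_like(clean)  # additive noise corruption

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)  # reconstruct clean from noisy
    loss.backward()
    optimizer.step()
```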
discrete Fourier transform, mel frequency, and
3.3. FEATURE EXTRACTION

An Automatic Speaker Recognition (ASR) system is a non-invasive biometric system because it uses the voice as a discriminatory feature; it also shows great versatility during examination, since the process only requires that the user speaks, which constitutes a natural act of human behaviour [3]. The voice carries six levels of information, from the spectral (lower) level to the semantic (upper) level, and the complexity of the information extraction procedure rises proportionally with the level at which one works [14]. In real circumstances, the voice can be corrupted by all kinds of noise, such as public transport sound, channel distortion, and even reverberation, so it is important to use techniques that are reliable in noisy conditions.

Feature extraction plays a crucial role in training a model. It is essential to extract a set of features from audio signals; the extracted features are provided as input to the classifier. In speech recognition, the feature vector represents the speech waveform. There are various feature extraction strategies available, such as MFCC, delta MFCC, LPCC, PCA, etc. [15].

Mel Frequency Cepstral Coefficients (MFCC)

Mel Frequency Cepstral Coefficients (MFCC) is one of the most cited and used methods in the speech processing community. It is based on a simulation of cochlear auditory capability, with a filterbank uniformly spaced on the Mel frequency scale; when mapped back to the linear frequency scale, the spacing between filters is linear within the first 1000 Hz.

The Fourier transformation of the time-domain audio signal into the frequency domain is called a spectrum. Using the fast Fourier transform, the samples of each frame are converted into the frequency domain, i.e. a spectrum. The Mel scale value for a frequency f is found using the equation:

Mel(f) = 2595 log10(1 + f/700)

The log magnitude of the Mel-filtered spectrum is called the mel spectrum. The Discrete Cosine Transform (DCT) is applied to the mel spectrum, and the mel frequency cepstral coefficient features are computed. The computation of MFCC features comprises several phases: pre-processing, framing, windowing, estimation of the discrete Fourier transform, mel frequency warping, and the inverse discrete Fourier transform.

Functionally, this scheme is based on the introductory process of windowing and overlapping; then the signal power spectrum is estimated and distributed into sub-bands through a Mel filterbank, after which it is logarithmically compressed, and finally the Discrete Cosine Transform (DCT) is applied to concentrate the information in the first coefficients.
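This pipeline can be reproduced in a few lines with an off-the-shelf library; the sketch below uses librosa (an assumed tool, not one the paper names), with a synthetic tone standing in for real speech and 13 coefficients as a typical choice.

```python
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)  # synthetic 220 Hz tone as a stand-in for speech

# Framing, windowing, FFT, Mel filterbank, log compression, and DCT
# are all handled internally by librosa.feature.mfcc.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)

# The Mel mapping itself: Mel(f) = 2595 * log10(1 + f / 700)
print(2595 * np.log10(1 + 1000 / 700))  # roughly 1000 mel at 1 kHz
```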




Fig 4.1 Process of MFCC feature extraction

Fig 4.2 Scaled Filterbank on Mel Frequency

3.4. SPEAKER RECOGNITION

Speaker recognition is a technique used to automatically recognize a speaker from a recording of their voice or speech utterance. It has evolved into an economical and reliable method for person identification and verification. This paper presents the development of an automatic speaker recognition system that incorporates the classification and recognition of speakers. Four classifier models, namely Support Vector Machines, K-Nearest Neighbors, Multilayer Perceptrons (MLP), and Random Forest (RF), are trained using the WEKA data mining tool [16]. Auto-WEKA is used to select the best classifier model together with its best hyper-parameters. The performance of each model is assessed in WEKA using 10-fold cross-validation, with RMSE, Accuracy, Precision, and Recall used as the evaluation measurements.
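The paper runs this evaluation in WEKA; as a rough Python equivalent (a sketch assuming scikit-learn and toy data, not the authors' setup), the snippet below evaluates the same four model families with 10-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Toy feature matrix standing in for per-utterance MFCC-derived features.
X, y = make_classification(n_samples=300, n_features=13, n_classes=3,
                           n_informative=8, random_state=0)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```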

Speaker recognition is again divided into two phases: Speaker Verification and Speaker Enrollment.

3.4.1. Recognition: This is a one-on-one matching process. Here we have prior information that the claimed identity is speaker "X" (this is the recognition phase), so the voice is matched against speaker "X"'s voiceprint only (which we obtained in the enrollment phase). Based on the degree of similarity, we can set the threshold for matching.

3.4.2. Speaker Enrollment: In this phase, when a new user comes into the system, their voice samples are stored, the d-vector is calculated for all the samples, and an average is taken and stored as that user's voiceprint, so that the next time the same user comes we can match against this stored voiceprint. Longer voice samples help to capture features better, and more samples help to capture the variation in the user's voice. A good voice sample falls in the range of 3-5 seconds. Speaker Enrollment is also known as Speaker Identification.
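A minimal sketch of these two phases follows, assuming d-vector embeddings are already extracted by some front-end; the averaging, cosine scoring, and the 0.7 threshold are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def enroll(sample_dvectors):
    """Enrollment: average the d-vectors of a user's samples into a voiceprint."""
    return np.mean(sample_dvectors, axis=0)

def recognize(test_dvector, voiceprint, threshold=0.7):
    """Recognition: one-on-one match of a test voice against the claimed voiceprint."""
    score = np.dot(test_dvector, voiceprint) / (
        np.linalg.norm(test_dvector) * np.linalg.norm(voiceprint))
    return score >= threshold

rng = np.random.default_rng(1)
samples = rng.standard_normal((5, 128))             # five enrollment d-vectors for speaker "X"
voiceprint = enroll(samples)
test = samples[0] + 0.1 * rng.standard_normal(128)  # a new utterance claimed to be "X"
print(recognize(test, voiceprint))
```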

Figure 4.4 shows both Speaker Verification and Speaker Identification.

Fig 4.4 Speaker Verification and Speaker Identification


IV. CONCLUSION

This paper presented a light introduction to an Automatic Speaker Recognition system using deep learning and to the present state of the field. In the proposed system, the first phase is voice activity detection, performed by the Long Term Spectral Divergence (LTSD) algorithm. We then discussed the auto-encoder technology as the denoising or speech enhancement technique used in the ASR system. We introduced the feature extraction procedure, the most important phase in the system; a well-known feature extraction method is used, namely MFCC (Mel Frequency Cepstral Coefficients). Finally, we looked into the speaker verification procedures and their classifications.

V. REFERENCES

[1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, 2010.

[2] R. P. Ramachandran, K. R. Farrell, R. Ramachandran, and R. J. Mammone, "Speaker recognition: general classifier approaches and data fusion methods," Pattern Recognition, vol. 35, no. 12, pp. 2801-2821, 2002.

[3] E. Campbell, J. Lara, and G. Hernández-Sierra, "Feature extraction of automatic speaker recognition, analysis and evaluation in a real environment," 2018.

[4] P. Červa, J. Silovský, J. Zdánský, J. Nouza, and L. Seps, "Speaker-adaptive speech recognition using speaker diarization for improved transcription of large spoken archives," Speech Communication, vol. 55, pp. 1033-1046, 2013.

[5] D. Yu and J. Li, "Recent progresses in deep learning based acoustic models," IEEE/CAA Journal of Automatica Sinica, 2017.

[6] R. Das and S. Prasanna, "Speaker verification from short utterance perspective: A review," IETE Technical Review, vol. 35, pp. 1-19, 2017.

[7] I. H. Sarker, "Machine learning: Algorithms, real-world applications and research directions," SN Computer Science, vol. 2, article 160, 2021.

[8] S. Walczak and N. Cerpa, "Artificial neural networks," in Encyclopedia of Physical Science and Technology, 3rd ed., 2003.

[9] J. Ma, M. K. Yu, S. Fong, K. Ono, E. Sage, B. Demchak, R. Sharan, and T. Ideker, "Using deep learning to model the hierarchical structure and function of a cell."

[10] D. S. Jat, ..., C. Singh, "Voice activity detection-based home automation system for people with special needs," in Intelligent Speech Signal Processing, 2019.

[11] Z.-H. Tan, A. K. Sarkar, and N. Dehak, "rVAD: An unsupervised segment-based robust voice activity detection method."

[12] C. Yu, R. E. Zezario, S.-S. Wang, J. Sherman, Y.-Y. Hsieh, X. Lu, H.-M. Wang, and Y. Tsao, "Speech enhancement based on denoising autoencoder with multi-branched encoders."

[13] S. Zheng, G. Liu, H. Suo, and Y. Lei, "Autoencoder-based semi-supervised curriculum learning for out-of-domain speaker verification," Machine Intelligence Technology, Alibaba Group.

[14] K. Adnan and R. Akbar, "An analytical study of information extraction from unstructured and multidimensional big data," Journal of Big Data, vol. 6, article 91, 2019.

[15] V. A. Kherdekar and S. A. Naik, "Convolution neural network model for recognition of speech for words used in mathematical expression," 2021.

[16] T. B. Mokgonyane, T. J. Sefara, T. I. Modipa, M. M. Mogal, and M. J. Manamela, "Automatic speaker recognition system based on machine learning algorithms," 2019.

[17] D. Ferbrache, "Passwords are broken: the future shape of biometrics," Biometric Technology Today, vol. 2016, no. 3, pp. 5-7, 2016.

[18] L. Hamid, "Biometric technology: not a password replacement, but a complement," Biometric Technology Today, vol. 2015, no. 6, pp. 7-10, 2015.

[19] N. Singh, R. Khan, and R. Shree, "Applications of speaker recognition," Procedia Engineering, vol. 38, pp. 3122-3126, 2012.

[20] A. Larcher, K. A. Lee, B. Ma, and H. Li, "Text-dependent speaker verification: Classifiers, databases and RSR2015," Speech Communication, vol. 60, pp. 56-77, 2014.

[21] E. Aliyu, O. Adewale, and A. Adetunmbi, "Development of a text-dependent speaker recognition system," International Journal of Computer Applications, vol. 69, no. 16, 2013.

[22] E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4052-4056.

