21AIE315 - AI IN SPEECH PROCESSING
DISTINCTIVE SPEAKER IDENTIFICATION
USING CNN
TEAM 13
CH.EN.U4AIE21011 – Chennu Chaitanya
CH.EN.U4AIE21016 – Hemanth
CH.EN.U4AIE21052 – Sai Akshay
ABSTRACT:
CNNVoiceDetect applies deep learning to speaker identification tasks, utilizing a Convolutional Neural Network
(CNN) based model.
The study lays a foundation for further research in voice-based authentication systems and establishes a reliable way
to identify speakers from audio data.
The project aims to create an efficient and reliable speaker identification system applicable in various domains such
as security and speech recognition.
The CNN architecture incorporates residual blocks for feature extraction and pooling layers for dimensionality
reduction, enabling efficient processing of audio data. CNNVoiceDetect thus offers a complete solution for speaker
identification built on deep learning methods.
Future research may involve exploring optimization tactics, increasing speaker diversity in the dataset, and
implementing the model in real-world settings for enhanced security and user authentication.
PROBLEM STATEMENT
The problem addressed in this research is the development of a speaker identification system capable of
accurately distinguishing between different speakers based on audio samples.
Traditional methods often face challenges such as limited scalability, susceptibility to noise, and dependence
on handcrafted features.
Addressing these limitations, CNNVoiceDetect aims to leverage deep learning techniques to overcome these
challenges and achieve robust speaker identification performance across diverse datasets and environmental
conditions.
OBJECTIVE
The aim of the project is to develop and evaluate deep learning models, particularly Convolutional Neural
Networks (CNNs), for speaker identification and verification tasks.
The project aims to address the challenges of speaker recognition under various conditions, including noisy
environments and unconstrained audio data. By leveraging large-scale datasets such as VoxCeleb1 and
VoxCeleb2, the project seeks to achieve high accuracy rates in identifying speakers and distinguishing between
them.
Additionally, the project aims to explore different methodologies for fine-tuning pre-trained models and
evaluating their performance using metrics such as accuracy, Equal Error Rate (EER), and precision.
Ultimately, the goal is to contribute to the advancement of speaker recognition technology, with potential
applications in voice authentication, security systems, and other audio-based tasks.
LITERATURE REVIEW:
Deep Speaker Embeddings for Short-Duration Speaker Verification (2017)
Objective: Leverage deep neural networks for speaker verification with short-duration recordings, comparing deep embeddings with traditional i-vectors and advocating for treating speech as images for improved recognition.
Technology Used: Deep neural networks, specifically convolutional and fully-connected attention models, that learn speaker embeddings directly from time-frequency speech representations.

Unraveling Adversarial Examples against Speaker Identification – Techniques for Attack Detection and Victim Model Classification (2024)
Objective: Address the threat posed by adversarial examples to speaker recognition systems.
Technology Used: LightResNet34 and ECAPA-TDNN for attack classification and detection; Support Vector Machines (SVMs) with features extracted from the wav2vec 2.0 model.

Speaker Recognition for Multi-Speaker Conversations Using X-Vectors (2019)
Objective: Address speaker recognition in multi-speaker conversations by combining deep neural network (DNN) embeddings, specifically x-vectors, with speaker diarization techniques.
Technology Used: Deep Neural Networks (DNNs), particularly x-vectors, known for their effectiveness in both speaker recognition and diarization tasks.
VoxCeleb2: Deep Speaker Recognition (2018)
Objective: Speaker recognition under noisy and unconstrained conditions.
Technology Used: Deep neural network models trained on the VoxCeleb2 dataset that effectively recognize speaker identities from voice under various conditions.

Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020 (2020)
Objective: Present Clova's baseline system for the VoxCeleb Speaker Recognition Challenge 2020, focusing on ResNet-based models.
Technology Used: ResNet architecture, incorporating techniques such as self-attentive pooling (SAP) and Attentive Statistics Pooling (ASP) for feature aggregation.

Fine-Tuning wav2vec2 for Speaker Recognition (2021)
Objective: Explore applying the wav2vec2 framework to speaker recognition instead of speech recognition.
Technology Used: The wav2vec2 framework, which was originally designed for speech recognition.
Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers (2020)
Objective: Propose a unified model for speaker-attributed automatic speech recognition (SA-ASR) that addresses the challenges posed by overlapped speech.
Technology Used: An end-to-end SA-ASR model that jointly performs speaker counting, multi-speaker speech recognition, and speaker identification.

Strategies for Improving Speaker Discrimination in Target Speech Extraction (2020)
Objective: Enhance the speaker discrimination capability of SpeakerBeam for target speech extraction.
Technology Used: Time-domain implementation of SpeakerBeam (TD-SpeakerBeam), utilization of spatial features, and multi-task learning with SI-loss.
PROPOSED WORK:
Data Collection: The primary dataset utilized in this study is VoxCeleb2, a large-scale audio-visual speaker recognition
dataset consisting of over a million utterances from more than 6,000 speakers, collected from open-source media.
Additionally, noise samples are obtained from various sources to augment the dataset for robustness testing.
Data Preprocessing: Audio data undergoes preprocessing to ensure uniformity and compatibility with the model. This includes
resampling audio files to a consistent sample rate of 16 kHz, segmenting longer audio recordings into shorter clips, and augmenting data
by adding noise to simulate real-world conditions.
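As a concrete illustration of this step, the following minimal sketch (assuming the librosa library; the 3-second clip length and 10 dB signal-to-noise ratio are illustrative choices, not values taken from this project) resamples a recording to 16 kHz, segments it into fixed-length clips, and mixes in noise:

import numpy as np
import librosa

TARGET_SR = 16000      # consistent 16 kHz sample rate, as described above
CLIP_SECONDS = 3       # assumed fixed clip length for segmentation

def load_and_segment(path, clip_seconds=CLIP_SECONDS):
    """Resample an audio file to 16 kHz and split it into fixed-length clips."""
    audio, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    clip_len = TARGET_SR * clip_seconds
    n_clips = len(audio) // clip_len   # drop the short trailing remainder
    return [audio[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

def add_noise(clip, noise, snr_db=10.0):
    """Mix a noise sample into a clip at a given signal-to-noise ratio (dB)."""
    noise = np.resize(noise, clip.shape)   # tile or crop the noise to clip length
    clip_power = np.mean(clip ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clip_power / (noise_power * 10 ** (snr_db / 10)))
    return clip + scale * noise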
Model Development: The core of the methodology involves the development of CNN-based models for speaker recognition. The
model architecture consists of multiple layers of convolutional, pooling, and fully connected layers, designed to extract relevant features
from input audio spectrograms and classify them into speaker identities. The model architecture is based on prior research and
experimentation to optimize performance.
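A minimal Keras-style sketch of this kind of architecture is shown below; the layer sizes, input spectrogram shape, and number of speakers are illustrative assumptions rather than the exact configuration used in the project:

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_SPEAKERS = 1251          # assumed; set to the number of enrolled speakers
INPUT_SHAPE = (64, 300, 1)   # assumed (mel bands, time frames, channels)

def build_speaker_cnn():
    """Stacked conv/pool layers over spectrograms, then fully connected layers."""
    model = models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),                # pooling for dimensionality reduction
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),  # fully connected feature layer
        layers.Dense(NUM_SPEAKERS, activation="softmax"),  # speaker identities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model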
PROPOSED WORK:
The overall pipeline: Data Collection & Preprocessing → Dataset Generation → Model Training → Model Evaluation → Prediction.
PROPOSED WORK:
Model Training: The residual_block function allows for the effective training of deep neural networks by enabling the learning of
complex features while mitigating the vanishing gradient problem. It promotes better information flow and gradient propagation through
the network, leading to improved performance and convergence during training. The convolutional layers within the residual block are
responsible for learning spatial hierarchies of features within the input data. Activation functions introduce non-linearity into the model,
enabling it to learn complex mappings between inputs and outputs.
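A minimal sketch of such a residual_block, assuming a Keras functional-style implementation (the project's exact filter sizes and normalization details may differ):

from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Two convolutional layers plus a skip connection to ease gradient flow."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)                       # non-linearity for complex mappings
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when the shape changes so the addition is valid.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(shortcut)
    y = layers.Add()([shortcut, y])            # skip connection aids gradient propagation
    return layers.ReLU()(y)

The skip connection lets gradients bypass the convolutional layers during backpropagation, which is what allows deeper stacks of such blocks to converge.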
Model Evaluation: The trained models are evaluated using various metrics such as accuracy, Equal Error Rate (EER), and
precision-recall curves. Evaluation is performed on separate validation and test datasets to assess the generalization capability of the
models.
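For the EER in particular, a minimal sketch of how it can be computed from verification trial scores, assuming scikit-learn (the toy labels and scores below are purely illustrative):

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = genuine, 0 = impostor
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # threshold where the two rates cross
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 0, 0, 1, 0])         # toy genuine/impostor trials
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.3])
print(f"EER: {equal_error_rate(labels, scores):.2%}")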
RESULTS:
The results obtained from the conducted experiments showcase the performance of the proposed CNN-based models for
speaker recognition.
The models achieved a commendable accuracy of 95.4% on the test dataset, indicating their ability to effectively identify
speakers. Despite the high accuracy, the models exhibited a 7% Equal Error Rate (EER), suggesting a moderate level of
error in distinguishing between genuine and impostor speakers.
Additionally, the precision of the models was measured at 87%, highlighting their capability to correctly identify true
positives while minimizing false positives.
The model ultimately predicts speaker identities for unseen samples from the test dataset in near real time.
CONCLUSION:
The project successfully implements speaker identification using deep neural networks, showcasing the effectiveness of
deep embeddings compared to traditional methods.
The model achieves high accuracy and demonstrates the potential for real-world applications in speaker recognition
systems.
Robustness against variations in speech recordings and background noise levels.
Scalability to accommodate a larger number of speakers and datasets.
Adaptability for fine-tuning or retraining with additional data.
REFERENCES:
[1] Gautam Bhattacharya, Md Jahangir Alam, Patrick Kenny (2017). Deep Speaker Embeddings for Short-Duration Speaker Verification. In Proc. Interspeech 2017.
[2] Sonal Joshi, Thomas Thebaud, Jesús Villalba, Najim Dehak (2024). Unraveling Adversarial Examples against Speaker Identification – Techniques for Attack Detection and Victim Model Classification. arXiv:2402.19355. https://arxiv.org/html/2402.19355v1
[3] David Snyder, Daniel Garcia-Romero, Gregory Sell (2019). Speaker Recognition for Multi-Speaker Conversations Using X-Vectors. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[4] Joon Son Chung, Arsha Nagrani, Andrew Zisserman (2018). VoxCeleb2: Deep Speaker Recognition. In Proc. Interspeech 2018.
[5] Hee Soo Heo, Bong-Jin Lee, Jaesung Huh, Joon Son Chung (2020). Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020. In International Conference on Text, Speech, and Dialogue (pp. 423-436). Cham: Springer International Publishing.
[6] Nik Vaessen, David A. van Leeuwen (2022). Fine-Tuning wav2vec2 for Speaker Recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://ieeexplore.ieee.org/document/9413520/
[7] Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang (2021). In 2021 IEEE Spoken Language Technology Workshop (SLT).
[8] Marc Delcroix, Tsubasa Ochiai, Katerina Zmolikova, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki (2020).
[9] Nithin Rao Koluguri, Jason Li, Vitaly Lavrukhin, Boris Ginsburg (2020). arXiv:2010.12653 [eess.AS], 23 Oct 2020.
Thank You