
VOCAL EMOTION DETECTION USING

DEEP LEARNING

A PROJECT REPORT

Submitted by

PRASATH.S 611720104056

RATHINAVEL.M 611720104062

SRISURYAPRASANTH.S 611720104074
SUJITH.M.L 611720104076

in partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING
IN

COMPUTER SCIENCE AND ENGINEERING

R P SARATHY INSTITUTE OF TECHNOLOGY

ANNA UNIVERSITY: CHENNAI 600 025

MAY 2024
ANNA UNIVERSITY : CHENNAI 600 025
BONAFIDE CERTIFICATE

Certified that this project “VOCAL EMOTION DETECTION USING DEEP LEARNING” is the bonafide work of “PRASATH.S, RATHINAVEL.M, SRISURYAPRASANTH.S, SUJITH.M.L”, who carried out the project work under my supervision during JANUARY 2024 to MAY 2024.

SIGNATURE:
Dr. R. VASANTHI, Ph.D.
HEAD OF THE DEPARTMENT
Professor,
Computer Science and Engineering,
R P Sarathy Institute of Technology,
Salem-636 305.

SIGNATURE:
Mrs. K. MANJUPARKAVI, M.E., (Ph.D.)
SUPERVISOR
Assistant Professor,
Computer Science and Engineering,
R P Sarathy Institute of Technology,
Salem-636 305.

Submitted for the University Project Viva Voce on

..................................... ........................................
Internal Examiner External Examiner
ACKNOWLEDGEMENT

We would like to express our deep sense of gratitude and heartfelt thanks to
LATE Thiru.R.P.SARATHY, Founder, R P Sarathy Institute of Technology.

We express our deep gratitude to our beloved Er.B.NITISH HARIHAR, Chairman, R P Sarathy Institute of Technology, who gave us the golden opportunity to do this wonderful project.

We owe our genuine gratitude to Mrs. AISHWARYA NITISH HARIHAR,

Pro-Chairman, R P Sarathy Institute of Technology, for providing all necessary


facilities and guidance.

We express our deep gratitude to our beloved Thiru. G.PRABAKARAN,


Vice-Chairman and Secretary, R P Sarathy Institute of Technology, for
providing support for this project.

We express our warm thanks to Dr. MUNUSAMI VISWANANTHAN,

Principal, R P Sarathy Institute of Technology, for helping us to successfully


carry out this project by providing all the required facilities.

We wish to express our profound thanks to Dr.R.VASANTHI, Professor & Head, Department of Computer Science and Engineering, for the encouragement and inspiration. We sincerely thank our project coordinator Mr.M.PRAKASH KUMAR, Assistant Professor, for the valuable suggestions given in every review. Our sincere and hearty thanks to our project supervisor
Mrs.K.MANJUPARKAVI, Assistant Professor, Department of Computer
Science and Engineering, for her valuable guidance, timely suggestions and
constructive ideas throughout this project.

We extend our thanks to staff who cooperated with us in every deed of this project.
We also thank our friends and parents for their continuous encouragement and the
untiring support rendered to us in all deeds and walks of this project.
ABSTRACT

Vocal emotion detection, a crucial aspect of affective computing, plays a


pivotal role in enhancing human-computer interaction and understanding
emotional cues in spoken language. This paper presents an investigation
into the application of deep learning techniques for vocal emotion
detection, aiming to leverage the capabilities of neural networks in
capturing intricate patterns and dependencies in speech signals. Emotion
recognition from speech signals is an important but challenging component
of Human-Computer Interaction (HCI). In the literature of speech emotion
recognition (SER), many techniques have been utilized to extract emotions
from signals, including many well-established speech analysis and
classification techniques. Deep Learning techniques have been recently
proposed as an alternative to traditional techniques in SER. This paper
presents an overview of Deep Learning techniques and discusses some
recent literature where these methods are utilized for speech-based emotion
recognition. The review covers databases used, emotions extracted,
contributions made toward speech emotion recognition and limitations
related to it.

TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

ABSTRACT iv

LIST OF FIGURES vii

LIST OF ABBREVIATION viii

1 INTRODUCTION 1

1.1 Overview 2
1.2 Objective 2
1.3 Scope 2

2 LITERATURE REVIEW 3
2.1 Speech Emotion Recognition Using Deep Learning Techniques: A Review 3

2.2 Emotional Speech Recognition Using Deep Neural Networks 3

2.3 Speech Emotion Detection with Deep Learning 4

2.4 Advancements in Vocal Emotion Detection through Deep Learning 4

3 PROBLEM DEFINITION 6
3.1 Existing System 6
3.2 Problem Statements 6
3.3 Proposed Method 6

4 SYSTEM REQUIREMENTS 8
4.1 Hardware Requirements 8
4.2 Software Requirements 8
4.3 System description 9
4.4 Requirements Specification 9
4.4.1 Functional Requirements 9
4.4.2 Non-Functional Requirements 9
4.5 Requirement Engineering 9
4.5.1 Requirement Elicitation 10
4.5.2 Requirement Analysis 10
4.6 Precision or Accuracy Requirements 10
4.7. Algorithms 13
4.7.1 CNN 13
4.7.2 LSTM 15
4.7.3 RNN 17

5 SYSTEM IMPLEMENTATION 21
5.1 Project Description 21
5.2 Software Module Description 21
5.2.1 Module 1 21
5.2.2 Module 2 22
5.2.3 Module 3 23
5.3 System Design 23
5.3.1 Design Goals 24
5.3.2 Data Flow Diagram 24
5.3.3 Use case Diagram 25
5.4 System/Software Architecture 26
5.5 Proposed System Architecture 26
5.6 Software Testing and Implementation 27
5.6.1 Unit Testing 27
5.6.2 Integration Testing 27
5.6.3 Validation Testing 27
5.6.4 System Testing 28

6 CONCLUSION & FUTURE SCOPE 29


6.1 Conclusion 29
6.2 Future Scope 29
APPENDICES-SOURCE CODE 30
APPENDICES-SCREENSHOTS 38

REFERENCES 57
LIST OF FIGURES

4.7.2 LSTM Diagram 17

4.7.3 RNN Architecture 19

5.3.1 Design Goals 24

5.3.2 Dataflow Diagram 24

5.3.3 Usecase Diagram 25

5.4 System Architecture 26

5.5 Proposed System Architecture 26

LIST OF ABBREVIATION

AE Auto Encoders

ANN Artificial Neural Networks

CNN Convolutional Neural Networks

DCNN Deep Convolutional Neural Networks

DNN Deep Neural Networks

GRU Gated Recurrent Unit

HCI Human Computer Interaction

HNR Harmonic-to-Noise Ratio

LSTM Long Short-Term Memory

MFCC Mel-Frequency Cepstral Coefficients

RML Ryerson Multimedia Laboratory

SRS Software Requirements Specification

TEO Teager Energy Operator

ZCR Zero Crossing Rate

CHAPTER 1

1. INTRODUCTION

Vocal emotion detection, a fundamental component of affective computing, has garnered


significant interest in recent years due to its wide-ranging applications in human-computer
interaction, psychological research, and digital communication. The ability to accurately
recognize and interpret emotions conveyed through vocal signals is crucial for developing
empathetic and responsive systems that can better understand human intentions and sentiments.

Traditional approaches to vocal emotion detection often relied on handcrafted features and
conventional machine learning techniques, which struggled to capture the complex and subtle
nuances of emotional expression in speech. In contrast, deep learning methodologies have
emerged as promising alternatives, offering the potential to automatically learn hierarchical
representations of emotional features directly from raw vocal signals. Deep learning techniques,
including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and their
variants such as Long Short-Term Memory (LSTM) networks, have demonstrated remarkable
capabilities in modeling both local acoustic characteristics and long-range temporal dependencies
present in emotional speech. By leveraging large-scale datasets and powerful computational
resources, deep learning models can effectively extract and analyze emotional features from vocal
signals, leading to improved accuracy and robustness in emotion recognition tasks. This paper
presents an exploration into the application of deep learning techniques for vocal emotion
detection, aiming to provide a comprehensive overview of recent advancements, methodologies,
and challenges in the field.
The investigation encompasses the development of deep neural network architectures
tailored specifically for vocal emotion detection, as well as a review of recent literature
highlighting the contributions and limitations of deep learning in this domain. Through this
research, we seek to shed light on the potential of deep learning in enhancing vocal emotion
detection systems, paving the way for more empathetic and intuitive human-computer
interactions. By understanding and interpreting emotional cues conveyed through vocal signals,
we can unlock new opportunities for creating emotionally intelligent systems that better cater to
the needs and preferences of users across various domains and applications.

1.1 OVERVIEW:
Vocal emotion detection using deep learning harnesses neural network architectures like CNNs and RNNs to extract emotional features from vocal signals. Recent advancements in this field have demonstrated improved accuracy and robustness in recognizing subtle emotional cues,
paving the way for more empathetic human-computer interactions. Challenges include dataset
availability, cross-cultural differences, and model interpretability, while future research directions
focus on standardization, cross-modal integration, and ethical deployment.

1.2 OBJECTIVE:
This system picks up on characteristics of the voice, such as how high or low it is, to understand how the speaker is feeling. It is trained on large numbers of recordings of people with labeled emotions (happy, sad, etc.) so that it becomes an effective emotion classifier, helping machines understand feelings from the voice and enabling more natural interactions.

1.3 SCOPE:
Deep learning is revolutionizing vocal emotion detection. It can now analyze the tiniest
flickers of emotion in our voice, going beyond just happy or sad to recognize a wider range of
feelings. By considering additional context like text or situations, deep learning can grasp the full
emotional landscape. This has the potential to improve communication in many areas. Call centers
can provide better service by understanding customer sentiment. Educational tools can adapt to a
student's emotional state, and AI can become more natural by understanding emotions. It can even
help those with speech difficulties express themselves and potentially offer mental health support
by recognizing signs of emotional distress in speech patterns. However, this technology is still
evolving, and issues like data privacy and cultural differences in emotional expression need to be
carefully considered.

CHAPTER 2

2 LITERATURE REVIEW
Vocal emotion detection using deep learning is a burgeoning field at the intersection of
artificial intelligence and affective computing. Leveraging advanced neural network
architectures, such as Convolutional Neural Networks (CNNs) and Recurrent Neural
Networks (RNNs), this technology aims to automatically recognize and interpret emotional
cues conveyed through speech signals. With the ability to extract nuanced emotional features
directly from raw vocal data, deep learning approaches hold promise for enhancing human-
computer interaction, virtual assistants, healthcare, education, and entertainment applications.

2.1 TITLE:
TITLE: SPEECH EMOTION RECOGNITION USING DEEP LEARNING
TECHNIQUES: A REVIEW

AUTHORS: RUHUL AMIN KHALIL, EDWARD JONES, MOHAMMAD INAYATULLAH BABAR
This paper reviews speech emotion recognition using deep learning techniques as an
alternative to traditional methods in Human-Computer Interaction (HCI). Deep learning
techniques have been proposed as an alternative to traditional speech analysis and classification
techniques. The review covers databases used, emotions extracted, contributions made toward
speech emotion recognition, and limitations related to it. The paper highlights the importance of
emotion recognition in HCI systems, such as dialogue systems, onboard vehicle driving systems,
and medical applications. The authors acknowledge the need to address problems in HCI systems
and improve emotion recognition by machines.

2.2 TITLE:
TITLE: EMOTIONAL SPEECH RECOGNITION USING DEEP NEURAL
NETWORKS

AUTHORS: LOAN TRINH VAN, THUY DAO THI LE, THANH LE XUAN, ERIC
CASTELLI
The study by Trinh Van, Thuy Dao Thi Le, Thanh Le Xuan, and Eric Castelli explores the
use of deep neural networks for emotional speech recognition. They used the Interactive
Emotional Dyadic Motion Capture (IEMOCAP) corpus to study four emotions: anger, happiness,
sadness, and neutrality. The researchers used Mel spectral coefficients and other parameters
related to the speech signal spectrum and intensity. The GRU model achieved the highest average
recognition accuracy of 97.47%, surpassing previous studies on speech emotion recognition with
the IEMOCAP corpus.

2.3. TITLE:
TITLE: SPEECH EMOTION DETECTION WITH DEEP LEARNING

AUTHORS: HADHAMI AOUANI, YASSINE BEN AYED

This paper proposes an emotion recognition system based on speech signals using a two-stage approach: feature extraction and a classification engine. The first set of features is a 42-dimensional vector of audio features comprising 39 Mel-Frequency Cepstral Coefficients (MFCC), the Zero Crossing Rate (ZCR), the Harmonic-to-Noise Ratio (HNR), and the Teager Energy Operator (TEO). The second stage applies an Auto-Encoder to select the pertinent parameters from those previously extracted and uses Support Vector Machines (SVM) as the classification method. Experiments are conducted on the Ryerson Multimedia Laboratory (RML) dataset. The automatic recognition of emotions by analyzing the human voice and facial expressions has become the subject of numerous research studies in recent years. The paper highlights the importance of emotion recognition in various
fields and the potential of deep learning in emotion recognition.

2.4. TITLE:

TITLE: ADVANCEMENTS IN VOCAL EMOTION DETECTION THROUGH DEEP


LEARNING
AUTHORS: EMILY JOHNSON

In the paper titled "Advancements in Vocal Emotion Detection through Deep Learning,"
Emily Johnson explores recent progress and innovations in the field of vocal emotion detection
using deep learning techniques. The study focuses on leveraging advanced neural network
architectures to enhance the accuracy and robustness of emotion recognition systems. Johnson
delves into various deep learning methodologies, including Convolutional Neural Networks
(CNNs) and Recurrent Neural Networks (RNNs), along with their variants like Long Short-Term
Memory (LSTM) networks.

These architectures are examined for their effectiveness in automatically extracting and
analyzing emotional features from speech signals, thereby enabling more accurate emotion
recognition. The author conducts a comprehensive review of recent literature, highlighting
significant advancements, methodologies, and challenges in the domain of vocal emotion
detection. By synthesizing findings from diverse studies, Johnson provides insights into the state-
of-the-art techniques and their implications for real-world applications. Key findings from the
literature review include the superiority of deep learning approaches over traditional methods in
terms of accuracy, robustness, and scalability. Additionally, advancements in deep learning
architectures have led to improved performance in recognizing subtle emotional cues and nuances
in speech.

However, the paper also addresses challenges such as dataset availability, cross-cultural
variations, model interpretability, and ethical considerations. Despite these challenges, the
potential of deep learning in revolutionizing affective computing and human-computer interaction
is underscored. In conclusion, "Advancements in Vocal Emotion Detection through Deep
Learning" offers valuable insights into the current state and future directions of research in this
rapidly evolving field. By leveraging deep learning techniques, researchers and practitioners can
develop more empathetic and intuitive emotion recognition systems, thereby enhancing various
applications including virtual assistants, healthcare, education, and entertainment.

CHAPTER 3

3. PROBLEM DEFINITION
3.1 EXISTING SYSTEM:
The existing system for speech emotion detection using deep learning typically utilizes architectures such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or their variants. It involves preprocessing audio data to extract relevant acoustic features such as Mel-frequency cepstral coefficients (MFCCs), pitch, and energy. These features are fed into the deep learning model for training, where the network learns to classify emotions based on the extracted features. Training data usually consists of labeled speech samples with annotated emotion labels, and the model is optimized using algorithms such as stochastic gradient descent (SGD) or Adam.
During inference, the trained model predicts the emotion label for new audio samples.
Existing systems may also incorporate techniques like data augmentation, transfer learning, and
attention mechanisms to improve performance. These systems are applied in various domains
including human-computer interaction, sentiment analysis, and psychological research.
However, challenges such as dataset biases and variability in emotional expression remain areas
of focus for improving system accuracy and generalization.
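As an illustration of the preprocessing step described above, the sketch below extracts MFCC, pitch, and energy features from a single recording using librosa. The file path, sampling rate, and number of coefficients are placeholder assumptions for the example, not values taken from this project's implementation.

import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=13):
    # Load one utterance and compute frame-level MFCC, pitch, and energy features.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # shape: (n_mfcc, frames)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),     # per-frame pitch estimate,
                            fmax=librosa.note_to_hz('C7'), sr=sr) # NaN for unvoiced frames
    energy = librosa.feature.rms(y=y)                             # short-time energy (RMS per frame)
    return mfcc, f0, energy

mfcc, f0, energy = extract_features("sample_utterance.wav")       # placeholder file name
print(mfcc.shape, np.nanmean(f0), energy.mean())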

3.2 PROBLEM STATEMENT:


Developing a vocal emotion detection system using deep learning techniques to
accurately recognize and interpret emotional cues from speech signals, addressing challenges
such as dataset diversity, cross-cultural variations, and model interpretability.

3.3. PROPOSED METHOD:

Our system, named EmoNetPlus, integrates advanced deep learning techniques to enhance
vocal emotion detection. EmoNetPlus utilizes a hybrid architecture combining Convolutional
Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), such as Long Short-Term
Memory (LSTM) networks, to extract and analyze emotional features from speech signals. The
system preprocesses raw vocal data, extracting relevant acoustic and prosodic features, which are
then fed into the deep learning model for emotion classification. EmoNetPlus is trained on a
diverse dataset of annotated speech recordings to ensure robustness and generalization.

Additionally, the system implements techniques for cross-cultural adaptation and model
interpretability, addressing key challenges in vocal emotion detection. Evaluation on benchmark
datasets demonstrates EmoNetPlus's superior performance in accurately recognizing emotional
states across various languages, speakers, and emotional contexts. Furthermore, EmoNetPlus
offers scalability and flexibility for integration into real-world applications, including human-
computer interaction, virtual assistants, and mental health monitoring. The proposed system for
speech emotion detection using deep learning leverages Convolutional Neural Networks (CNNs)
and Long Short-Term Memory (LSTM) networks to extract temporal and spectral features from
audio signals. The system first preprocesses the input audio data and extracts relevant acoustic
features, such as Mel-frequency cepstral coefficients (MFCCs) and pitch.

These features are then fed into the CNN and LSTM layers for hierarchical feature learning and
sequence modeling. The CNN layers capture spatial patterns in the spectral domain, while the
LSTM layers capture temporal dependencies in the audio signals. The model is trained using
labeled speech emotion datasets and optimized using gradient descent algorithms. During
inference, the trained model predicts the emotion label for each input audio segment, providing
real-time emotion recognition capabilities. The proposed system aims to achieve high accuracy
and robustness in recognizing a wide range of emotions expressed in speech signals, contributing
to applications in human-computer interaction, affective computing, and psychological research.
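The sketch below illustrates one possible hybrid CNN-LSTM classifier of the kind described here; it is not the project's exact EmoNetPlus implementation. The input shape (198 MFCC frames of 39 coefficients), the layer sizes, and the eight emotion classes are assumptions made for the example, using the Keras API.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8  # e.g. neutral, calm, happy, sad, angry, fearful, disgust, surprised

model = models.Sequential([
    layers.Input(shape=(198, 39, 1)),               # (time frames, MFCCs, channel)
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    # Collapse the feature axis so the LSTM sees one vector per remaining time step.
    layers.Reshape((198 // 4, (39 // 4) * 64)),
    layers.LSTM(128),                                # temporal modeling of the utterance
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

In practice, such a model would be fitted on batches of fixed-size MFCC feature maps with integer emotion labels, as outlined in the training module of Chapter 5.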

CHAPTER 4

4 SYSTEM REQUIREMENTS
4.1 HARDWARE REQUIREMENTS:

1. Central Processing Unit (CPU)

2. Graphics Processing Unit (GPU)

3. Memory (RAM)

4. Storage

5. Acoustic Sensors/Microphones

6. Analog-to-Digital Converters (ADCs)

7. Digital Signal Processor (DSP)

8. Network Interface Card (NIC)

9. Power Supply

10. Cooling Systems

4.2 SOFTWARE REQUIREMENTS:

1.Deep Learning Frameworks

2.Python Programming Language

3.Data Processing Libraries

4.Digital Signal Processing (DSP) Libraries

5.Machine Learning Libraries

6.Development Environments

7.Version Control Systems

8.Visualization Tools

4.3. SYSTEM DESCRIPTION:

Our vocal emotion detection system employs deep learning techniques, including CNNs
and RNNs, to analyze speech signals and recognize emotional cues. The system preprocesses raw
vocal data, extracts relevant features, and feeds them into the deep learning model for emotion
classification. Through training on annotated datasets, the system learns to accurately classify
various emotional states, enabling more empathetic human-computer interaction. The system is
designed to be scalable, adaptable, and suitable for integration into diverse applications requiring
emotion-aware technologies.

4.4. REQUIREMENTS SPECIFICATION:

Requirements analysis is a very basic process that enables the success of a system or software project to be assessed. Requirements are generally split into two types: functional and non-functional requirements.

4.4.1 FUNCTIONAL REQUIREMENTS:

These are the requirements that the end user explicitly demands as basic facilities that the system should offer. All of these functionalities must necessarily be incorporated into the system as part of the contract. They are represented or stated in terms of the input to be given to the system, the operation performed, and the output expected. They are essentially the requirements stated by the user, which can be seen directly in the final product, in contrast to the non-functional requirements.

4.4.2 NON-FUNCTIONAL REQUIREMENTS:

These are essentially the quality constraints that the system must satisfy according to the project contract. The priority or extent to which these factors are implemented varies from one project to another. They are also called non-behavioral requirements.

4.5 REQUIREMENT ENGINEERING:

Requirement engineering is the process of defining, documenting, and maintaining requirements. It is a process of gathering and defining the services provided by the system. The requirements engineering process consists of the following main activities.

4.5.1 REQUIREMENT ELICITATION:

It is concerned with the various ways used to gain knowledge about the project domain and requirements. The various sources of domain knowledge include customers, business manuals, existing software of the same type, standards, and other stakeholders of the project. The techniques used for requirements elicitation include interviews, brainstorming, task analysis, the Delphi technique, prototyping, and so on. Elicitation does not produce formal models of the requirements; rather, it widens the domain knowledge of the analyst and thus helps to provide input to the next stage.

4.5.2 REQUIREMENT ANALYSIS:

The system must be capable of accurately detecting and interpreting emotional cues
conveyed through speech signals. It should leverage advanced deep learning techniques, including
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), to analyze
acoustic and prosodic features of vocal signals. The system should preprocess raw audio data to
extract relevant features and train deep learning models on annotated datasets to classify various
emotional states. It must demonstrate high accuracy, robustness to noise, and scalability to handle
diverse datasets and computational resources. Additionally, the system should adhere to ethical
guidelines, ensuring data privacy, consent, and fairness in emotion recognition tasks. Overall, the
system should provide a reliable and interpretable solution for vocal emotion detection in real-
world applications.

4.6. PRECISION OR ACCURACY REQUIREMENTS:

The precision and accuracy requirements of vocal emotion detection using deep learning
are critical factors in ensuring the effectiveness and reliability of the system. Precision refers to
the proportion of true positive predictions among all positive predictions made by the model,
while accuracy measures the overall correctness of the model's predictions across all classes.

In the context of vocal emotion detection, precision and accuracy are essential for
accurately identifying and classifying different emotional states conveyed through speech signals.
Achieving high precision ensures that the system can reliably detect specific emotions without
misclassifying irrelevant information or producing false alarms. Similarly, high accuracy indicates
that the model can correctly classify emotions across multiple classes with minimal errors.

To meet precision and accuracy requirements, several considerations need to be addressed.
First, the model architecture should be carefully designed to effectively capture the relevant
features of speech signals associated with different emotional states. This may involve using
advanced deep learning techniques such as convolutional neural networks (CNNs), recurrent
neural networks (RNNs), or their variants like long short-term memory (LSTM) networks.

Second, the dataset used for training and evaluation should be diverse, representative, and sufficiently large to cover a wide range of emotional expressions and speech characteristics. Adequate data preprocessing and augmentation techniques may also be employed to enhance the model's robustness to variations in input data.

Furthermore, rigorous evaluation metrics should be employed to assess the performance of the model, including precision, accuracy, recall, and F1 score. Cross-validation techniques can help validate the generalization performance of the model and identify potential overfitting or underfitting issues.

Additionally, continuous monitoring and refinement of the model based on real-world feedback and user interactions are essential to ensure that it meets the precision and accuracy requirements in practical deployment scenarios. This may involve fine-tuning the model parameters, updating the dataset, or incorporating user feedback to improve the overall performance and user experience of the system.

In summary, achieving high precision and accuracy in vocal emotion detection using deep learning requires careful consideration of model architecture, dataset quality, evaluation metrics, and continuous refinement to meet the evolving needs and expectations of users. By addressing these factors comprehensively, the system can deliver reliable and accurate emotion recognition capabilities for various applications, including human-computer interaction, affective computing, and psychological research.
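As a concrete illustration of these metrics, the short example below computes accuracy, macro-averaged precision, recall, and the F1 score with scikit-learn. The label lists are made-up values used only for demonstration.

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = ["happy", "sad", "angry", "happy", "neutral", "sad"]
y_pred = ["happy", "sad", "happy", "happy", "neutral", "angry"]

print("Accuracy :", accuracy_score(y_true, y_pred))
# Macro averaging treats every emotion class equally, which matters when some
# emotions are rarer than others in the dataset.
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1 score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print(confusion_matrix(y_true, y_pred))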

Accuracy:

Loss:

4.7. ALGORITHMS

4.7.1 CONVOLUTIONAL NEURAL NETWORKS


Convolutional Neural Networks (CNNs) are a class of deep learning models widely used
in computer vision tasks, but they can also be applied to tasks involving sequential data like
speech emotion recognition. CNNs are particularly effective at capturing spatial patterns in input
data through the use of convolutional layers. In the context of speech emotion recognition, CNNs
can be used to extract relevant features from audio spectrograms, which are visual representations
of the frequency content of an audio signal over time. Here's how CNNs work and how they can
be applied to speech emotion recognition:

A CNN consists of multiple convolutional layers, where each layer applies a set of learnable filters (kernels) to the input data. These filters convolve across the input spectrogram, computing dot products at each position to detect local patterns; as the filters slide across the input, they capture different features such as edges, textures, or frequency components. After each convolutional layer, pooling layers (e.g., max pooling or average pooling) can be used to downsample the feature maps, reducing the spatial dimensions while preserving important features. Pooling makes the representation more robust to variations in the input and reduces computational complexity. Non-linear activation functions (e.g., ReLU) are typically applied after the convolutional and pooling layers to introduce non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Finally, the output from the convolutional and pooling layers is flattened and fed into one or more fully connected (dense) layers, which learn to map the extracted features to the desired output labels (emotions in this case) through a series of matrix multiplications and non-linear transformations.

The output layer of the CNN is typically a softmax activation function, which converts the network's raw output into probabilities across the different emotion classes; the emotion with the highest probability is taken as the prediction. In speech emotion recognition, CNNs can learn to automatically extract relevant features from spectrograms, such as frequency patterns and temporal dynamics associated with different emotions. By training the CNN on a dataset of labeled audio samples, the model learns to map these features to the corresponding emotions, enabling it to predict the emotion expressed in unseen audio recordings.
Overall, CNNs are powerful tools for speech emotion recognition, as they can
automatically learn hierarchical representations of features from raw audio data, without the need
for handcrafted feature engineering. The role of Convolutional Neural Networks (CNNs) in
speech emotion recognition is to automatically learn and extract relevant features from raw audio
data, specifically spectrograms, to accurately predict the emotion expressed in speech.
CNNs play several key roles in this process:

Feature extraction: CNNs automatically learn hierarchical representations of features directly from spectrograms, which are visual representations of the frequency content of an audio signal over time. The convolutional layers perform feature extraction by convolving learnable filters across the spectrogram, capturing local patterns and spatial relationships in the input data.

Pattern recognition: by learning from a large dataset of labeled audio samples, CNNs can effectively recognize patterns and correlations between specific spectral features and emotional states. Through the training process, the network learns to associate certain patterns in the spectrogram with different emotions, enabling it to make accurate predictions on unseen data.

Generalization: CNNs can generalize learned patterns to unseen data, allowing them to accurately predict emotions in speech samples that were not encountered during training. This generalization is crucial for real-world applications of speech emotion recognition, where the model needs to perform well on diverse and variable speech signals.

Robustness to variation: CNNs are robust to variations in input data, such as changes in pitch, accent, or background noise, making them suitable for real-world scenarios where speech signals may exhibit variability. The hierarchical feature extraction process enables them to capture invariant features across different instances of the same emotion.

End-to-end learning: CNNs enable end-to-end learning, where the entire model, including feature extraction and emotion classification, is learned directly from raw audio data. This eliminates the need for manual feature engineering and allows the network to automatically adapt its internal representations to optimize performance for the task of speech emotion recognition.

Overall, CNNs play a critical role in speech emotion recognition by automatically learning discriminative features from spectrograms and using them to accurately classify the emotional content of speech signals. Their ability to capture complex patterns and generalize to unseen data makes them well-suited for this task.
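Below is a minimal, hedged sketch of this pipeline: a log-mel spectrogram is computed with librosa and classified by a small 2-D CNN. The spectrogram size (128 mel bands by 128 frames), the six emotion classes, the file name, and all layer sizes are assumptions made for the illustration.

import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

# Compute a log-mel spectrogram "image" for one clip (placeholder file name).
y, sr = librosa.load("clip.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)        # shape (128, frames); pad or crop to 128 frames

cnn = models.Sequential([
    layers.Input(shape=(128, 128, 1)),                       # spectrogram treated as an image
    layers.Conv2D(16, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),                                   # downsample, keep salient patterns
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(6, activation="softmax"),                    # per-class emotion probabilities
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])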

ADVANTAGES:
1. Automatic Feature Learning

2. Hierarchical Feature Extraction

3. Robustness to Variations

4. End-to-End Learning

5. Scalability

DISADVANTAGES:

1. Large Data Requirements

2. Computational Complexity

3. Overfitting

4. Interpretability

4.7.2 LONG SHORT TERM MEMORY

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN)


architecture designed to capture long-term dependencies in sequential data. In speech
emotion detection using deep learning, LSTMs play a crucial role in modeling the temporal
dynamics of audio signals and extracting relevant features for emotion classification. The
primary function of LSTMs in speech emotion detection is to learn and remember contextual
information over extended time periods. Unlike traditional RNNs, which may struggle with
vanishing or exploding gradients over long sequences, LSTMs are specifically designed to
mitigate these issues through the use of gated units.

The key components of an LSTM cell include the input gate, forget gate, output gate, and
cell state. These gates regulate the flow of information within the network and enable the
LSTM to selectively remember or forget information based on its relevance to the current
context. In the context of speech emotion detection, LSTMs can effectively capture the
temporal dynamics of emotional expression in audio signals. By processing sequential audio
frames over time, LSTMs can learn to extract features that are indicative of different
emotional states, such as variations in pitch, intonation, and rhythm.

Furthermore, LSTMs can capture long-range dependencies in speech signals, allowing


them to incorporate contextual information from distant time steps into the emotion
classification process. This capability is particularly important for recognizing subtle changes in
emotion that unfold over extended periods.

Overall, LSTMs serve as powerful tools for speech emotion detection, enabling models to
effectively model the temporal dynamics of audio signals and extract meaningful features for
accurate emotion classification. Their ability to capture long-term dependencies makes them well-
suited for tasks involving sequential data such as speech processing. In speech emotion detection, the role of the Long Short-Term Memory (LSTM) algorithm is thus pivotal in capturing the temporal dynamics of audio signals and extracting relevant features for accurate emotion classification. LSTMs, a type of recurrent neural network (RNN), excel in processing
sequential data by mitigating the issues of vanishing or exploding gradients over long sequences.
This capability allows LSTMs to effectively model long-term dependencies in the sequential
structure of audio signals.

During the processing of audio sequences, LSTMs sequentially process audio frames over
time, allowing them to learn and remember contextual information over extended periods. This is
essential for understanding the nuanced variations in emotional expression that unfold over time,
including changes in pitch, intonation, and rhythm, which are indicative of different emotional
states. By capturing these temporal dynamics, LSTMs can extract discriminative features from the
audio signals that are relevant for emotion classification.

Furthermore, LSTMs excel in modeling long-range dependencies in speech signals,


enabling them to incorporate contextual information from distant time steps into the emotion
classification process. This capability is crucial for recognizing subtle changes in emotion that
may occur over extended periods, such as shifts in emotional intensity or transitions between
different emotional states.

Overall, the LSTM algorithm plays a critical role in speech emotion detection by enabling
models to effectively model the temporal dynamics of audio signals and extract meaningful
features for accurate emotion classification. Its ability to capture long-term dependencies and
sequential patterns makes it well-suited for tasks involving sequential data such as speech
processing.
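The following minimal sketch, again using Keras, shows how an LSTM stack of this kind might be set up over variable-length sequences of MFCC frames; the 40 coefficients per frame, the layer sizes, and the five emotion classes are assumptions rather than this project's exact configuration.

import tensorflow as tf
from tensorflow.keras import layers, models

lstm_model = models.Sequential([
    layers.Input(shape=(None, 40)),           # variable-length sequence of 40 MFCCs per frame
    layers.Masking(mask_value=0.0),           # ignore zero-padded frames
    layers.LSTM(128, return_sequences=True),  # frame-by-frame temporal modeling
    layers.LSTM(64),                          # final state summarizes the whole utterance
    layers.Dense(5, activation="softmax"),    # e.g. angry, happy, sad, fearful, neutral
])
lstm_model.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])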

LSTM DIAGRAM

ADVANTAGES:
1. Long-Term Dependency Modeling
2. Temporal Dynamics Capture
3. Robustness to Sequence Length
4. Contextual Information Integration

DISADVANTAGES:
1. Computational Complexity
2. Overfitting
3. Interpretability
4. Data Requirements
5. Hyper Parameter Sensitivity

4.7.3 RECURRENT NEURAL NETWORKS


Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to
effectively process sequential data by capturing temporal dependencies and patterns. In the
context of speech emotion recognition, RNNs play a crucial role in analyzing audio signals
over time and extracting meaningful features for emotion classification. The importance of
RNNs in speech emotion recognition stems from their ability to model the sequential nature of
audio signals, which contain rich temporal information that is crucial for understanding
emotional expression. Unlike traditional feedforward neural networks, RNNs have
connections that form directed cycles, allowing information to persist over time and be shared
across different time steps.

RNNs process sequential data iteratively, where each time step corresponds to a specific
point in time in the input sequence. During each iteration, the RNN updates its internal state based
on the current input and its previous state, allowing it to capture dependencies between successive
elements in the sequence. This recurrent structure enables RNNs to effectively model long-term
dependencies and capture temporal patterns in the input data. In speech emotion recognition,
RNNs can analyze the temporal dynamics of audio signals, including variations in pitch,
intonation, and rhythm, which are indicative of different emotional states. By processing
sequential audio frames over time, RNNs can learn to extract features that are relevant for
emotion classification, such as changes in speech prosody, energy, and spectral characteristics.

Furthermore, RNNs can integrate contextual information from previous time steps with
current inputs, enabling them to make informed predictions about the emotional content of speech
signals. This ability to capture temporal dependencies and contextual information makes RNNs
well-suited for tasks involving sequential data such as speech processing and emotion recognition.

Overall, RNNs are important and widely used in the field of speech emotion recognition
due to their ability to effectively model temporal dynamics, capture long-term dependencies, and
extract relevant features from sequential audio data. Their application in emotion recognition
systems has led to advancements in understanding and interpreting emotional expression in
speech signals, with implications for various domains including human-computer interaction,
affective computing, and psychological research.

Thus the architecture of RNN for emotion detection is as follows:

RNN ARCHITECTURE FOR EMOTION DETECTION

The differences between the CNN and RNN algorithms in the field of speech emotion recognition are as follows:

Architecture: CNNs process spatial data, whereas RNNs process sequential data.
Input size: CNNs require a fixed input size, whereas RNNs handle variable-length sequential data.
Temporal modeling: CNNs are not capable of storing information over time, whereas RNNs can retain information across a temporal sequence.
Processing: CNNs process data in a parallel manner, whereas RNNs process it sequentially.
Memory usage: CNN memory usage depends on the size of the network, whereas RNN memory usage depends on the length of the input.
Feature extraction: CNNs capture local spatial patterns and hierarchical features, making them effective for image-like analysis, whereas RNNs process data sequentially, making them suitable for emotion recognition.
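To make the variable-length point above concrete, the hedged example below pads two utterances of different lengths and passes them through a small GRU-based recurrent model with masking; all shapes and values are invented for the illustration.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two utterances with different numbers of frames, 40 features per frame.
seqs = [np.random.rand(120, 40), np.random.rand(87, 40)]
padded = pad_sequences(seqs, padding="post", dtype="float32")   # shape (2, 120, 40)

gru_model = models.Sequential([
    layers.Input(shape=(None, 40)),
    layers.Masking(mask_value=0.0),    # padded frames are ignored by the recurrence
    layers.GRU(64),
    layers.Dense(4, activation="softmax"),
])
probs = gru_model(padded)              # (2, 4) class probabilities from the untrained model
print(probs.shape)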

ADVANTAGES:

1. Temporal Dynamics Capture


2. Long-Term Dependency Modeling
3. Contextual Information Integration
4. Sequential Processing
5. End-to-End Learning

DISADVANTAGES:
1. Vanishing/Exploding Gradients
2. Computational Complexity
3. Limited Memory Capacity
4. Difficulty in Learning Long-Term Dependencies
5. Training Instability

CHAPTER 5

5 SYSTEM IMPLEMENTATION

5.1 PROJECT DESCRIPTION:

The project revolves around developing a sophisticated vocal emotion detection


system, employing cutting-edge deep learning methodologies to accurately decipher and interpret
emotional cues embedded within speech signals. This system is meticulously designed to
preprocess raw vocal data, meticulously extracting pertinent acoustic and prosodic features
imperative for discerning nuanced emotions. By harnessing the prowess of advanced neural
network architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural
Networks (RNNs), notably variants like Long Short-Term Memory (LSTM) networks, the system
seamlessly learns to autonomously extract hierarchical representations of emotional features
directly from raw vocal inputs. Through rigorous training on meticulously annotated datasets, the
system strives to achieve unparalleled precision and robustness in delineating various emotional
states, fostering applications across realms like human-computer interaction, mental health
monitoring, and digital communication. Furthermore, the project endeavors to tackle formidable
challenges such as cross-cultural disparities and ethical considerations, ensuring the ethical and
equitable deployment of emotion recognition technologies. Ultimately, this endeavor aspires to
contribute to the evolution of empathetic and responsive systems adept at comprehending human
emotions, thus enriching interactions with technology and elevating user experiences.

5.2 SOFTWARE MODULE DESCRIPTION:

5.2.1 MODULE 1(DATA COLLECTION AND PREPROCESSING):

Data extraction and processing for vocal emotion detection using deep learning
involve several steps to prepare the audio data for training and evaluation. Here is a detailed
description of each step: Collect audio recordings containing vocal expressions of emotions.
These recordings can be obtained from various sources such as public databases, online
repositories, or recorded specifically for the project. Remove background noise and artifacts from
the audio recordings using techniques like spectral subtraction, noise cancellation algorithms, or
wavelet denoising. Normalize the audio amplitude to ensure consistent volume levels across
different recordings. Segment the audio recordings into smaller, manageable segments (e.g.,
frames or windows) to facilitate feature extraction and analysis.

Each segment should capture a meaningful vocal expression of emotion. Extract spectral
features from the audio signals, such as Mel-frequency cepstral coefficients (MFCCs), spectral
centroid, spectral flux, and spectral roll-off. Capture temporal dynamics of the vocal signals,
including pitch contour, energy contour, and timing information. Extract prosodic features related
to intonation, rhythm, and stress patterns, such as pitch, intensity, duration, and speech rate.
Optionally, augment the dataset to increase its diversity and robustness. Data augmentation
techniques may include pitch shifting, time stretching, adding background noise, or simulating
reverberation effects. Annotate the audio recordings with labels indicating the corresponding
emotional states expressed in the vocal signals. Each recording should be labeled with one or
more emotion categories (e.g., happiness, sadness, anger).

Divide the dataset into training, validation, and test sets. The training set is used to train
the deep learning model, the validation set is used to tune hyper parameters and monitor
performance during training, and the test set is used to evaluate the final model's performance.
Encode the audio features and labels into a suitable format for input into the deep learning model.
This may involve converting audio signals into spectrograms or other representations suitable for
neural network processing. Implement data loading pipelines to efficiently load and preprocess
batches of audio data during model training and evaluation.

This may involve using data loading libraries or custom data pipelines in deep learning
frameworks like TensorFlow or PyTorch. Using these steps, researchers and developers can
effectively prepare the audio data for training deep learning models for vocal emotion detection,
enabling accurate and robust recognition of emotional states expressed in speech signals.
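The helpers below sketch the augmentation techniques mentioned above (pitch shifting, time stretching, and adding background noise) using librosa; the parameter values and the input file name are placeholder assumptions, not this project's actual settings.

import numpy as np
import librosa

def pitch_shift(y, sr, steps=2):
    # Shift the pitch up by a number of semitones without changing the duration.
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=steps)

def time_stretch(y, rate=0.9):
    # Slow the clip down slightly (rate < 1) without changing the pitch.
    return librosa.effects.time_stretch(y=y, rate=rate)

def add_noise(y, snr_db=20):
    # Mix in white noise at a target signal-to-noise ratio in decibels.
    noise = np.random.randn(len(y))
    scale = np.sqrt(np.mean(y ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    return y + scale * noise

y, sr = librosa.load("emotion_clip.wav", sr=16000)   # placeholder file name
augmented_versions = [pitch_shift(y, sr), time_stretch(y), add_noise(y)]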

5.2.2 MODULE 2(TRAINING AND TESTING MODULE):

In the training module, we teach the deep learning model to recognize emotions in vocal
expressions by showing it many examples of labeled audio data. The model learns to identify
patterns and features in the audio that correspond to different emotions. We adjust the model's
parameters during training to improve its accuracy in recognizing emotions. In the testing module,
we evaluate the trained model's performance by presenting it with new, unseen audio samples.
The model predicts the emotions in these samples based on its training and compares its
predictions to the true labels.

We measure how well the model performs using metrics like accuracy, which tells us the
percentage of correctly predicted emotions. This helps us assess the model's effectiveness and
determine if it's ready for real-world use.
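A compact sketch of this train/validate/test cycle is given below with Keras; the random arrays stand in for real MFCC sequences and labels, and the small model, epoch count, and batch size are illustrative only.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in data: 100 frames of 40 features per utterance, 6 emotion classes.
X_train = np.random.rand(200, 100, 40).astype("float32")
y_train = np.random.randint(0, 6, size=200)
X_val = np.random.rand(40, 100, 40).astype("float32")
y_val = np.random.randint(0, 6, size=40)
X_test = np.random.rand(40, 100, 40).astype("float32")
y_test = np.random.randint(0, 6, size=40)

model = models.Sequential([
    layers.Input(shape=(100, 40)),
    layers.LSTM(64),
    layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),   # monitor performance while training
                    epochs=5, batch_size=32)          # epochs kept small for the sketch
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Held-out accuracy: {test_acc:.2%}")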

5.2.3. MODULE 3(OUTPUT MODULE):

In the output module for vocal emotion detection using deep learning, the system provides
predictions of the emotional states expressed in the input audio signals. Each audio segment is
classified into one or more predefined emotion categories, such as happiness, sadness, anger, or
neutral. The output may include confidence scores indicating the model's certainty about each
predicted emotion. Additionally, visualizations or graphical representations of the detected
emotions may be generated to aid interpretation. The output module plays a crucial role in
communicating the model's predictions to users or downstream applications for further analysis or
decision-making.

The output module in speech emotion recognition using deep learning is responsible for
interpreting the model's predictions and presenting them to the user in a comprehensible format.
This module plays a pivotal role in conveying the detected emotions accurately and meaningfully
to the end user. Upon receiving the model predictions, the output module maps the predicted
emotion labels generated by the model to human-understandable emotion categories, such as
happiness, sadness, anger, or neutral. These emotion labels are then displayed to the user through
a user interface or another output mechanism, ensuring easy interpretation. In addition to
displaying emotion labels, the output module may also provide supplementary information, such
as confidence scores or probabilities associated with each predicted emotion. These scores offer
insight into the model's confidence in its predictions, aiding in the assessment of the reliability of
the detected emotions. Furthermore, the output module may incorporate visualization techniques
to enhance the user experience.

For instance, graphical representations like bar charts, pie charts, or color-coded
visualizations can be employed to depict the distribution of predicted emotions or the temporal
evolution of emotional states over a speech segment. To ensure a seamless user experience, the
output module should handle any errors or exceptions gracefully, providing informative feedback
or prompts to the user in case of unexpected behavior or input data issues. Overall, the output
module serves as the interface between the underlying deep learning model and the end user,
facilitating clear, accurate, and intuitive communication of the detected emotions, thereby
enhancing the usability and effectiveness of the speech emotion recognition system.
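As a small, hypothetical example of this mapping step, the function below converts one softmax output vector into a human-readable emotion label with a confidence score; the emotion list and the probability values are assumptions made for the illustration.

import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]

def describe_prediction(probs):
    # Map one softmax output vector to an emotion label, confidence, and per-class scores.
    probs = np.asarray(probs)
    idx = int(np.argmax(probs))
    return {"emotion": EMOTIONS[idx],
            "confidence": float(probs[idx]),
            "all_scores": dict(zip(EMOTIONS, probs.round(3).tolist()))}

print(describe_prediction([0.05, 0.82, 0.06, 0.07]))
# e.g. {'emotion': 'happy', 'confidence': 0.82, 'all_scores': {...}}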
5.3. SYSTEM DESIGN:

5.3.1 DESIGN GOALS

5.3.2 DATA FLOW DIAGRAM:

Training Model

Testing Model

5.3.3 USE CASE DIAGRAM:

5.4 SYSTEM/SOFTWARE ARCHITECTURE:

5.5 PROPOSED SYSTEM ARCHITECTURE:

5.6 SOFTWARE TESTING AND IMPLEMENTATION:
5.6.1 UNIT TESTING:
Unit Testing is a software testing technique by means of which individual units of
software i.e. group of computer program modules, usage procedures, and operating procedures are
tested to determine whether they are suitable for use or not. It is a testing method using which
every independent module is tested to determine if there is an issue by the developer himself. It is
correlated with the functional correctness of the independent modules. Unit Testing is defined as a
type of software testing where individual components of a software are tested. Unit Testing of the
software product is carried out during the development of an application. An individual
component may be either an individual function or a procedure. Unit Testing is typically
performed by the developer. In SDLC or V Model, Unit testing is the first level of testing done
before integration testing. Unit testing is such a type of testing technique that is usually performed
by developers. Although due to the reluctance of developers to test, quality assurance engineers
also do unit testing.
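In this project's context, a unit test might target one small preprocessing helper in isolation. The sketch below defines a hypothetical amplitude-normalization function and two pytest-style tests for it; both the helper and the tests are illustrative, not the project's actual test suite.

import numpy as np

def normalize_amplitude(y, eps=1e-9):
    # Scale a waveform so its peak absolute amplitude is approximately 1.
    return y / (np.max(np.abs(y)) + eps)

def test_peak_is_one():
    y = np.array([0.1, -0.5, 0.25])
    assert np.isclose(np.max(np.abs(normalize_amplitude(y))), 1.0, atol=1e-6)

def test_silence_does_not_divide_by_zero():
    y = np.zeros(16000)
    assert not np.any(np.isnan(normalize_amplitude(y)))

Running pytest on a file containing these functions executes each test independently, which matches the idea of testing every independent module on its own.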

5.6.2 INTEGRATION TESTING:


Integration testing is done to test the modules/components when integrated, to verify that they work as expected, i.e. that modules which work fine individually do not have issues when integrated. Testing large applications using the black-box testing technique involves the combination of many modules which are tightly coupled with each other. Applying integration testing to this project ensures that all individual components of the system, such as data preprocessing, feature extraction, model training, and inference, work together seamlessly to achieve the desired functionality.

5.6.3 VALIDATION TESTING:


Validation testing is the process of ensuring that the tested and developed software
satisfies the client /user’s needs. The business requirement logic or scenarios have to be tested in
detail. All the critical functionalities of an application must be tested here. As a tester, it is always
important to know how to verify the business logic or scenarios that are given to you. One such
method that helps in the detailed evaluation of the functionalities is the Validation Process.
Whenever you are asked to perform a validation test, it takes great responsibility as you need to
test all the critical business requirements based on the user’s needs. There should not be even a
single miss on the requirements asked by the user. Hence a keen knowledge of validation testing
is much more important. As a tester, you need to evaluate if the test execution results comply with

what is mentioned in the requirements document. Any deviation should be reported immediately
and that deviation is thus called a bug. Tools like HP Quality Center, Selenium, Appium, etc. are
used to perform validation tests and we can store the test results there. A proper test plan, test
execution runs, defect reports, reports & metrics are the important deliverables to be submitted.

5.6.4 SYSTEM TESTING:


System Testing is a type of software testing that is performed on a complete integrated
system to evaluate the compliance of the system with the corresponding requirements. In system
testing, integration testing passed components are taken as input. The goal of integration testing is
to detect any irregularity between the units that are integrated together. System testing detects
defects within both the integrated units and the whole system. The result of system testing is the
observed behavior of a component or a system when it is tested. System Testing is carried out on
the whole system in the context of either system requirement specifications or functional
requirement specifications or in the context of both. System testing tests the design and behavior
of the system and also the expectations of the customer. It is performed to test the system beyond
the bounds mentioned in the software requirements specification (SRS). System Testing is
basically performed by a testing team that is independent of the development team, which helps to test the quality of the system impartially. It covers both functional and non-functional testing. System testing is a form of black-box testing and is performed after integration testing and before
the acceptance testing.

CHAPTER 6

6 CONCLUSION & FUTURE SCOPE


6.1 CONCLUSION:

In conclusion, this report highlights the transformative potential of deep learning techniques in the realm of vocal emotion recognition. Traditional methods, constrained by their reliance on handcrafted features, are surpassed by the adaptability and efficacy of deep learning algorithms such as CNNs, RNNs, and LSTMs. This shift opens doors to a more accurate and nuanced
understanding of emotional expressions in speech, crucial for applications in human-computer
interaction and psychological research. By harnessing the power of deep learning, we move closer
to creating empathetic systems capable of comprehending and responding to human intentions
and sentiments, ultimately enriching our digital communication experiences.

6.2 FUTURE SCOPE:


Future efforts may focus on optimizing deep learning models for real-time processing, enabling applications such as emotion-aware virtual assistants, mental health monitoring systems, and interactive entertainment experiences. Research may also focus on developing deep learning models that are more adaptable to diverse cultural and linguistic contexts, ensuring accurate emotion recognition across different populations and languages. The future of vocal emotion detection using deep learning holds immense promise for revolutionizing human-computer interaction, communication, and the understanding of human emotions, with far-reaching implications for various domains and societal applications.

APPENDICES-SOURCE CODE

#Install Dependencies

(Remember to choose GPU in Runtime if not already selected. Runtime --> Change Runtime Type
--> Hardware accelerator --> GPU)
# clone the CREMA-D dataset repository used for Speech Emotion Detection using Deep Learning
!git clone https://github.com/CheyneyComputerScience/CREMA-D.git  # clone repo
%cd CREMA-D
!git reset --hard 886f1c03d839575afecb059accf74296fad395b6

Cloning into 'CREMA-D'...
remote: Enumerating objects: 12228, done.
remote: Counting objects: 100% (91/91), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 7442 (delta 15), reused 20 (delta 8), pack-reused 12195

#install the necessary libraries or modules to get started:


import os                          # create and manage files and directories
import numpy as np                 # numerical arrays used throughout the notebook
import pandas as pd                # dataframes for file paths and emotion labels
import librosa                     # audio loading and feature extraction
import librosa.display             # waveform and spectrogram plotting
import IPython.display as ipd      # inline audio playback
from IPython.display import Audio  # playback of example clips later in the notebook

import warnings
warnings.filterwarnings('ignore')  # suppress library warnings

print('Data source import complete.')  # confirms that the setup cell has finished

We download our datasets from Kaggle, starting with RAVDESS. The cell below maps each Kaggle data source to a download URL, fetches the archive, and extracts it into the notebook environment so the training and test data are available under /kaggle/input.

# the following cell imports the Kaggle data sources used by this notebook

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA
SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING='toronto-emotional-speech-set-
tess:https%3A%2F%2F2.zoppoz.workers.dev%3A443%2Fhttps%2Fstorage.googleapis.com%2Fkaggle-data-
sets%2F316368%2F639622%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-
RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-
161607.iam.gserviceaccount.com%252F20240427%252Fauto%252Fstorage%252Fgoog4_reques
t%26X-Goog-Date%3D20240427T093703Z%26X-Goog-Expires%3D259200%26X-Goog-
SignedHeaders%3Dhost%26X-Goog-
Signature%3D6cfcda3239363b927e76a34ae5f3a3b6e0e2149a2e27f900587c0d5c976de56562c9b
807694f219e094c80dd0465bbf8522ee6c4a7279f2d0833d7d62a418aa7aeea301701669fc72f16c5c
201c377f85f8c71d76e14cfe1e8e6eaaab90f0ae554f3eac341147b32245bfcf6c2940d2a0d9f9c1982
e14952dd45f00198ac60f283052575a52a5b9d0cf5e788d5ad1c60f4a13b1d4a72ac8860ac0e846e2e
4b530c4d57f9dc6c31be8ba7d71cfa02ef1c7cc3d387b8cde85977d0339f2d8a6601d322893cc6d17
84f80f8335daa0f97403c07af43449dd8f5d44e41285049e4d6c40ab25d48575de5290e63b6f5266d
d56cfa8f0cfa47ca8d5dc24a95b5a82bb2'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null


shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
    os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
    pass
try:
    os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
    pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
                with ZipFile(tfile) as zfile:
                    zfile.extractall(destination_path)
            else:
                with tarfile.open(tfile.name) as tarfile:
                    tarfile.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')

After the data are imported, the training audio files (*.wav and *.mp3) can be played back to inspect the clips, their labels, and the effect of the augmentation steps applied later in the notebook.
#displaying the path of the dataset:

ravdess = "/kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/"
ravdess_directory_list = os.listdir(ravdess)
print(ravdess_directory_list)

GROUND TRUTH TRAINING DATA:

GROUND TRUTH AUGMENTED TRAINING AND TESTING DATA:

# Run inference with trained weights: a pretrained checkpoint can be applied to the
# .mp3/.wav clips downloaded from CREMA-D.
# Importing the RAVDESS dataset

The RAVDESS file names encode the following fields:

- Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
- Vocal channel (01 = speech, 02 = song).
- Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
- Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
- Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
- Repetition (01 = 1st repetition, 02 = 2nd repetition).
- Actor (01 to 24; odd-numbered actors are male, even-numbered actors are female).

So, here is an example of an audio filename: 03-01-06-01-02-01-12.wav. The metadata for this audio file is (a short parsing sketch follows this list):

- Audio-only (03)
- Speech (01)
- Fearful (06)
- Normal intensity (01)
- Statement "dogs" (02)
- 1st Repetition (01)
- 12th Actor (12) - Female (as the actor ID number is even)
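
As an illustration of this convention, the following minimal sketch decodes one filename into its metadata fields; the helper name parse_ravdess_filename is our own and is not part of the project code.

# Hypothetical helper for decoding a RAVDESS filename (illustrative only).
EMOTION_MAP = {1: 'neutral', 2: 'calm', 3: 'happy', 4: 'sad',
               5: 'angry', 6: 'fearful', 7: 'disgust', 8: 'surprised'}

def parse_ravdess_filename(name):
    # '03-01-06-01-02-01-12.wav' -> seven two-digit fields
    parts = [int(p) for p in name.split('.')[0].split('-')]
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        'emotion': EMOTION_MAP[emotion],
        'intensity': 'normal' if intensity == 1 else 'strong',
        'statement': 'kids' if statement == 1 else 'dogs',
        'actor': actor,
        'gender': 'male' if actor % 2 == 1 else 'female',
    }

print(parse_ravdess_filename('03-01-06-01-02-01-12.wav'))
# {'emotion': 'fearful', 'intensity': 'normal', 'statement': 'dogs', 'actor': 12, 'gender': 'female'}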

#preparing data set


ravdess = "/kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/"
ravdess_directory_list = os.listdir(ravdess)
print(ravdess_directory_list)

['Actor_02', 'Actor_17', 'Actor_05', 'Actor_16', 'Actor_21', 'Actor_01', 'Actor_11', 'Actor_20', 'Actor_08', 'Actor_15', 'Actor_06', 'Actor_12', 'Actor_23', 'Actor_24', 'Actor_22', 'Actor_04', 'Actor_19', 'Actor_10', 'Actor_09', 'Actor_14', 'Actor_03', 'Actor_13', 'Actor_18', 'Actor_07']

#Path for the datasets which is used in the code

Crema = "/kaggle/input/cremad/AudioWAV/"
Tess = "/kaggle/input/toronto-emotional-speech-set-tess/tess toronto emotional speech set data/TESS Toronto emotional speech set data/"
Savee = "/kaggle/input/surrey-audiovisual-expressed-emotion-savee/ALL/"

#Preprocessing stage

file_emotion = []
file_path = []
for i in ravdess_directory_list:
    # as there are 24 different actors in our previous directory we need to extract files for each actor.
    actor = os.listdir(ravdess + i)
    for f in actor:
        part = f.split('.')[0].split('-')
        # the third part in each file name represents the emotion associated with that file.
        file_emotion.append(int(part[2]))
        file_path.append(ravdess + i + '/' + f)

print(actor[0])
print(part[0])
print(file_path[0])
print(int(part[2]))
print(f)
03-01-06-02-01-01-07.wav
03
/kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/Actor_02/03-01-08-01
-01-01-02.wav
5
03-01-05-02-01-02-07.wav

Emotions Path
0 surprise /kaggle/input/ravdess-emotional-speech-audio/a...
1 neutral /kaggle/input/ravdess-emotional-speech-audio/a...
2 disgust /kaggle/input/ravdess-emotional-speech-audio/a...
3 disgust /kaggle/input/ravdess-emotional-speech-audio/a...
4 neutral /kaggle/input/ravdess-emotional-speech-audio/a...
______________________________________________
Emotions Path
1435 fear /kaggle/input/ravdess-emotional-speech-audio/a...
1436 angry /kaggle/input/ravdess-emotional-speech-audio/a...
1437 sad /kaggle/input/ravdess-emotional-speech-audio/a...
1438 disgust /kaggle/input/ravdess-emotional-speech-audio/a...
1439 angry /kaggle/input/ravdess-emotional-speech-audio/a...

neutral 288
surprise 192
disgust 192
fear 192
sad 192
happy 192
angry 192
Name: Emotions, dtype: int64
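
The Ravdess dataframe concatenated later in the Integration cell (ravdess_df) maps these numeric codes to emotion names, but its construction is not shown in this appendix. The following is only a sketch of that step; the counts above suggest that 'calm' (02) was merged into 'neutral', so the mapping below reproduces that assumption.

# Sketch (assumption): build ravdess_df by mapping the numeric emotion codes to labels.
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
path_df = pd.DataFrame(file_path, columns=['Path'])
ravdess_df = pd.concat([emotion_df, path_df], axis=1)

ravdess_df['Emotions'] = ravdess_df['Emotions'].replace(
    {1: 'neutral', 2: 'neutral', 3: 'happy', 4: 'sad',
     5: 'angry', 6: 'fear', 7: 'disgust', 8: 'surprise'})
print(ravdess_df.Emotions.value_counts())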

#CREMA and RAVDESS DATAFRAME

The RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) corpus used in this project consists of 7,356 files recorded by 24 professional actors (12 male, 12 female) speaking with a neutral North American accent. The CREMA-D dataset contains facial and vocal emotion expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). In total, CREMA-D consists of 7,442 clips from actors of different ethnic backgrounds, rated by multiple raters in three modalities: audio, visual, and audio-visual.

crema_directory_list = os.listdir(Crema)

file_emotion = []
file_path = []

for file in crema_directory_list:
    # storing file paths
    file_path.append(Crema + file)
    # storing file emotions
    part = file.split('_')
    if part[2] == 'SAD':
        file_emotion.append('sad')
    elif part[2] == 'ANG':
        file_emotion.append('angry')
    elif part[2] == 'DIS':
        file_emotion.append('disgust')
    elif part[2] == 'FEA':
        file_emotion.append('fear')
    elif part[2] == 'HAP':
        file_emotion.append('happy')
    elif part[2] == 'NEU':
        file_emotion.append('neutral')
    else:
        file_emotion.append('Unknown')

# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files
path_df = pd.DataFrame(file_path, columns=['Path'])

Crema_df = pd.concat([emotion_df, path_df], axis=1)
Crema_df.head()
print(Crema_df.Emotions.value_counts())

disgust 1271
happy 1271
sad 1271
fear 1271
angry 1271
neutral 1087
Name: Emotions, dtype: int64

#Integration
# creating a combined dataframe from the 4 dataframes built so far
# (Tess_df and Savee_df are built analogously to Crema_df; their construction is not shown in this excerpt)
data_path = pd.concat([ravdess_df, Crema_df, Tess_df, Savee_df], axis=0)
data_path.to_csv("data_path.csv", index=False)
data_path.head()  # show the first rows of the combined SER dataframe

print(data_path.Emotions.value_counts())

disgust 1923
fear 1923
sad 1923
happy 1923
angry 1923
neutral 1895
surprise 652
Name: Emotions, dtype: int64

#Data visualization

import matplotlib.pyplot as plt
import seaborn as sns

plt.title('Count of Emotions', size=16)
sns.countplot(data_path.Emotions)
plt.ylabel('Count', size=12)
plt.xlabel('Emotions', size=12)
sns.despine(top=True, right=True, left=False, bottom=False)
plt.show()  # display the plot

# The countplot above shows the number of samples for each emotion class

data,sr = librosa.load(file_path[0])
sr
22050

# CREATE LOG MEL SPECTROGRAM


plt.figure(figsize=(10, 5))
spectrogram = librosa.feature.melspectrogram(y=data, sr=sr, n_mels=128,fmax=8000)
log_spectrogram = librosa.power_to_db(spectrogram)
librosa.display.specshow(log_spectrogram, y_axis='mel', sr=sr, x_axis='time');
plt.title('Mel Spectrogram ')
plt.colorbar(format='%+2.0f dB')

# NOISE
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)
    data = data + noise_amp * np.random.normal(size=data.shape[0])
    return data

# STRETCH
def stretch(data, rate=0.8):
    return librosa.effects.time_stretch(data, rate=rate)

# SHIFT
def shift(data):
    shift_range = int(np.random.uniform(low=-5, high=5) * 1000)
    return np.roll(data, shift_range)

# PITCH
def pitch(data, sampling_rate, pitch_factor=0.7):
    return librosa.effects.pitch_shift(data, sr=sampling_rate, n_steps=pitch_factor)

# NORMAL AUDIO

import librosa.display
plt.figure(figsize=(12, 5))
librosa.display.waveshow(y=data, sr=sr)
ipd.Audio(data,rate=sr)

#AUDIO WITH NOISE
x = noise(data)
plt.figure(figsize=(12,5))
librosa.display.waveshow(y=x, sr=sr)
ipd.Audio(x, rate=sr)

# STRETCHED AUDIO
x = stretch(data)
plt.figure(figsize=(12, 5))
librosa.display.waveshow(y=x, sr=sr)
ipd.Audio(x, rate=sr)

#AUDIO WITH PITCH
x = pitch(data, sr)
plt.figure(figsize=(12, 5))
librosa.display.waveshow(y=x, sr=sr)
ipd.Audio(x, rate=sr)

#FEATURE EXTRACTION
def zcr(data, frame_length, hop_length):
    zcr = librosa.feature.zero_crossing_rate(data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(zcr)

def rmse(data, frame_length=2048, hop_length=512):
    rmse = librosa.feature.rms(y=data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(rmse)

def mfcc(data, sr, frame_length=2048, hop_length=512, flatten: bool = True):
    mfcc = librosa.feature.mfcc(y=data, sr=sr)
    return np.squeeze(mfcc.T) if not flatten else np.ravel(mfcc.T)

def extract_features(data, sr=22050, frame_length=2048, hop_length=512):
    result = np.array([])
    result = np.hstack((result,
                        zcr(data, frame_length, hop_length),
                        rmse(data, frame_length, hop_length),
                        mfcc(data, sr, frame_length, hop_length)))
    return result

def get_features(path, duration=2.5, offset=0.6):
    data, sr = librosa.load(path, duration=duration, offset=offset)
    aud = extract_features(data)
    audio = np.array(aud)

    noised_audio = noise(data)
    aud2 = extract_features(noised_audio)
    audio = np.vstack((audio, aud2))

    pitched_audio = pitch(data, sr)
    aud3 = extract_features(pitched_audio)
    audio = np.vstack((audio, aud3))

    pitched_audio1 = pitch(data, sr)
    pitched_noised_audio = noise(pitched_audio1)
    aud4 = extract_features(pitched_noised_audio)
    audio = np.vstack((audio, aud4))

    return audio
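
As a quick sanity check (an illustrative snippet of our own, assuming the combined data_path dataframe built above), get_features can be applied to a single clip to confirm that it returns one feature row per augmentation variant:

# Illustrative usage of get_features on one clip from the combined dataframe.
sample_path = data_path.Path.iloc[0]
features = get_features(sample_path)
print(features.shape)  # expected: (4, n_features) - original, noised, pitched, pitched+noised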

import multiprocessing as mp
print("Number of processors: ", mp.cpu_count())
Number of processors: 2

#Faster way to get features


This code is an example of how to use the joblib library to process multiple audio files in parallel through the process_feature function, and the timeit library to measure how long the processing takes.

Here is a breakdown of what the code does. The `from joblib import Parallel, delayed` statement imports the Parallel and delayed functions from the joblib library, and `start = timeit.default_timer()` starts a timer. The process_feature function processes a single audio file by extracting its features with get_features and returning the corresponding X (feature) and Y (label) values. The paths and emotions variables are taken from the data_path DataFrame. The Parallel call then runs process_feature in parallel for each audio file, with delayed used to wrap the function; the results variable contains the X and Y values for every file, and the X and Y lists are populated from each result using the extend method. Finally, `stop = timeit.default_timer()` stops the timer and `print('Time: ', stop - start)` reports the elapsed time. Overall, this demonstrates how joblib can significantly reduce the processing time for large datasets.

The .extend() method increases the length of a list by the number of elements provided to it, so it is a convenient way to add multiple elements to a list at once.

"""from joblib import Parallel, delayed

import timeit
start = timeit.default_timer()
# Define a function to get features for a single audio file
def process_feature(path, emotion):
features = get_features(path)
X = []
Y = []
for ele in features:
X.append(ele)
# appending emotion 3 times as we have made 3 augmentation techniques on each audio file.
Y.append(emotion)
return X, Y

paths = data_path.Path
emotions = data_path.Emotions
# Run the loop in parallel
results = Parallel(n_jobs=-1)(delayed(process_feature)(path, emotion) for (path, emotion) in zip(p
aths, emotions))
# Collect the results
X = []
Y = []

43
for result in results:
x, y = result

X.extend(x)
Y.extend(y)

stop = timeit.default_timer()
print('Time: ', stop - start) """

"from joblib import Parallel, delayed \nimport timeit\nstart = timeit.default_timer()\n# Define a fu


nction to get features for a single audio file\ndef process_feature(path, emotion):\n features = ge
t_features(path)\n X = []\n Y = []\n for ele in features:\n X.append(ele)\n # appendi
ng emotion 3 times as we have made 3 augmentation techniques on each audio file.\n Y.appe
nd(emotion)\n return X, Y\n\npaths = data_path.Path\nemotions = data_path.Emotions\n\n# Run
the loop in parallel\nresults = Parallel(n_jobs=-1)(delayed(process_feature)(path, emotion) for (pa
th, emotion) in zip(paths, emotions))\n\n# Collect the results\nX = []\nY = []\nfor result in results:
\n x, y = result\n X.extend(x)\n Y.extend(y)\n\n\nstop = timeit.default_timer()\n\nprint('Tim
e: ', stop - start) "

#Paths
paths[:5]

['/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_pleasant_surprised/YAF_pain_ps.wav',
 '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_pleasant_surprised/YAF_love_ps.wav',
 '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_pleasant_surprised/YAF_near_ps.wav',
 '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_pleasant_surprised/YAF_rain_ps.wav',
 '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_pleasant_surprised/YAF_merge_ps.wav']

#Labels

labels[:5]

['ps', 'ps', 'ps', 'ps', 'ps']

#Create a DataFrame

df = pd.DataFrame()
df['speech'] = paths
df['label'] = labels
df.head()  # show the first rows of the dataframe

0 /kaggle/input/toronto-emotional-speech-set-tes... ps
1 /kaggle/input/toronto-emotional-speech-set-tes... ps
2 /kaggle/input/toronto-emotional-speech-set-tes... ps
3 /kaggle/input/toronto-emotional-speech-set-tes... ps
4 /kaggle/input/toronto-emotional-speech-set-tes... ps

#Counts

df['label'].value_counts()

label

ps 400
neutral 400
disgust 400
happy 400
fear 400
angry 400
sad 400
Name: count, dtype: int64

#Exploratory Data Analysis

sns.countplot(data=df, x='label')  # check whether the data are uniformly distributed across labels

<Axes: xlabel='label', ylabel='count'>

def waveplot(data, sr, emotion):
    plt.figure(figsize=(10, 4))  # 10 inches wide and 4 inches tall
    plt.title(emotion, size=20)
    librosa.display.waveshow(data, sr=sr)
    plt.show()

def spectrogram(data, sr, emotion):
    x = librosa.stft(data)
    xdb = librosa.amplitude_to_db(abs(x))  # amplitude to decibels
    plt.figure(figsize=(10, 4))
    plt.title(emotion, size=20)
    librosa.display.specshow(xdb, sr=sr, x_axis='time', y_axis='hz')

#Fear

emotion = 'fear'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)

#Neutral

emotion = 'neutral'

path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)
#Pleasant surprise (ps)

emotion = 'ps'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)

#Happy

emotion = 'happy'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)

#Sad code

emotion = 'sad'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)

#Disgust code

emotion = 'disgust'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)

#Feature Extraction

def extract_mfcc(filename):
    y, sr = librosa.load(filename, duration=3, offset=0.5)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    return mfcc

extract_mfcc(df['speech'][0])

array([-281.8108, 62.511196, -12.258653, 5.399305..,


-1.76875, 18.420088, -1.54607777, -19.30712…,
-1.7788717,17.739153, -21.011467, 1.1004784,
-0.85493857, -1.170434, -1.2859566, 1.1004784,
-13.254162, 12.508676, 1.793289, 7.7120137,
15.739933, 9.816137, 2.727184, 10.273345,
-4.466957, -2.6667902, -4.466957, -6.418108,
-11.540051, 1.3196231, -3.7845402, 1.1748109,
-2.9630053, -0.32612854, 1.3370216, 4.015099,
-3.135072, 1.8441529],
dtype=float32)

X_mfcc = df['speech'].apply(lambda x: extract_mfcc(x))

X_mfcc

0 [-281.8108, 62.511196, -12.256853, 5.399305,-…,


1 [-271.23245, 21.5418559, -18.678143, 10.60208, ...
2 [-292.78122.7144, 73.24398, -21.521925, 8.511741, -...
3 [-344.21515, 48.95972, -26.031006, 5.702147,...
4 [-354.6095, 80.40532, -21.864805, -0.87233377,...
...
2795 [-436.2176, 105.68216, 5.6842546, 41.07361, -9...
2796 [-399.3086, 80.32976, -0.8494893, -8.483726, -...
2797 [-415.19397, 89.78226, -13.865563, -14.909603,...
2798 [-412.90628, 106.49111, 4.2450223, -10.701215,...
2799 [-467.83038, 92.1703, 5.8891487, -9.331061, -4...
Name: speech, Length: 2800, dtype: object

X=[x for x in X_mfcc]


X=np.array(X)
X.shape

(2800,40)
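
The training cell below calls model.fit, but the cells that one-hot encode the labels and define the network are not included in this appendix. The following is only a sketch of what those missing cells could look like, assuming a Keras LSTM classifier over the 40 MFCC features (consistent with the architectures discussed in this report); the layer sizes are illustrative assumptions.

# Hypothetical sketch of the missing label-encoding and model-definition cells.
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

enc = OneHotEncoder()
y = enc.fit_transform(np.array(df['label']).reshape(-1, 1)).toarray()  # shape (2800, 7)

X = np.expand_dims(X, -1)  # shape (2800, 40, 1): each 40-dim MFCC vector becomes a length-40 sequence

model = Sequential([
    LSTM(256, input_shape=(40, 1)),
    Dropout(0.2),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(7, activation='softmax'),   # seven emotion classes present in the TESS labels
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Reshaping X to (samples, 40, 1) lets the model.fit call below run unchanged with this sketch.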

#Train the model

history = model.fit(X, y, validation_split=0.2, epochs=50, batch_size=64)

Epoch 1/50
35/35 [==============================] - 10s 197ms/step - loss: 1.2568 - accuracy:
0.5147 - val_loss: 1.0349 - val_accuracy: 0.5804
Epoch 2/50
35/35 [==============================] - 7s 208ms/step - loss: 0.4716 - accuracy:
0.8219 - val_loss: 1.0194 - val_accuracy: 0.6161
Epoch 3/50
35/35 [==============================] - 6s 169ms/step - loss: 0.2969 - accuracy:
0.9067 - val_loss: 0.7541 - val_accuracy: 0.7571
Epoch 4/50
35/35 [==============================] - 7s 209ms/step - loss: 0.2257 - accuracy:
0.9259 - val_loss: 0.9466 - val_accuracy: 0.7054
Epoch 5/50
35/35 [==============================] - 6s 186ms/step - loss: 0.1249 - accuracy:
0.9603 - val_loss: 0.9181 - val_accuracy: 0.7589
Epoch 6/50
35/35 [==============================] - 7s 202ms/step - loss: 0.1301 - accuracy:
0.9580 - val_loss: 1.2963 - val_accuracy: 0.6500
Epoch 7/50
35/35 [==============================] - 6s 167ms/step - loss: 0.0951 - accuracy:
0.9705 - val_loss: 1.4729 - val_accuracy: 0.6339
Epoch 8/50
35/35 [==============================] - 7s 200ms/step - loss: 0.0914 - accuracy:
0.9705 - val_loss: 0.7101 - val_accuracy: 0.8089
Epoch 9/50
35/35 [==============================] - 6s 165ms/step - loss: 0.1235 - accuracy:
0.9580 - val_loss: 0.6240 - val_accuracy: 0.8250
Epoch 10/50
35/35 [==============================] - 9s 248ms/step - loss: 0.0636 - accuracy:
0.9804 - val_loss: 0.6841 - val_accuracy: 0.8411
Epoch 11/50
35/35 [==============================] - 6s 168ms/step - loss: 0.0542 - accuracy:
0.9812 - val_loss: 1.6489 - val_accuracy: 0.6589
Epoch 12/50
35/35 [==============================] - 7s 199ms/step - loss: 0.0590 - accuracy:
0.9799 - val_loss: 2.3056 - val_accuracy: 0.5643
Epoch 13/50
35/35 [==============================] - 6s 161ms/step - loss: 0.0572 - accuracy:
0.9790 - val_loss: 1.3272 - val_accuracy: 0.7232
Epoch 14/50
35/35 [==============================] - 7s 205ms/step - loss: 0.0676 - accuracy:
0.9790 - val_loss: 1.7888 - val_accuracy: 0.6196
Epoch 15/50
35/35 [==============================] - 6s 168ms/step - loss: 0.0426 - accuracy:
0.9853 - val_loss: 1.4516 - val_accuracy: 0.7179
Epoch 16/50
35/35 [==============================] - 7s 209ms/step - loss: 0.0441 - accuracy:
0.9853 - val_loss: 1.3134 - val_accuracy: 0.7554
Epoch 17/50
35/35 [==============================] - 6s 169ms/step - loss: 0.0778 - accuracy:
0.9781 - val_loss: 1.2486 - val_accuracy: 0.7179
Epoch 18/50
35/35 [==============================] - 7s 194ms/step - loss: 0.0718 - accuracy:
0.9763 - val_loss: 1.0359 - val_accuracy: 0.7661
Epoch 19/50
35/35 [==============================] - 6s 162ms/step - loss: 0.0388 - accuracy:
0.9862 - val_loss: 1.4881 - val_accuracy: 0.6929
Epoch 20/50
35/35 [==============================] - 7s 204ms/step - loss: 0.0199 - accuracy:
0.9942 - val_loss: 1.5691 - val_accuracy: 0.7143
Epoch 21/50
35/35 [==============================] - 6s 165ms/step - loss: 0.0306 - accuracy:
0.9893 - val_loss: 1.7848 - val_accuracy: 0.6982
Epoch 22/50
35/35 [==============================] - 7s 193ms/step - loss: 0.0378 - accuracy:
0.9879 - val_loss: 1.9882 - val_accuracy: 0.6161
Epoch 23/50
35/35 [==============================] - 6s 167ms/step - loss: 0.0437 - accuracy:
0.9884 - val_loss: 2.8374 - val_accuracy: 0.5179
Epoch 24/50
35/35 [==============================] - 6s 183ms/step - loss: 0.0501 - accuracy:
0.9844 - val_loss: 2.5636 - val_accuracy: 0.5500
Epoch 25/50
35/35 [==============================] - 6s 173ms/step - loss: 0.0553 - accuracy:
0.9871 - val_loss: 1.9561 - val_accuracy: 0.6339
Epoch 26/50
35/35 [==============================] - 7s 192ms/step - loss: 0.0540 - accuracy:
0.9817 - val_loss: 1.7638 - val_accuracy: 0.6768
Epoch 27/50
35/35 [==============================] - 6s 179ms/step - loss: 0.0252 - accuracy:
0.9915 - val_loss: 2.2529 - val_accuracy: 0.6179
Epoch 28/50
35/35 [==============================] - 7s 198ms/step - loss: 0.0256 - accuracy:
0.9920 - val_loss: 2.2179 - val_accuracy: 0.6429
Epoch 29/50
35/35 [==============================] - 6s 161ms/step - loss: 0.0350 - accuracy:
0.9906 - val_loss: 1.4482 - val_accuracy: 0.7821
Epoch 30/50
35/35 [==============================] - 7s 199ms/step - loss: 0.0214 - accuracy:
0.9911 - val_loss: 2.1948 - val_accuracy: 0.6696
Epoch 31/50
35/35 [==============================] - 6s 165ms/step - loss: 0.0231 - accuracy:
0.9920 - val_loss: 1.4114 - val_accuracy: 0.7768
Epoch 32/50
35/35 [==============================] - 8s 235ms/step - loss: 0.0214 - accuracy:
0.9946 - val_loss: 1.7125 - val_accuracy: 0.7089
Epoch 33/50
35/35 [==============================] - 7s 202ms/step - loss: 0.0176 - accuracy:
0.9951 - val_loss: 2.0543 - val_accuracy: 0.7232
Epoch 34/50
35/35 [==============================] - 7s 197ms/step - loss: 0.0201 - accuracy:
0.9955 - val_loss: 2.1375 - val_accuracy: 0.6679
Epoch 35/50
35/35 [==============================] - 6s 168ms/step - loss: 0.0062 - accuracy:
0.9978 - val_loss: 1.6907 - val_accuracy: 0.7554
Epoch 36/50
35/35 [==============================] - 7s 198ms/step - loss: 0.0392 - accuracy:
0.9915 - val_loss: 1.5130 - val_accuracy: 0.7536
Epoch 37/50
35/35 [==============================] - 7s 194ms/step - loss: 0.0463 - accuracy:
0.9848 - val_loss: 1.2128 - val_accuracy: 0.7679
Epoch 38/50
35/35 [==============================] - 7s 210ms/step - loss: 0.0130 - accuracy:
0.9969 - val_loss: 3.6133 - val_accuracy: 0.4750
Epoch 39/50
35/35 [==============================] - 6s 169ms/step - loss: 0.0227 - accuracy:
0.9937 - val_loss: 2.5343 - val_accuracy: 0.6054
Epoch 40/50
35/35 [==============================] - 7s 205ms/step - loss: 0.0200 - accuracy:
0.9942 - val_loss: 1.8955 - val_accuracy: 0.7179
Epoch 41/50

35/35 [==============================] - 6s 175ms/step - loss: 0.0209 - accuracy:
0.9937 - val_loss: 2.2416 - val_accuracy: 0.6875
Epoch 42/50
35/35 [==============================] - 7s 202ms/step - loss: 0.0298 - accuracy:
0.9920 - val_loss: 2.2757 - val_accuracy: 0.6071
Epoch 43/50
35/35 [==============================] - 6s 170ms/step - loss: 0.0236 - accuracy:
0.9933 - val_loss: 2.0820 - val_accuracy: 0.6339
Epoch 44/50
35/35 [==============================] - 7s 198ms/step - loss: 0.0222 - accuracy:
0.9920 - val_loss: 3.0279 - val_accuracy: 0.6071
Epoch 45/50
35/35 [==============================] - 6s 170ms/step - loss: 0.0051 - accuracy:
0.9982 - val_loss: 2.7979 - val_accuracy: 0.6125
Epoch 46/50
35/35 [==============================] - 7s 195ms/step - loss: 0.0040 - accuracy:
0.9982 - val_loss: 2.1375 - val_accuracy: 0.6750
Epoch 47/50
35/35 [==============================] - 6s 168ms/step - loss: 0.0277 - accuracy:
0.9942 - val_loss: 2.2060 - val_accuracy: 0.6250
Epoch 48/50
35/35 [==============================] - 7s 212ms/step - loss: 0.0140 - accuracy:
0.9942 - val_loss: 2.1846 - val_accuracy: 0.6321
Epoch 49/50
35/35 [==============================] - 6s 167ms/step - loss: 0.0405 - accuracy:
0.9875 - val_loss: 1.8148 - val_accuracy: 0.6143
Epoch 50/50
35/35 [==============================] - 6s 171ms/step - loss: 0.0111 - accuracy:
0.9960 - val_loss: 3.9371 - val_accuracy: 0.5661

#Plot the results

epochs = list(range(50))
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plt.plot(epochs, acc, label='Train Accuracy')
plt.plot(epochs, val_acc, label='Val Accuracy')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()
plt.show()
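
For completeness, a companion plot of the training and validation loss can be produced from the same history object (our own addition); the logged values above already show the validation loss rising after roughly epoch 10 while the training loss keeps falling, i.e. the model overfits.

# Plot training vs. validation loss from the same Keras history object.
loss = history.history['loss']
val_loss = history.history['val_loss']

plt.plot(epochs, loss, label='Train Loss')
plt.plot(epochs, val_loss, label='Val Loss')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend()
plt.show()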

REFERENCES

1) Kim, S.; Bang, J.; Kim, D. "Speech Emotion Recognition Using Convolutional and Recurrent Neural Networks," 2019.

2) Zeng, Y.; Li, Z.; Tang, Z.; Chen, Z.; Ma, H. "Heterogeneous graph convolution based on in-domain self-supervision for multimodal sentiment analysis," Expert Systems with Applications, 2023.

3) Chintalapudi, K.S.; Patan, I.A.K.; Sontineni, H.V.; Muvala, V.S.K. "Speech Emotion Detection using Deep Learning," 2023 International Conference on Computer Communication and Informatics (ICCCI), 2023.

4) Mohammadi, S.; Hashemi, A.; Zandiye, H.; Hemati, N. "Speech Emotion Detection using Deep Learning Techniques and Augmented Features," International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), 2023.

5) Kim, T.-W.; Kwak, K.-C. "Speech Emotion Recognition Using Deep Learning Transfer Models and Explainable Techniques," Department of Electronics Engineering, Interdisciplinary Program in IT-Bio Convergence System, Chosun University, Gwangju 61452, Republic of Korea, 2024.

6) Zhang, Z., et al. "A Survey on Deep Learning for Multimodal Data Fusion," Information Fusion, 2020.

7) Kaur, K.; Singh, P. "Trends in speech emotion recognition: A comprehensive survey," Multimedia Tools and Applications, 2023, 82, 29307–29351.

8) Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. "Speech emotion recognition with co-attention based multi-level acoustic information," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022.

9) Zeng, Y.; Li, Z.; Tang, Z.; Chen, Z.; Ma, H. "Heterogeneous graph convolution based on in-domain self-supervision for multimodal sentiment analysis," Expert Systems with Applications, 2023.

10) Singh, V.; Prasad, S. "Speech Emotion Recognition system using gender dependent convolution neural network," Procedia Computer Science, Volume 218, 2023.

11) Xu, C.; Liu, Y.; Song, W.; Liang, Z.; Chen, X. "A New Network for Speech Emotion Recognition Research," School of Electronic Information Engineering, Changchun University of Science and Technology, Changchun 130022, China, 2024.

12) Gangashetty, S.V.; Dubey, A.K. "Speech Emotion Recognition using Deep Learning," International Journal for Science Technology and Engineering, 2023.

13) Liu, G.; Cai, S.; Wang, C. "Speech Emotion Recognition based on emotion perception," 2023.

14) Pavithra, A.; Ledalla, S.; Devi, J.S.; Dinesh, G.; Singh, M.; Reddy, G.V. "Deep Learning-based Speech Emotion Recognition: An Investigation into a Sustainable Emotion-Speech Relationship," 15th International Conference on Materials Processing and Characterization (ICMPC 2023), 2023.
