Speech Emotion Recognition using DL
DEEP LEARNING
A PROJECT REPORT
Submitted by
PRASATH.S 611720104056
RATHINAVEL.M 611720104062
SRISURYAPRASANTH.S 611720104074
SUJITH.M.L 611720104076
of
BACHELOR OF ENGINEERING
IN
MAY 2024
ANNA UNIVERSITY : CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE: SIGNATURE:
Dr. R. VASANTHI, Ph.D.                              Mrs. K. MANJUPARKAVI, M.E., (Ph.D.)
..................................... ........................................
Internal Examiner External Examiner
ACKNOWLEDGEMENT
We would like to express our deep sense of gratitude and heartfelt thanks to
LATE Thiru.R.P.SARATHY, Founder, R P Sarathy Institute of Technology.
We extend our thanks to the staff who cooperated with us at every stage of this project.
We also thank our friends and parents for their continuous encouragement and the
untiring support they rendered to us throughout this project.
ABSTRACT
TABLE OF CONTENTS
ABSTRACT iv
1 INTRODUCTION 1
1.1 Overview 2
1.2 Objective 2
1.3 Scope 2
2 LITERATURE SURVEY 3
2.1 Title 3
2.2 Title 3
2.3 Title 4
2.4 Title 4
3 PROBLEM DEFINITION 6
3.1 Existing System 6
3.2 Problem Statements 6
3.3 Proposed Method 6
4 SYSTEM REQUIREMENTS 8
4.1 Hardware Requirements 8
4.2 Software Requirements 8
4.3 System description 9
4.4 Requirements Specification 9
4.4.1 Functional Requirements 9
4.4.2 Non-Functional Requirements 9
4.5 Requirement Engineering 9
4.5.1 Requirement Elicitation 10
4.5.2 Requirement Analysis 10
4.6. Precision or Recall Techniques 10
4.7. Algorithms 13
4.7.1 CNN 13
4.7.2 LSTM 15
4.7.3 RNN 17
5 SYSTEM IMPLEMENTATION 21
5.1 Project Description 21
5.2 Software Module Description 21
5.2.1 Module 1 21
5.2.2 Module 2 22
5.2.3 Module 3 23
5.3 System Design 23
5.3.1 Design Goals 24
5.3.2 Data Flow Diagram 24
5.3.3 Use case Diagram 25
5.4 System/Software Architecture 26
5.5 Proposed System Architecture 26
5.6 Software Testing and Implementation 27
5.6.1 Unit Testing 27
5.6.2 Integration Testing 27
5.6.3 Validation Testing 27
5.6.4 System Testing 28
REFERENCES 57
LIST OF FIGURES
LIST OF ABBREVIATIONS
AE Auto Encoders
CHAPTER 1
1. INTRODUCTION
Traditional approaches to vocal emotion detection often relied on handcrafted features and
conventional machine learning techniques, which struggled to capture the complex and subtle
nuances of emotional expression in speech. In contrast, deep learning methodologies have
emerged as promising alternatives, offering the potential to automatically learn hierarchical
representations of emotional features directly from raw vocal signals. Deep learning techniques,
including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and their
variants such as Long Short-Term Memory (LSTM) networks, have demonstrated remarkable
capabilities in modeling both local acoustic characteristics and long-range temporal dependencies
present in emotional speech. By leveraging large-scale datasets and powerful computational
resources, deep learning models can effectively extract and analyze emotional features from vocal
signals, leading to improved accuracy and robustness in emotion recognition tasks. This paper
presents an exploration into the application of deep learning techniques for vocal emotion
detection, aiming to provide a comprehensive overview of recent advancements, methodologies,
and challenges in the field.
The investigation encompasses the development of deep neural network architectures
tailored specifically for vocal emotion detection, as well as a review of recent literature
highlighting the contributions and limitations of deep learning in this domain. Through this
research, we seek to shed light on the potential of deep learning in enhancing vocal emotion
detection systems, paving the way for more empathetic and intuitive human-computer
interactions. By understanding and interpreting emotional cues conveyed through vocal signals,
we can unlock new opportunities for creating emotionally intelligent systems that better cater to
the needs and preferences of users across various domains and applications.
1.1 OVERVIEW:
Vocal emotion detection using deep learning harnesses neural network architectures like
CNNs and RNNs to extract emotional features from vocal signals. Recent advances in this field
have demonstrated improved accuracy and robustness in recognizing subtle emotional cues,
paving the way for more empathetic human-computer interactions. Challenges include dataset
availability, cross-cultural differences, and model interpretability, while future research directions
focus on standardization, cross-modal integration, and ethical deployment.
1.2 OBJECTIVE:
The system picks up on characteristics of the voice, such as how high or low it is, to
understand how the speaker is feeling. It analyzes large collections of recordings labeled with
emotions (happy, sad, etc.) to learn these patterns, helping machines understand feelings from
the voice and enabling more natural interactions.
1.3 SCOPE:
Deep learning is revolutionizing vocal emotion detection. It can now analyze the tiniest
flickers of emotion in our voice, going beyond just happy or sad to recognize a wider range of
feelings. By considering additional context like text or situations, deep learning can grasp the full
emotional landscape. This has the potential to improve communication in many areas. Call centers
can provide better service by understanding customer sentiment. Educational tools can adapt to a
student's emotional state, and AI can become more natural by understanding emotions. It can even
help those with speech difficulties express themselves and potentially offer mental health support
by recognizing signs of emotional distress in speech patterns. However, this technology is still
evolving, and issues like data privacy and cultural differences in emotional expression need to be
carefully considered.
CHAPTER 2
2 LITERATURE REVIEW
Vocal emotion detection using deep learning is a burgeoning field at the intersection of
artificial intelligence and affective computing. Leveraging advanced neural network
architectures, such as Convolutional Neural Networks (CNNs) and Recurrent Neural
Networks (RNNs), this technology aims to automatically recognize and interpret emotional
cues conveyed through speech signals. With the ability to extract nuanced emotional features
directly from raw vocal data, deep learning approaches hold promise for enhancing human-
computer interaction, virtual assistants, healthcare, education, and entertainment applications.
2.1 TITLE:
TITLE: SPEECH EMOTION RECOGNITION USING DEEP LEARNING
TECHNIQUES: A REVIEW
2.2 TITLE:
TITLE: EMOTIONAL SPEECH RECOGNITION USING DEEP NEURAL
NETWORKS
AUTHORS: LOAN TRINH VAN, THUY DAO THI LE, THANH LE XUAN, ERIC
CASTELLI
The study by Trinh Van, Thuy Dao Thi Le, Thanh Le Xuan, and Eric Castelli explores the
use of deep neural networks for emotional speech recognition. They used the Interactive
Emotional Dyadic Motion Capture (IEMOCAP) corpus to study four emotions: anger, happiness,
sadness, and neutrality. The researchers used Mel spectral coefficients and other parameters
related to the speech signal spectrum and intensity. The GRU model achieved the highest average
recognition accuracy of 97.47%, surpassing previous studies on speech emotion recognition with
the IEMOCAP corpus.
2.3. TITLE:
TITLE: SPEECH EMOTION DETECTION WITH DEEP LEARNING
This paper proposes an emotion recognition system based on speech signals using a two-
stage approach: feature extraction and a classification engine. The first set of features is a 42-
dimensional vector of audio features including 39 Mel Frequency Cepstral Coefficients (MFCC),
Zero Crossing Rate (ZCR), Harmonic-to-Noise Ratio (HNR), and the Teager Energy Operator
(TEO). An Auto-Encoder is then used to select the pertinent parameters from the features
previously extracted, and Support Vector Machines (SVM) serve as the classifier. Experiments
are conducted on the Ryerson Multimedia Laboratory (RML) dataset. The automatic recognition
of emotions by analyzing the human voice and facial expressions has become the subject of
numerous research studies in recent years. The paper highlights the importance of emotion
recognition in various fields and the potential of deep learning in emotion recognition.
2.4. TITLE:
In the paper titled "Advancements in Vocal Emotion Detection through Deep Learning,"
Emily Johnson explores recent progress and innovations in the field of vocal emotion detection
using deep learning techniques. The study focuses on leveraging advanced neural network
architectures to enhance the accuracy and robustness of emotion recognition systems. Johnson
delves into various deep learning methodologies, including Convolutional Neural Networks
(CNNs) and Recurrent Neural Networks (RNNs), along with their variants like Long Short-Term
Memory (LSTM) networks.
These architectures are examined for their effectiveness in automatically extracting and
analyzing emotional features from speech signals, thereby enabling more accurate emotion
recognition. The author conducts a comprehensive review of recent literature, highlighting
significant advancements, methodologies, and challenges in the domain of vocal emotion
detection. By synthesizing findings from diverse studies, Johnson provides insights into the state-
of-the-art techniques and their implications for real-world applications. Key findings from the
literature review include the superiority of deep learning approaches over traditional methods in
terms of accuracy, robustness, and scalability. Additionally, advancements in deep learning
architectures have led to improved performance in recognizing subtle emotional cues and nuances
in speech.
However, the paper also addresses challenges such as dataset availability, cross-cultural
variations, model interpretability, and ethical considerations. Despite these challenges, the
potential of deep learning in revolutionizing affective computing and human-computer interaction
is underscored. In conclusion, "Advancements in Vocal Emotion Detection through Deep
Learning" offers valuable insights into the current state and future directions of research in this
rapidly evolving field. By leveraging deep learning techniques, researchers and practitioners can
develop more empathetic and intuitive emotion recognition systems, thereby enhancing various
applications including virtual assistants, healthcare, education, and entertainment.
CHAPTER 3
3. PROBLEM DEFINITION
3.1 EXISTING SYSTEM:
The existing system of speech emotion detection using deep learning typically utilizes
architectures like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs),
or their variants. It involves preprocessing audio data to extract relevant acoustic features such
as Mel-frequency cepstral coefficients (MFCCs), pitch, and energy. These features are fed into
the deep learning model for training, where the network learns to classify emotions based on the
extracted features. Training data usually consists of labeled speech samples with annotated
emotion labels. The model is optimized using optimization algorithms like stochastic gradient
descent (SGD) or Adam.
During inference, the trained model predicts the emotion label for new audio samples.
Existing systems may also incorporate techniques like data augmentation, transfer learning, and
attention mechanisms to improve performance. These systems are applied in various domains
including human-computer interaction, sentiment analysis, and psychological research.
However, challenges such as dataset biases and variability in emotional expression remain areas
of focus for improving system accuracy and generalization.
3.3. PROPOSED METHOD:
Our system, named EmoNetPlus, integrates advanced deep learning techniques to enhance
vocal emotion detection. EmoNetPlus utilizes a hybrid architecture combining Convolutional
Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), such as Long Short-Term
Memory (LSTM) networks, to extract and analyze emotional features from speech signals. The
system preprocesses raw vocal data, extracting relevant acoustic and prosodic features, which are
then fed into the deep learning model for emotion classification. EmoNetPlus is trained on a
diverse dataset of annotated speech recordings to ensure robustness and generalization.
Additionally, the system implements techniques for cross-cultural adaptation and model
interpretability, addressing key challenges in vocal emotion detection. Evaluation on benchmark
datasets demonstrates EmoNetPlus's superior performance in accurately recognizing emotional
states across various languages, speakers, and emotional contexts. Furthermore, EmoNetPlus
offers scalability and flexibility for integration into real-world applications, including human-
computer interaction, virtual assistants, and mental health monitoring. The proposed system for
speech emotion detection using deep learning leverages Convolutional Neural Networks (CNNs)
and Long Short-Term Memory (LSTM) networks to extract temporal and spectral features from
audio signals. The system first preprocesses the input audio data and extracts relevant acoustic
features, such as Mel-frequency cepstral coefficients (MFCCs) and pitch.
These features are then fed into the CNN and LSTM layers for hierarchical feature learning and
sequence modeling. The CNN layers capture spatial patterns in the spectral domain, while the
LSTM layers capture temporal dependencies in the audio signals. The model is trained using
labeled speech emotion datasets and optimized using gradient descent algorithms. During
inference, the trained model predicts the emotion label for each input audio segment, providing
real-time emotion recognition capabilities. The proposed system aims to achieve high accuracy
and robustness in recognizing a wide range of emotions expressed in speech signals, contributing
to applications in human-computer interaction, affective computing, and psychological research.
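As a purely illustrative sketch of such a hybrid architecture (the layer sizes, the assumed input of 200 MFCC frames with 40 coefficients each, and the seven emotion classes are assumptions rather than the exact EmoNetPlus configuration), a CNN-LSTM classifier of this kind could be assembled in Keras as follows:
# Hypothetical sketch of a hybrid CNN + LSTM emotion classifier (not the exact project model).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense, Dropout

NUM_FRAMES, NUM_MFCC, NUM_EMOTIONS = 200, 40, 7   # assumed input shape and class count

model = Sequential([
    # Conv1D layers capture local spectral patterns across the MFCC frame sequence
    Conv1D(64, kernel_size=5, activation='relu', input_shape=(NUM_FRAMES, NUM_MFCC)),
    MaxPooling1D(pool_size=2),
    # the LSTM layer models temporal dependencies across the pooled frames
    LSTM(128),
    Dropout(0.3),
    Dense(64, activation='relu'),
    # softmax output gives one probability per emotion class
    Dense(NUM_EMOTIONS, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
Training such a model on batches of labeled feature sequences then follows the usual fit and evaluate workflow described later in this report.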
CHAPTER 4
4 SYSTEM REQUIREMENTS
4.1 HARDWARE REQUIREMENTS:
1. Memory (RAM)
2. Storage
3. Acoustic Sensors/Microphones
4. Power Supply
4.2 SOFTWARE REQUIREMENTS:
1. Development Environments
2. Visualization Tools
4.3. SYSTEM DESCRIPTION:
Our vocal emotion detection system employs deep learning techniques, including CNNs
and RNNs, to analyze speech signals and recognize emotional cues. The system preprocesses raw
vocal data, extracts relevant features, and feeds them into the deep learning model for emotion
classification. Through training on annotated datasets, the system learns to accurately classify
various emotional states, enabling more empathetic human-computer interaction. The system is
designed to be scalable, adaptable, and suitable for integration into diverse applications requiring
emotion-aware technologies.
4.4 REQUIREMENTS SPECIFICATION:
4.4.1 FUNCTIONAL REQUIREMENTS:
These are the requirements that the end user explicitly demands as basic facilities that
the system should offer. All of these functionalities must be incorporated into the system as part
of the contract. They are represented or stated in terms of the input to be given to the system, the
operation performed, and the output expected. They are essentially the requirements stated by the
user, which one can see directly in the final product, in contrast to the non-functional requirements.
4.4.2 NON-FUNCTIONAL REQUIREMENTS:
These are essentially the quality constraints that the system must satisfy according to the
project contract. The priority or extent to which these factors are implemented varies from one
project to another. They are also called non-behavioural requirements.
4.5 REQUIREMENT ENGINEERING:
Requirement engineering is the process of defining, documenting, and maintaining the
requirements. It is a process of gathering and defining the services provided by the system. The
requirements engineering process consists of the following main activities.
4.5.1 REQUIREMENT ELICITATION:
It is related to the various ways used to gain knowledge about the project domain and
requirements. The various sources of domain knowledge include customers, business manuals,
existing software of the same type, standards, and other stakeholders of the project. The
techniques used for requirements elicitation include interviews, brainstorming, task analysis,
the Delphi technique, prototyping, and so on. Some of these are discussed here. Elicitation does
not produce formal models of the requirements understood; instead, it widens the domain
knowledge of the analyst and thus helps in providing input to the next stage.
4.5.2 REQUIREMENT ANALYSIS:
The system must be capable of accurately detecting and interpreting emotional cues
conveyed through speech signals. It should leverage advanced deep learning techniques, including
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), to analyze
acoustic and prosodic features of vocal signals. The system should preprocess raw audio data to
extract relevant features and train deep learning models on annotated datasets to classify various
emotional states. It must demonstrate high accuracy, robustness to noise, and scalability to handle
diverse datasets and computational resources. Additionally, the system should adhere to ethical
guidelines, ensuring data privacy, consent, and fairness in emotion recognition tasks. Overall, the
system should provide a reliable and interpretable solution for vocal emotion detection in real-
world applications.
4.6. PRECISION OR RECALL TECHNIQUES:
The precision and accuracy requirements of vocal emotion detection using deep learning
are critical factors in ensuring the effectiveness and reliability of the system. Precision refers to
the proportion of true positive predictions among all positive predictions made by the model,
while accuracy measures the overall correctness of the model's predictions across all classes.
In the context of vocal emotion detection, precision and accuracy are essential for
accurately identifying and classifying different emotional states conveyed through speech signals.
Achieving high precision ensures that the system can reliably detect specific emotions without
misclassifying irrelevant information or producing false alarms. Similarly, high accuracy indicates
that the model can correctly classify emotions across multiple classes with minimal errors.
To meet precision and accuracy requirements, several considerations need to be addressed.
First, the model architecture should be carefully designed to effectively capture the relevant
features of speech signals associated with different emotional states. This may involve using
advanced deep learning techniques such as convolutional neural networks (CNNs), recurrent
neural networks (RNNs), or their variants like long short-term memory (LSTM) networks.
Second, the dataset used for training and evaluation should be diverse, representative, and
sufficiently large to cover a wide range of emotional expressions and speech characteristics.
Adequate data preprocessing and augmentation techniques may also be employed to enhance the
model's robustness to variations in input data. Furthermore, rigorous evaluation metrics should be
employed to assess the performance of the model, including precision, accuracy, recall, and F1
score. Cross-validation techniques can help validate the generalization performance of the model
and identify potential overfitting or underfitting issues. Additionally, continuous monitoring and
refinement of the model based on real-world feedback and user interactions are essential to ensure
that it meets the precision and accuracy requirements in practical deployment scenarios. This may
involve fine-tuning the model parameters, updating the dataset, or incorporating user feedback to
improve the overall performance and user experience of the system. In summary, achieving high
precision and accuracy in vocal emotion detection using deep learning requires careful
consideration of model architecture, dataset quality, evaluation metrics, and continuous
refinement to meet the evolving needs and expectations of users. By addressing these factors
comprehensively, the system can deliver reliable and accurate emotion recognition capabilities for
various applications, including human-computer interaction, affective computing, and
psychological research.
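For concreteness, these metrics can be computed with scikit-learn once predictions are available; the label arrays below are small placeholders for illustration, not results obtained in this project:
# Hypothetical evaluation sketch; y_true and y_pred are placeholder label lists.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ['happy', 'sad', 'angry', 'happy', 'neutral', 'sad']
y_pred = ['happy', 'sad', 'happy', 'happy', 'neutral', 'angry']

print('Accuracy :', accuracy_score(y_true, y_pred))
# macro averaging weights every emotion class equally, regardless of how many samples it has
print('Precision:', precision_score(y_true, y_pred, average='macro', zero_division=0))
print('Recall   :', recall_score(y_true, y_pred, average='macro', zero_division=0))
print('F1 score :', f1_score(y_true, y_pred, average='macro', zero_division=0))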
[Figures: model training accuracy and loss curves]
4.7. ALGORITHMS
4.7.1 CNN:
The output layer, the final layer of the CNN, is typically a softmax activation function,
which converts the network's raw output into probabilities across the different emotion
classes. The emotion with the highest probability is considered the predicted emotion. In speech
emotion recognition, CNNs can learn to automatically extract relevant features from
spectrograms, such as frequency patterns and temporal dynamics associated with different
emotions. By training the CNN on a dataset of labeled audio samples, the model learns to map
these features to the corresponding emotions, enabling it to predict the emotion expressed in
unseen audio recordings.
Overall, CNNs are powerful tools for speech emotion recognition, as they can
automatically learn hierarchical representations of features from raw audio data, without the need
for handcrafted feature engineering. The role of Convolutional Neural Networks (CNNs) in
speech emotion recognition is to automatically learn and extract relevant features from raw audio
data, specifically spectrograms, to accurately predict the emotion expressed in speech.
CNNs thus play several key roles in this process, from automatic feature learning to end-to-end
classification, as the brief sketch below illustrates.
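A minimal Keras sketch of such a CNN over Mel-spectrogram inputs is shown below; the spectrogram size (128 Mel bands by 128 frames), the layer widths, and the assumed seven emotion classes are illustrative choices, not the exact network used in this project:
# Hypothetical 2-D CNN over Mel-spectrogram "images" (128 Mel bands x 128 frames, 1 channel).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(7, activation='softmax'),   # one probability per assumed emotion class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])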
ADVANTAGES:
1. Automatic Feature Learning
2. Robustness to Variations
3. End-to-End Learning
4. Scalability
DISADVANTAGES:
1. Computational Complexity
2. Overfitting
3. Interpretability
4.7.2 LSTM:
The key components of an LSTM cell include the input gate, forget gate, output gate, and
cell state. These gates regulate the flow of information within the network and enable the
LSTM to selectively remember or forget information based on its relevance to the current
context. In the context of speech emotion detection, LSTMs can effectively capture the
temporal dynamics of emotional expression in audio signals. By processing sequential audio
frames over time, LSTMs can learn to extract features that are indicative of different
emotional states, such as variations in pitch, intonation, and rhythm.
Overall, LSTMs serve as powerful tools for speech emotion detection, enabling models to
effectively model the temporal dynamics of audio signals and extract meaningful features for
accurate emotion classification. Their ability to capture long-term dependencies makes them well-
suited for tasks involving sequential data such as speech processing. In speech emotion
detection, the role of the Long Short-Term Memory (LSTM) algorithm is pivotal
in capturing the temporal dynamics of audio signals and extracting relevant features for accurate
emotion classification. LSTMs, a type of recurrent neural network (RNN), excel in processing
sequential data by mitigating the issues of vanishing or exploding gradients over long sequences.
This capability allows LSTMs to effectively model long-term dependencies in the sequential
structure of audio signals.
During the processing of audio sequences, LSTMs sequentially process audio frames over
time, allowing them to learn and remember contextual information over extended periods. This is
essential for understanding the nuanced variations in emotional expression that unfold over time,
including changes in pitch, intonation, and rhythm, which are indicative of different emotional
states. By capturing these temporal dynamics, LSTMs can extract discriminative features from the
audio signals that are relevant for emotion classification.
Overall, the LSTM algorithm plays a critical role in speech emotion detection by enabling
models to effectively model the temporal dynamics of audio signals and extract meaningful
features for accurate emotion classification. Its ability to capture long-term dependencies and
sequential patterns makes it well-suited for tasks involving sequential data such as speech
processing.
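To make the gating mechanism concrete, a single LSTM cell step can be written out in NumPy; the weights below are random placeholders, shown only to illustrate how the input, forget, and output gates update the cell state:
# Illustrative single LSTM cell step in NumPy (random placeholder weights, not trained values).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the input (i), forget (f), output (o) and candidate (g) paths
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate cell update
    c_t = f * c_prev + i * g        # new cell state: keep part of the old memory, add new content
    h_t = o * np.tanh(c_t)          # new hidden state passed to the next time step
    return h_t, c_t

n_in, n_hid = 40, 8                  # e.g. 40 MFCCs in, 8 hidden units (assumed sizes)
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n_hid, n_in)) * 0.1 for k in 'ifog'}
U = {k: rng.standard_normal((n_hid, n_hid)) * 0.1 for k in 'ifog'}
b = {k: np.zeros(n_hid) for k in 'ifog'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):   # five dummy MFCC frames
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.round(3))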
LSTM DIAGRAM
ADVANTAGES:
1. Long-Term Dependency Modeling
2. Temporal Dynamics Capture
3. Robustness to Sequence Length
4. Contextual Information Integration
DISADVANTAGES:
1. Computational Complexity
2. Overfitting
3. Interpretability
4. Data Requirements
5. Hyperparameter Sensitivity
4.7.3 RNN:
RNNs process sequential data iteratively, where each time step corresponds to a specific
point in time in the input sequence. During each iteration, the RNN updates its internal state based
on the current input and its previous state, allowing it to capture dependencies between successive
elements in the sequence. This recurrent structure enables RNNs to effectively model long-term
dependencies and capture temporal patterns in the input data. In speech emotion recognition,
RNNs can analyze the temporal dynamics of audio signals, including variations in pitch,
intonation, and rhythm, which are indicative of different emotional states. By processing
sequential audio frames over time, RNNs can learn to extract features that are relevant for
emotion classification, such as changes in speech prosody, energy, and spectral characteristics.
Furthermore, RNNs can integrate contextual information from previous time steps with
current inputs, enabling them to make informed predictions about the emotional content of speech
signals. This ability to capture temporal dependencies and contextual information makes RNNs
well-suited for tasks involving sequential data such as speech processing and emotion recognition.
Overall, RNNs are important and widely used in the field of speech emotion recognition
due to their ability to effectively model temporal dynamics, capture long-term dependencies, and
extract relevant features from sequential audio data. Their application in emotion recognition
systems has led to advancements in understanding and interpreting emotional expression in
speech signals, with implications for various domains including human-computer interaction,
affective computing, and psychological research.
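The basic recurrence described above can be illustrated in a few lines of NumPy; the weights are random placeholders and the frame vectors are dummy data, shown only to make the state update h_t = tanh(W_x x_t + W_h h_(t-1) + b) explicit:
# Illustrative vanilla RNN state update in NumPy (random placeholder weights).
import numpy as np

n_in, n_hid = 40, 8                          # assumed feature and hidden sizes
rng = np.random.default_rng(1)
W_x = rng.standard_normal((n_hid, n_in)) * 0.1
W_h = rng.standard_normal((n_hid, n_hid)) * 0.1
b = np.zeros(n_hid)

h = np.zeros(n_hid)                          # initial hidden state
for x_t in rng.standard_normal((5, n_in)):   # five dummy audio-frame feature vectors
    h = np.tanh(W_x @ x_t + W_h @ h + b)     # the state carries context from previous frames
print(h.round(3))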
RNN ARCHITECTURE FOR EMOTION DETECTION
The differences between the RNN and CNN algorithms in the field of speech emotion recognition
are as follows:
ADVANTAGES:
DISADVANTAGES:
1. Vanishing/Exploding Gradients
2. Computational Complexity
3. Limited Memory Capacity
4. Difficulty in Learning Long-Term Dependencies
5. Training Instability
CHAPTER 5
5 SYSTEM IMPLEMENTATION
Data extraction and processing for vocal emotion detection using deep learning
involve several steps to prepare the audio data for training and evaluation. Here is a detailed
description of each step: Collect audio recordings containing vocal expressions of emotions.
These recordings can be obtained from various sources such as public databases, online
repositories, or recorded specifically for the project. Remove background noise and artifacts from
the audio recordings using techniques like spectral subtraction, noise cancellation algorithms, or
wavelet denoising. Normalize the audio amplitude to ensure consistent volume levels across
different recordings. Segment the audio recordings into smaller, manageable segments (e.g.,
frames or windows) to facilitate feature extraction and analysis.
Each segment should capture a meaningful vocal expression of emotion. Extract spectral
features from the audio signals, such as Mel-frequency cepstral coefficients (MFCCs), spectral
centroid, spectral flux, and spectral roll-off. Capture temporal dynamics of the vocal signals,
including pitch contour, energy contour, and timing information. Extract prosodic features related
to intonation, rhythm, and stress patterns, such as pitch, intensity, duration, and speech rate.
Optionally, augment the dataset to increase its diversity and robustness. Data augmentation
techniques may include pitch shifting, time stretching, adding background noise, or simulating
reverberation effects. Annotate the audio recordings with labels indicating the corresponding
emotional states expressed in the vocal signals. Each recording should be labeled with one or
more emotion categories (e.g., happiness, sadness, anger).
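The normalization and segmentation steps described above can be prototyped with librosa and NumPy roughly as follows; the 25 ms frame length and 10 ms hop are common choices assumed here, not values prescribed by this project:
# Hypothetical preprocessing sketch: amplitude normalization and fixed-length framing.
import numpy as np
import librosa

def normalize_and_frame(path, frame_ms=25, hop_ms=10):
    y, sr = librosa.load(path, sr=22050)          # resample to a fixed rate
    y = y / (np.max(np.abs(y)) + 1e-9)            # peak-normalize the amplitude
    frame_length = int(sr * frame_ms / 1000)
    hop_length = int(sr * hop_ms / 1000)
    # split the signal into overlapping frames; result has shape (frame_length, n_frames)
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    return frames, sr

# frames, sr = normalize_and_frame('some_clip.wav')   # 'some_clip.wav' is a placeholder path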
Divide the dataset into training, validation, and test sets. The training set is used to train
the deep learning model, the validation set is used to tune hyperparameters and monitor
performance during training, and the test set is used to evaluate the final model's performance.
Encode the audio features and labels into a suitable format for input into the deep learning model.
This may involve converting audio signals into spectrograms or other representations suitable for
neural network processing. Implement data loading pipelines to efficiently load and preprocess
batches of audio data during model training and evaluation.
This may involve using data loading libraries or custom data pipelines in deep learning
frameworks like TensorFlow or PyTorch. Using these steps, researchers and developers can
effectively prepare the audio data for training deep learning models for vocal emotion detection,
enabling accurate and robust recognition of emotional states expressed in speech signals.
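A hedged sketch of the split-and-encode step with scikit-learn is given below; X and Y stand for the feature matrix and emotion labels produced by the earlier steps and are filled with random placeholders here:
# Hypothetical dataset split and label encoding (X and Y are placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X = np.random.rand(100, 40)                                                  # placeholder features
Y = np.random.choice(['happy', 'sad', 'angry', 'neutral'], size=(100, 1))    # placeholder labels

Y_onehot = OneHotEncoder().fit_transform(Y).toarray()   # one-hot vectors for categorical cross-entropy

# roughly 70% training, 15% validation, 15% test
X_train, X_tmp, Y_train, Y_tmp = train_test_split(X, Y_onehot, test_size=0.3, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_tmp, Y_tmp, test_size=0.5, random_state=42)
print(X_train.shape, X_val.shape, X_test.shape)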
In the training module, we teach the deep learning model to recognize emotions in vocal
expressions by showing it many examples of labeled audio data. The model learns to identify
patterns and features in the audio that correspond to different emotions. We adjust the model's
parameters during training to improve its accuracy in recognizing emotions. In the testing module,
we evaluate the trained model's performance by presenting it with new, unseen audio samples.
The model predicts the emotions in these samples based on its training and compares its
predictions to the true labels.
We measure how well the model performs using metrics like accuracy, which tells us the
percentage of correctly predicted emotions. This helps us assess the model's effectiveness and
determine if it's ready for real-world use.
In the output module for vocal emotion detection using deep learning, the system provides
predictions of the emotional states expressed in the input audio signals. Each audio segment is
classified into one or more predefined emotion categories, such as happiness, sadness, anger, or
neutral. The output may include confidence scores indicating the model's certainty about each
predicted emotion. Additionally, visualizations or graphical representations of the detected
emotions may be generated to aid interpretation. The output module plays a crucial role in
communicating the model's predictions to users or downstream applications for further analysis or
decision-making.
The output module in speech emotion recognition using deep learning is responsible for
interpreting the model's predictions and presenting them to the user in a comprehensible format.
This module plays a pivotal role in conveying the detected emotions accurately and meaningfully
to the end user. Upon receiving the model predictions, the output module maps the predicted
emotion labels generated by the model to human-understandable emotion categories, such as
happiness, sadness, anger, or neutral. These emotion labels are then displayed to the user through
a user interface or another output mechanism, ensuring easy interpretation. In addition to
displaying emotion labels, the output module may also provide supplementary information, such
as confidence scores or probabilities associated with each predicted emotion. These scores offer
insight into the model's confidence in its predictions, aiding in the assessment of the reliability of
the detected emotions. Furthermore, the output module may incorporate visualization techniques
to enhance the user experience.
For instance, graphical representations like bar charts, pie charts, or color-coded
visualizations can be employed to depict the distribution of predicted emotions or the temporal
evolution of emotional states over a speech segment. To ensure a seamless user experience, the
output module should handle any errors or exceptions gracefully, providing informative feedback
or prompts to the user in case of unexpected behavior or input data issues. Overall, the output
module serves as the interface between the underlying deep learning model and the end user,
facilitating clear, accurate, and intuitive communication of the detected emotions, thereby
enhancing the usability and effectiveness of the speech emotion recognition system.
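A minimal sketch of such an output step is shown below; the class order and the example probability vector are assumptions made purely for illustration:
# Hypothetical output-module sketch: map a softmax vector to an emotion label and a confidence score.
import numpy as np

EMOTIONS = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']  # assumed class order

def decode_prediction(probabilities):
    probabilities = np.asarray(probabilities)
    idx = int(np.argmax(probabilities))              # index of the most likely class
    return EMOTIONS[idx], float(probabilities[idx])  # label and the model's confidence in it

label, confidence = decode_prediction([0.05, 0.02, 0.08, 0.70, 0.05, 0.06, 0.04])
print(f'Predicted emotion: {label} (confidence {confidence:.2f})')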
5.3. SYSTEM DESIGN:
[Figures: system design diagrams for the training model and the testing model]
5.3.3 USE CASE DIAGRAM:
5.4 SYSTEM/SOFTWARE ARCHITECTURE:
5.6 SOFTWARE TESTING AND IMPLEMENTATION:
5.6.1 UNIT TESTING:
Unit testing is a software testing technique by means of which individual units of
software, i.e. groups of computer program modules, usage procedures, and operating procedures,
are tested to determine whether they are suitable for use. It is a method by which every
independent module is tested by the developer to determine whether there is an issue, and it is
concerned with the functional correctness of the independent modules. An individual component
may be either an individual function or a procedure, and unit testing of the software product is
carried out during the development of an application. In the SDLC or V-model, unit testing is the
first level of testing, done before integration testing. Unit testing is usually performed by
developers, although quality assurance engineers may also carry it out when developers are
reluctant to test.
5.6.3 VALIDATION TESTING:
Validation testing checks whether the developed product works according to what is mentioned
in the requirements document. Any deviation should be reported immediately, and that deviation
is thus called a bug. Tools like HP Quality Center, Selenium, Appium, etc. are used to perform
validation tests, and the test results can be stored there. A proper test plan, test execution runs,
defect reports, and reports & metrics are the important deliverables to be submitted.
CHAPTER 6
6 CONCLUSION
In conclusion, this report highlights the transformative potential of deep learning
techniques in the realm of vocal emotion recognition. Traditional methods, constrained by their
reliance on handcrafted features, are surpassed by the adaptability and efficacy of deep learning
algorithms such as CNNs, RNNs, and LSTMs. This shift opens the door to a more accurate and nuanced
understanding of emotional expressions in speech, crucial for applications in human-computer
interaction and psychological research. By harnessing the power of deep learning, we move closer
to creating empathetic systems capable of comprehending and responding to human intentions
and sentiments, ultimately enriching our digital communication experiences.
APPENDICES-SOURCE CODE
#Install Dependencies
(Remember to choose GPU in Runtime if not already selected. Runtime --> Change Runtime Type
--> Hardware accelerator --> GPU)
# clone Speech Emotion Detection using Deep Learning repository
!git clone https://github.com/CheyneyComputerScience/CREMA-D.git  # clone repo
%cd CREMA-D
!git reset --hard 886f1c03d839575afecb059accf74296fad395b6
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages
#clear_output()
print('Data source import complete.')  # display a message confirming that the dataset has been downloaded
We'll download our dataset from Kaggle, using the RAVDESS dataset. Note that the
implementation expects the dataset directory paths that define where the training and test data
are located.
#follow the link below to get your download code for the RAVDESS dataset
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA
SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil
CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING='toronto-emotional-speech-set-
tess:https%3A%2F%2F2.zoppoz.workers.dev%3A443%2Fhttps%2Fstorage.googleapis.com%2Fkaggle-data-
sets%2F316368%2F639622%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-
RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-
161607.iam.gserviceaccount.com%252F20240427%252Fauto%252Fstorage%252Fgoog4_reques
t%26X-Goog-Date%3D20240427T093703Z%26X-Goog-Expires%3D259200%26X-Goog-
SignedHeaders%3Dhost%26X-Goog-
Signature%3D6cfcda3239363b927e76a34ae5f3a3b6e0e2149a2e27f900587c0d5c976de56562c9b
807694f219e094c80dd0465bbf8522ee6c4a7279f2d0833d7d62a418aa7aeea301701669fc72f16c5c
201c377f85f8c71d76e14cfe1e8e6eaaab90f0ae554f3eac341147b32245bfcf6c2940d2a0d9f9c1982
e14952dd45f00198ac60f283052575a52a5b9d0cf5e788d5ad1c60f4a13b1d4a72ac8860ac0e846e2e
4b530c4d57f9dc6c31be8ba7d71cfa02ef1c7cc3d387b8cde85977d0339f2d8a6601d322893cc6d17
84f80f8335daa0f97403c07af43449dd8f5d44e41285049e4d6c40ab25d48575de5290e63b6f5266d
d56cfa8f0cfa47ca8d5dc24a95b5a82bb2'
KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'
try:
    os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
    pass
try:
    os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
    pass
After the download completes, individual training .wav and .mp3 files can be played back to
inspect the audio, its label, and the augmentation effects applied later in the notebook.
#displaying the path of the dataset:
ravdess = "/kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/"
ravdess_directory_list = os.listdir(ravdess)
print(ravdess_directory_list)
GROUND TRUTH TRAINING DATA:
# Run inference with trained weights: a pretrained checkpoint can later be run on the contents of
# the .mp3/.wav audio folders downloaded from CREMA-D (stored in the AudioWAV folder).
# Importing the datasets: RAVDESS, CREMA-D, TESS and SAVEE.
# Each RAVDESS filename encodes its metadata; for example, 03-01-06-01-02-01-12.wav decodes as:
#   Audio-only (03)
#   Speech (01)
#   Fearful (06)
#   Normal intensity (01)
#   Statement "dogs" (02)
#   1st Repetition (01)
#   12th Actor (12) - Female (as the actor ID number is even)
Crema = "/kaggle/input/cremad/AudioWAV/"
Tess = "/kaggle/input/toronto-emotional-speech-set-tess/tess toronto emotional speech set data/TESS Toronto emotional speech set data/"
Savee = "/kaggle/input/surrey-audiovisual-expressed-emotion-savee/ALL/"
#Preprocessing stage
# imports used throughout the preprocessing, visualization and feature-extraction code below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import librosa
import librosa.display
import IPython.display as ipd
from IPython.display import Audio

file_emotion = []
file_path = []
for i in ravdess_directory_list:
    # as there are 24 different actors in our previous directory we need to extract files for each actor.
    actor = os.listdir(ravdess + i)
    for f in actor:
        part = f.split('.')[0].split('-')
        # the third part in each file name represents the emotion associated with that file.
        file_emotion.append(int(part[2]))
        file_path.append(ravdess + i + '/' + f)
print(actor[0])
print(part[0])
print(file_path[0])
print(int(part[2]))
print(f)
03-01-06-02-01-01-07.wav
03
/kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/Actor_02/03-01-08-01
-01-01-02.wav
5
03-01-05-02-01-02-07.wav
Emotions Path
0 surprise /kaggle/input/ravdess-emotional-speech-audio/a...
1 neutral /kaggle/input/ravdess-emotional-speech-audio/a...
2 disgust /kaggle/input/ravdess-emotional-speech-audio/a...
3 disgust /kaggle/input/ravdess-emotional-speech-audio/a...
4 neutral /kaggle/input/ravdess-emotional-speech-audio/a...
______________________________________________
Emotions Path
1435 fear /kaggle/input/ravdess-emotional-speech-audio/a...
1436 angry /kaggle/input/ravdess-emotional-speech-audio/a...
1437 sad /kaggle/input/ravdess-emotional-speech-audio/a...
1438 disgust /kaggle/input/ravdess-emotional-speech-audio/a...
1439 angry /kaggle/input/ravdess-emotional-speech-audio/a...
neutral 288
surprise 192
disgust 192
fear 192
sad 192
happy 192
angry 192
Name: Emotions, dtype: int64
The dataset which is used in this project is RAVDESS (Ryerson Audio-Visual Database of
Emotional Speech and Song), which consists of 7,356 files recorded by 24 actors, 12 of them
male and 12 female, with a North American accent. The CREMA-D dataset consists of facial
and vocal emotion expressions in sentences spoken in a range of basic emotional states (happy,
sad, anger, fear, disgust, and neutral). In total, the CREMA-D dataset consists of 7,442 clips from
actors of different ethnic backgrounds, rated by multiple raters in three modalities: audio, visual,
and audio-visual.
crema_directory_list = os.listdir(Crema)
file_emotion = []
file_path = []
# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
disgust 1271
happy 1271
sad 1271
fear 1271
angry 1271
neutral 1087
Name: Emotions, dtype: int64
#Integration
# creating Dataframe using all the 4 dataframes we created so far.
data_path = pd.concat([ravdess_df, Crema_df, Tess_df, Savee_df], axis = 0)
data_path.to_csv("data_path.csv",index=False)
data_path.head() #It will return the top data of those SER database
print(data_path.Emotions.value_counts())
disgust 1923
fear 1923
sad 1923
happy 1923
angry 1923
neutral 1895
surprise 652
Name: Emotions, dtype: int64
#Data visualization
# the following diagram shows the number of samples for each emotion
data,sr = librosa.load(file_path[0])
sr
22050
# NOISE
def noise(data):
    noise_amp = 0.035*np.random.uniform()*np.amax(data)
    data = data + noise_amp*np.random.normal(size=data.shape[0])
    return data
# STRETCH
def stretch(data, rate=0.8):
    return librosa.effects.time_stretch(data, rate=rate)
# SHIFT
def shift(data):
    shift_range = int(np.random.uniform(low=-5, high=5)*1000)
    return np.roll(data, shift_range)
# PITCH
def pitch(data, sampling_rate, pitch_factor=0.7):
    return librosa.effects.pitch_shift(data, sr=sampling_rate, n_steps=pitch_factor)
# NORMAL AUDIO
import librosa.display
plt.figure(figsize=(12, 5))
librosa.display.waveshow(y=data, sr=sr)
ipd.Audio(data,rate=sr)
#AUDIO WITH NOISE
x = noise(data)
plt.figure(figsize=(12,5))
librosa.display.waveshow(y=x, sr=sr)
ipd.Audio(x, rate=sr)
# STRETCHED AUDIO
x = stretch(data)
plt.figure(figsize=(12, 5))
librosa.display.waveshow(y=x, sr=sr)
ipd.Audio(x, rate=sr)
#AUDIO WITH PITCH
x = pitch(data, sr)
plt.figure(figsize=(12,5))
librosa.display.waveshow(y=x, sr=sr)
ipd.Audio(x, rate=sr)
#FEATURE EXTRACTION
def zcr(data, frame_length, hop_length):
    zcr = librosa.feature.zero_crossing_rate(y=data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(zcr)
def rmse(data, frame_length=2048, hop_length=512):
    rmse = librosa.feature.rms(y=data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(rmse)
def mfcc(data, sr, frame_length=2048, hop_length=512, flatten: bool = True):
    mfcc = librosa.feature.mfcc(y=data, sr=sr)
    return np.squeeze(mfcc.T) if not flatten else np.ravel(mfcc.T)
def extract_features(data, sr=22050, frame_length=2048, hop_length=512):
    result = np.array([])
    result = np.hstack((result,
                        zcr(data, frame_length, hop_length),
                        rmse(data, frame_length, hop_length),
                        mfcc(data, sr, frame_length, hop_length)
                        ))
    return result
def get_features(path, duration=2.5, offset=0.6):
    data, sr = librosa.load(path, duration=duration, offset=offset)
    aud = extract_features(data)
    audio = np.array(aud)
    noised_audio = noise(data)
    aud2 = extract_features(noised_audio)
    audio = np.vstack((audio, aud2))
    pitched_audio = pitch(data, sr)
    aud3 = extract_features(pitched_audio)
    audio = np.vstack((audio, aud3))
    pitched_audio1 = pitch(data, sr)
    pitched_noised_audio = noise(pitched_audio1)
    aud4 = extract_features(pitched_noised_audio)
    audio = np.vstack((audio, aud4))
    return audio
import multiprocessing as mp
print("Number of processors: ", mp.cpu_count())
Number of processors: 2
This code is an example of how to use the joblib library to process multiple audio files in
parallel using the process_feature function, and it uses the timeit library to measure the time
taken to process them. Here's a breakdown of what the code does:
The from joblib import Parallel, delayed statement imports the Parallel and delayed functions
from the joblib library. The start = timeit.default_timer() statement starts a timer to measure the
time taken to process the audio files. The process_feature function processes a single audio file
by extracting its features using the get_features function and appending the corresponding X and
Y values to the X and Y lists. The paths and emotions variables extract the paths and emotions
from the data_path DataFrame. The Parallel function runs process_feature in parallel for each
audio file, using the delayed function to wrap it. The results variable contains the X and Y values
for each audio file, and the X and Y lists are populated with these values using the extend
method. The stop = timeit.default_timer() statement stops the timer, and print('Time: ', stop -
start) prints the time taken to process the audio files. Overall, this code demonstrates how to use
the joblib library to process multiple audio files in parallel, which can significantly reduce the
processing time for large datasets.
The .extend() method increases the length of the list by the number of elements that are
provided to it, so it can be used to add multiple elements to the list at once.
import timeit
from joblib import Parallel, delayed
start = timeit.default_timer()
# Define a function to get features for a single audio file
def process_feature(path, emotion):
    features = get_features(path)
    X = []
    Y = []
    for ele in features:
        X.append(ele)
        # append the emotion once for each version of the audio (original plus the augmented copies)
        Y.append(emotion)
    return X, Y
paths = data_path.Path
emotions = data_path.Emotions
# Run the loop in parallel
results = Parallel(n_jobs=-1)(delayed(process_feature)(path, emotion) for (path, emotion) in zip(paths, emotions))
# Collect the results
X = []
Y = []
for result in results:
    x, y = result
    X.extend(x)
    Y.extend(y)
stop = timeit.default_timer()
print('Time: ', stop - start)
#Paths
paths[:5]
#Labels
labels[:5]
#Create a DataFrame
df = pd.DataFrame()
df['speech'] = paths
df['label'] = labels
df.head() #It will return the top rows of the dataframe
0 /kaggle/input/toronto-emotional-speech-set-tes... ps
1 /kaggle/input/toronto-emotional-speech-set-tes... ps
2 /kaggle/input/toronto-emotional-speech-set-tes... ps
3 /kaggle/input/toronto-emotional-speech-set-tes... ps
4 /kaggle/input/toronto-emotional-speech-set-tes... ps
#Counts
df['label'].value_counts()
label
ps 400
neutral 400
disgust 400
happy 400
fear 400
angry 400
sad 400
Name: count, dtype: int64
#Exploratory Data Analysis
def spectrogram(data, sr, emotion):
    x = librosa.stft(data)
    xdb = librosa.amplitude_to_db(abs(x))  # amplitude to decibels
    plt.figure(figsize=(10, 4))
    plt.title(emotion, size=20)
    librosa.display.specshow(xdb, sr=sr, x_axis='time', y_axis='hz')
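The waveplot helper called in the cells below is not present in the extracted listing; a minimal definition consistent with how it is used could be:
# Assumed definition of the missing waveplot helper (not part of the original listing).
def waveplot(data, sr, emotion):
    plt.figure(figsize=(10, 4))
    plt.title(emotion, size=20)
    librosa.display.waveshow(y=data, sr=sr)
    plt.show()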
#Fear
emotion = 'fear'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)
#Neutral
emotion = 'neutral'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)
#ps code
emotion = 'ps'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)
#Happy
emotion = 'happy'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)
#Sad code
emotion = 'sad'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)
#Disgust code
emotion = 'disgust'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
Audio(path)
#Feature Extraction
def extract_mfcc(filename):
    y, sr = librosa.load(filename, duration=3, offset=0.5)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    return mfcc
extract_mfcc(df['speech'][0])
X_mfcc = df['speech'].apply(lambda x: extract_mfcc(x))
X_mfcc
(2800, 40)
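The cell defining and training the model is missing from the extracted listing; a hedged reconstruction that is consistent with the 40-dimensional MFCC input and with the 35 steps per epoch in the training log below (but not necessarily the exact architecture used) is:
# Assumed model definition and training call (the original cell is missing from the listing).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import OneHotEncoder

X = np.expand_dims(np.array([x for x in X_mfcc]), -1)                        # shape (2800, 40, 1)
y = OneHotEncoder().fit_transform(np.array(df['label'])[:, None]).toarray()  # shape (2800, 7)

model = Sequential([
    LSTM(256, input_shape=(40, 1)),
    Dropout(0.2),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(7, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# batch_size=64 with an 80/20 split of 2800 samples gives the 35 training steps seen below
history = model.fit(X, y, validation_split=0.2, epochs=50, batch_size=64)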
Epoch 1/50
35/35 [==============================] - 10s 197ms/step - loss: 1.2568 - accuracy:
0.5147 - val_loss: 1.0349 - val_accuracy: 0.5804
Epoch 2/50
35/35 [==============================] - 7s 208ms/step - loss: 0.4716 - accuracy:
0.8219 - val_loss: 1.0194 - val_accuracy: 0.6161
Epoch 3/50
35/35 [==============================] - 6s 169ms/step - loss: 0.2969 - accuracy:
0.9067 - val_loss: 0.7541 - val_accuracy: 0.7571
Epoch 4/50
35/35 [==============================] - 7s 209ms/step - loss: 0.2257 - accuracy:
0.9259 - val_loss: 0.9466 - val_accuracy: 0.7054
Epoch 5/50
35/35 [==============================] - 6s 186ms/step - loss: 0.1249 - accuracy:
0.9603 - val_loss: 0.9181 - val_accuracy: 0.7589
Epoch 6/50
35/35 [==============================] - 7s 202ms/step - loss: 0.1301 - accuracy:
0.9580 - val_loss: 1.2963 - val_accuracy: 0.6500
Epoch 7/50
35/35 [==============================] - 6s 167ms/step - loss: 0.0951 - accuracy:
0.9705 - val_loss: 1.4729 - val_accuracy: 0.6339
Epoch 8/50
35/35 [==============================] - 7s 200ms/step - loss: 0.0914 - accuracy:
0.9705 - val_loss: 0.7101 - val_accuracy: 0.8089
Epoch 9/50
35/35 [==============================] - 6s 165ms/step - loss: 0.1235 - accuracy:
0.9580 - val_loss: 0.6240 - val_accuracy: 0.8250
Epoch 10/50
35/35 [==============================] - 9s 248ms/step - loss: 0.0636 - accuracy:
0.9804 - val_loss: 0.6841 - val_accuracy: 0.8411
Epoch 11/50
35/35 [==============================] - 6s 168ms/step - loss: 0.0542 - accuracy:
0.9812 - val_loss: 1.6489 - val_accuracy: 0.6589
Epoch 12/50
35/35 [==============================] - 7s 199ms/step - loss: 0.0590 - accuracy:
0.9799 - val_loss: 2.3056 - val_accuracy: 0.5643
Epoch 13/50
35/35 [==============================] - 6s 161ms/step - loss: 0.0572 - accuracy:
0.9790 - val_loss: 1.3272 - val_accuracy: 0.7232
Epoch 14/50
35/35 [==============================] - 7s 205ms/step - loss: 0.0676 - accuracy:
0.9790 - val_loss: 1.7888 - val_accuracy: 0.6196
Epoch 15/50
35/35 [==============================] - 6s 168ms/step - loss: 0.0426 - accuracy:
0.9853 - val_loss: 1.4516 - val_accuracy: 0.7179
Epoch 16/50
35/35 [==============================] - 7s 209ms/step - loss: 0.0441 - accuracy:
0.9853 - val_loss: 1.3134 - val_accuracy: 0.7554
Epoch 17/50
35/35 [==============================] - 6s 169ms/step - loss: 0.0778 - accuracy:
0.9781 - val_loss: 1.2486 - val_accuracy: 0.7179
Epoch 18/50
35/35 [==============================] - 7s 194ms/step - loss: 0.0718 - accuracy:
0.9763 - val_loss: 1.0359 - val_accuracy: 0.7661
Epoch 19/50
35/35 [==============================] - 6s 162ms/step - loss: 0.0388 - accuracy:
0.9862 - val_loss: 1.4881 - val_accuracy: 0.6929
Epoch 20/50
35/35 [==============================] - 7s 204ms/step - loss: 0.0199 - accuracy:
0.9942 - val_loss: 1.5691 - val_accuracy: 0.7143
Epoch 21/50
35/35 [==============================] - 6s 165ms/step - loss: 0.0306 - accuracy:
0.9893 - val_loss: 1.7848 - val_accuracy: 0.6982
Epoch 22/50
35/35 [==============================] - 7s 193ms/step - loss: 0.0378 - accuracy:
0.9879 - val_loss: 1.9882 - val_accuracy: 0.6161
Epoch 23/50
35/35 [==============================] - 6s 167ms/step - loss: 0.0437 - accuracy:
0.9884 - val_loss: 2.8374 - val_accuracy: 0.5179
Epoch 24/50
35/35 [==============================] - 6s 183ms/step - loss: 0.0501 - accuracy:
0.9844 - val_loss: 2.5636 - val_accuracy: 0.5500
Epoch 25/50
35/35 [==============================] - 6s 173ms/step - loss: 0.0553 - accuracy:
0.9871 - val_loss: 1.9561 - val_accuracy: 0.6339
Epoch 26/50
35/35 [==============================] - 7s 192ms/step - loss: 0.0540 - accuracy:
0.9817 - val_loss: 1.7638 - val_accuracy: 0.6768
Epoch 27/50
35/35 [==============================] - 6s 179ms/step - loss: 0.0252 - accuracy:
0.9915 - val_loss: 2.2529 - val_accuracy: 0.6179
Epoch 28/50
35/35 [==============================] - 7s 198ms/step - loss: 0.0256 - accuracy:
0.9920 - val_loss: 2.2179 - val_accuracy: 0.6429
Epoch 29/50
35/35 [==============================] - 6s 161ms/step - loss: 0.0350 - accuracy:
0.9906 - val_loss: 1.4482 - val_accuracy: 0.7821
Epoch 30/50
35/35 [==============================] - 7s 199ms/step - loss: 0.0214 - accuracy:
0.9911 - val_loss: 2.1948 - val_accuracy: 0.6696
Epoch 31/50
35/35 [==============================] - 6s 165ms/step - loss: 0.0231 - accuracy:
0.9920 - val_loss: 1.4114 - val_accuracy: 0.7768
Epoch 32/50
35/35 [==============================] - 8s 235ms/step - loss: 0.0214 - accuracy:
0.9946 - val_loss: 1.7125 - val_accuracy: 0.7089
Epoch 33/50
35/35 [==============================] - 7s 202ms/step - loss: 0.0176 - accuracy:
0.9951 - val_loss: 2.0543 - val_accuracy: 0.7232
Epoch 34/50
35/35 [==============================] - 7s 197ms/step - loss: 0.0201 - accuracy:
0.9955 - val_loss: 2.1375 - val_accuracy: 0.6679
Epoch 35/50
35/35 [==============================] - 6s 168ms/step - loss: 0.0062 - accuracy:
0.9978 - val_loss: 1.6907 - val_accuracy: 0.7554
Epoch 36/50
35/35 [==============================] - 7s 198ms/step - loss: 0.0392 - accuracy:
0.9915 - val_loss: 1.5130 - val_accuracy: 0.7536
Epoch 37/50
35/35 [==============================] - 7s 194ms/step - loss: 0.0463 - accuracy:
0.9848 - val_loss: 1.2128 - val_accuracy: 0.7679
Epoch 38/50
35/35 [==============================] - 7s 210ms/step - loss: 0.0130 - accuracy:
0.9969 - val_loss: 3.6133 - val_accuracy: 0.4750
Epoch 39/50
35/35 [==============================] - 6s 169ms/step - loss: 0.0227 - accuracy:
0.9937 - val_loss: 2.5343 - val_accuracy: 0.6054
Epoch 40/50
35/35 [==============================] - 7s 205ms/step - loss: 0.0200 - accuracy:
0.9942 - val_loss: 1.8955 - val_accuracy: 0.7179
Epoch 41/50
35/35 [==============================] - 6s 175ms/step - loss: 0.0209 - accuracy:
0.9937 - val_loss: 2.2416 - val_accuracy: 0.6875
Epoch 42/50
35/35 [==============================] - 7s 202ms/step - loss: 0.0298 - accuracy:
0.9920 - val_loss: 2.2757 - val_accuracy: 0.6071
Epoch 43/50
35/35 [==============================] - 6s 170ms/step - loss: 0.0236 - accuracy:
0.9933 - val_loss: 2.0820 - val_accuracy: 0.6339
Epoch 44/50
35/35 [==============================] - 7s 198ms/step - loss: 0.0222 - accuracy:
0.9920 - val_loss: 3.0279 - val_accuracy: 0.6071
Epoch 45/50
35/35 [==============================] - 6s 170ms/step - loss: 0.0051 - accuracy:
0.9982 - val_loss: 2.7979 - val_accuracy: 0.6125
Epoch 46/50
35/35 [==============================] - 7s 195ms/step - loss: 0.0040 - accuracy:
0.9982 - val_loss: 2.1375 - val_accuracy: 0.6750
Epoch 47/50
35/35 [==============================] - 6s 168ms/step - loss: 0.0277 - accuracy:
0.9942 - val_loss: 2.2060 - val_accuracy: 0.6250
Epoch 48/50
35/35 [==============================] - 7s 212ms/step - loss: 0.0140 - accuracy:
0.9942 - val_loss: 2.1846 - val_accuracy: 0.6321
Epoch 49/50
35/35 [==============================] - 6s 167ms/step - loss: 0.0405 - accuracy:
0.9875 - val_loss: 1.8148 - val_accuracy: 0.6143
Epoch 50/50
35/35 [==============================] - 6s 171ms/step - loss: 0.0111 - accuracy:
0.9960 - val_loss: 3.9371 - val_accuracy: 0.5661
epochs = list(range(50))
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
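These lists can then be plotted with matplotlib, for example:
# Plot training versus validation accuracy over the 50 epochs.
import matplotlib.pyplot as plt

plt.plot(epochs, acc, label='train accuracy')
plt.plot(epochs, val_acc, label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()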
REFERENCES
1) Kim, S.; Bang, J.; Kim, D. (2019). "Speech Emotion Recognition Using Convolutional and
Recurrent Neural Networks."
2) Zeng, Y.; Li, Z.; Tang, Z.; Chen, Z.; Ma, H. "Heterogeneous Graph Convolution Based on
In-Domain Self-Supervision for Multimodal Sentiment Analysis." Expert Systems with
Applications, 2023.
3) Kartikeya Srinivas Chintalapudi; Irfan Ali Khan Patan; Harsh Vardhan Sontineni; Venkata
Saroj Kushwanth Muvala (2023). "Speech Emotion Detection Using Deep Learning." 2023
International Conference on Computer Communication and Informatics (ICCCI).
4) Shahed Mohammadi; Ali Hashemi; Haniye Zandiye; Niloufar Hemati (2023). "Speech Emotion
Detection Using Deep Learning Techniques and Augmented Features." International Conference
on Electrical Engineering, Computer Science and Informatics (EECSI).
5) Tae-Wan Kim; Keun-Chang Kwak (2024). "Speech Emotion Recognition Using Deep Learning
Transfer Models and Explainable Techniques." Department of Electronics Engineering,
Interdisciplinary Program in IT-Bio Convergence System, Chosun University, Gwangju 61452,
Republic of Korea.
6) Zhang, Zixing, et al. "A Survey on Deep Learning for Multimodal Data Fusion." Information
Fusion, 2020.
8) Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. "Speech Emotion Recognition with Co-Attention
Based Multi-Level Acoustic Information." In Proceedings of the IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022.
9) Zeng, Y.; Li, Z.; Tang, Z.; Chen, Z.; Ma, H. "Heterogeneous Graph Convolution Based on
In-Domain Self-Supervision for Multimodal Sentiment Analysis." Expert Systems with
Applications, 2023.
10) Vandana Singh; Swati Prasad. "Speech Emotion Recognition System Using Gender Dependent
Convolution Neural Network." Procedia Computer Science, Volume 218, 2023.
11) Chunsheng Xu; Yunqing Liu; Wenjun Song; Zonglin Liang; Xing Chen (2024). "A New
Network for Speech Emotion Recognition Research." School of Electronic Information
Engineering, Changchun University of Science and Technology, Changchun 130022, China.
13) Gang Liu; Shifang Cai; Ce Wang (2023). "Speech Emotion Recognition Based on Emotion
Perception."
14) Avvari Pavithra; Sukanya Ledalla; J. Sirisha Devi; Golla Dinesh; Monika Singh; G. Vijendar
Reddy (2023). "Deep Learning-Based Speech Emotion Recognition: An Investigation into a
Sustainably Emotion-Speech Relationship." 15th International Conference on Materials
Processing and Characterization (ICMPC 2023).