SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
Ramapuram, Chennai – 600 089
SCHOOL OF COMPUTER SCIENCE AND ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
18CSP109L - PROJECT
18CSP111L - PROJECT for INTERNSHIP STUDENTS
BATCH NUMBER : 10
Speech Emotion Detection Using Machine Learning
Team Members:
RA2111003020006 MOHNISH SINGH
RA2111003020018 SOURAV KARMAKAR
RA2111003020029 PRANAMYA PATRIKAR
IV Year, CSE
SRMIST, Ramapuram Campus

Supervisor:
Dr. Geetha T V
Assistant Professor
Department of Computer Science and Engineering
SRMIST, Ramapuram
Agenda
• Abstract
• Scope and Motivation
• Introduction
• Literature Survey
• Objectives
• Problem Statement
• Proposed Work
• Architecture Diagram/Flow Diagram/Block Diagram
• Novel idea
• Modules
• Module Description
• Software & Hardware Requirements
• References
• Way forward towards Outcome (Research Paper/Patent)
ABSTRACT
Emotion recognition from speech signals is an important but challenging part of
Human-Computer Interaction (HCI). Many techniques have been explored to analyze
speech and classify emotions accurately. In recent years, machine learning approaches
have gained significant attention for this task due to their ability to identify patterns in
speech features. This paper provides an overview of various machine learning methods
used for speech emotion recognition, highlighting commonly used datasets, types of
emotions detected, key contributions in the field, and existing challenges.
In our work, we will implement traditional machine learning techniques such as K-
Nearest Neighbors (KNN) and Support Vector Machine (SVM), which are well-known
for their effectiveness in classification tasks. Additionally, we will explore deep
learning-based methods to enhance recognition accuracy by capturing complex speech
features. By comparing these approaches, we aim to analyze their performance and
determine the most efficient method for speech emotion recognition.
Scope and Motivation
• Our project aims to develop a Speech Emotion Recognition (SER) system that can accurately classify emotions such as anger, happiness, sadness, fear, surprise, and disgust from speech using the RAVDESS dataset, with a target of at least 90% accuracy.
• The project sets the foundation for integrating speech-based emotion recognition with other modalities, such as facial expressions, gestures, and physiological signals, to enhance accuracy.
• Understanding emotions from speech can significantly enhance AI-human
interactions by making systems more empathetic and responsive. This technology
has applications in mental health diagnostics, stress detection, emotion-based
therapy, and user experience enhancement. By developing a high-accuracy SER
model, we can contribute to more natural human-computer communication,
improving AI assistants and supporting emotional well-being in various real-world
applications.
Introduction
The primary objective of Speech Emotion Recognition (SER) is to enhance human-to-
machine interaction by enabling systems to understand and respond to human emotions
effectively. While significant progress has been made in the field of Speech
Recognition (SR) over the years, there is still a need for improved emotion detection
capabilities to make interactions more natural and intuitive. Developing a reliable SER
system can lead to advancements in virtual assistants, customer support, healthcare,
and AI-driven applications, making them more empathetic and responsive. This project
focuses on classifying emotions from speech using the RAVDESS dataset, leveraging
feature extraction techniques like MFCC, Chroma, and Mel Spectrogram, and training
models such as CNN and LSTM to achieve at least 90% accuracy. With potential
applications in mental health monitoring, emotion-based AI systems, and real-time
human-computer interaction, this research aims to bridge the gap between machines
and human emotions, improving the overall user experience.
Literature Survey
S.No. | Title of the Paper | Year | Author | Journal/Conference | Inference
1 | A survey of affect recognition methods: Audio, visual, and spontaneous expressions | 2009 | Zeng, Z., Pantic, M., Roisman, G. I., & Huang, T. S. | IEEE Transactions on Pattern Analysis and Machine Intelligence | A comprehensive survey of methods for emotion recognition using audio, video, and multimodal approaches.
2 | Speech emotion recognition using hidden Markov models | 2011 | El Ayadi, M., Kamel, M. S., & Karray, F. | Speech Communication | A foundational paper that investigates the use of hidden Markov models (HMMs) for speech emotion recognition.
Literature Survey
S.No. | Title of the Paper | Year | Author | Journal/Conference | Inference
3 | Speech emotion recognition methods, datasets, and applications | 2020 | Hossain, S., & Jassim, H. | IEEE Symposium on Computational Intelligence and Data Mining (CIDM) | Reviews different methods for speech emotion recognition and their applications in real-world systems.
4 | The first detection of emotion in speech using a large database | 2003 | Batliner, A., et al. | Proceedings of the 15th International Conference on Artificial Intelligence (ICAI) | Discusses emotion detection in speech using a large, labelled speech database for training emotion classifiers.
Literature Survey
S.No. | Title of the Paper | Year | Author | Journal/Conference | Inference
5 | Speech Emotion Recognition and Classification Using Hybrid Deep CNN and LSTM Models | 2023 | Md. Imran Hossain, Md. Mojahidul Islam, Tania Nahrin, Md. Rashed, Md. Atiqur | Multimedia Tools and Applications | Addresses the challenges of accurately detecting emotions using a hybrid deep CNN and LSTM model.
6 | Speech Emotion Recognition Based on Graph-LSTM Neural Network | 2023 | Yan Li, Yapeng Wang, Xu Yang, and Sio-Kei Im | EURASIP Journal on Audio, Speech, and Music Processing | Proposes a Graph-LSTM neural network for speech emotion recognition.
Literature Survey
S.No. | Title of the Paper | Year | Author | Journal/Conference | Inference
7 | Evaluating Raw Waveforms with Deep Learning Frameworks for Speech Emotion Recognition | 2023 | Md. Imran Hossain, Md. Atiqur | arXiv preprint | Investigates the feasibility of feeding raw audio waveforms directly into deep learning frameworks for speech emotion recognition.
8 | Capturing Spectral and Long-Term Contextual Information for Speech Emotion Recognition Using Deep Learning Techniques | 2023 | Md. Maksudul Haque, Samiul Islam, and Abu Jobayer Md. Sadat | arXiv preprint | Proposes an ensemble model combining a Graph Convolutional Network (GCN) for textual data with the HuBERT transformer for audio signals to address limitations in traditional SER approaches.
Objectives
The main objective of this project is to develop a Speech Emotion Recognition (SER)
system that can accurately classify all the emotions present in the RAVDESS dataset
using machine learning and deep learning models. Our goal is to achieve at least 90%
overall accuracy or higher, ensuring high precision and reliability in emotion
classification. We will leverage advanced feature extraction techniques such as MFCC,
Chroma Features, and Mel Spectrogram to capture essential speech characteristics. By
using traditional ML models along with deep learning architectures like CNN and
LSTM, we aim to build a highly effective system that can accurately recognize
emotions, improving AI applications in customer service, mental health monitoring,
virtual assistants, and human-computer interaction. The final model will be evaluated
rigorously using accuracy, precision, recall, and F1-score to ensure its effectiveness in
real-world applications. Additionally, this project will explore potential deployment
options, making it accessible for various industries that rely on emotional analysis from
speech.
Problem Statement
• Speech Emotion Recognition (SER) systems often struggle to work effectively
in real-world situations. Factors like background noise, different recording
conditions, and variations in speaker accents and speaking styles make it
difficult for these systems to generalize well. Many existing models perform
well in controlled environments but fail when applied to real-life scenarios,
such as customer service calls, in-car voice assistants, or mental health
monitoring apps. This lack of reliability limits their practical use.
• To address this, our project aims to develop a more robust and accurate SER
model that can handle noisy environments and speaker variations while
maintaining high accuracy. By leveraging deep learning techniques and
extracting meaningful speech features, we aim to build a system that can
accurately detect emotions even in challenging acoustic conditions.
Proposed Work-Block Diagram
Proposed-Novel Idea
Our novel idea is to improve speech emotion recognition by:
• Combining Multiple Feature Extraction Techniques – Instead of relying solely on
MFCC, we will integrate Chroma Features and Mel Spectrogram to capture a more
comprehensive representation of speech emotions.
• Hybrid Deep Learning Model – We propose a CNN + LSTM architecture to
leverage CNN’s ability to extract spatial features from spectrograms and LSTM’s
strength in capturing temporal dependencies in speech signals. This hybrid model
aims to achieve higher accuracy (>90%) compared to traditional machine learning
models.
• Robust Noise-Resistant Training – We will introduce data augmentation techniques (such as adding background noise, pitch shifting, and speed variations) to make the model more resilient to real-world environments; a minimal augmentation sketch follows this list.
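The augmentation step can be illustrated with a short librosa-based sketch. This is only a minimal example of the techniques named above (additive noise, pitch shifting, and speed variation); the function name, noise level, and shift amounts are illustrative assumptions, not the final training pipeline.

```python
import numpy as np
import librosa

def augment(y, sr):
    """Return noisy, pitch-shifted, and time-stretched copies of one waveform,
    mirroring the augmentation strategy described above (illustrative values)."""
    noisy = y + 0.005 * np.random.randn(len(y))                  # additive background noise
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # shift pitch up 2 semitones
    stretched = librosa.effects.time_stretch(y, rate=1.1)        # speed up by 10%
    return [noisy, pitched, stretched]
```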
Proposed-Modules
The key modules are
1. Data Collection & Preprocessing Module
2. Feature Extraction Module
3. Model Training & Classification Module
4. Emotion Detection & Prediction Module
5. Deployment & Application Module
MODULES DESCRIPTION
• Data Collection & Preprocessing Module
The dataset is loaded and split into 90% training and 10% testing sets. Audio files are processed, and extracted
features are converted into NumPy arrays. StandardScaler is applied to normalize the feature values, ensuring
consistency in the model's learning process.
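A minimal sketch of the split-and-scale step described above, assuming the extracted features are already collected in a NumPy array X with integer emotion labels y (both names are placeholders for the outputs of the feature extraction module):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_and_scale(X, y, test_size=0.10, seed=42):
    # 90% training / 10% testing split, stratified so every emotion appears in both sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)

    # Fit the scaler on the training data only, then apply it to both splits
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, X_test, y_train, y_test, scaler
```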
• Feature Extraction Module
Audio signals are transformed into meaningful features such as MFCC (Mel-Frequency Cepstral Coefficients), Spectrograms, and Chroma features. These
features capture frequency, pitch, and intensity variations, helping distinguish different emotions. The extracted features are structured for deep learning
model inputs.
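The extraction step might look like the following librosa sketch, which averages MFCC, chroma, and mel-spectrogram frames over time to build one fixed-length vector per file; the function name, sampling rate, and number of MFCCs are assumptions chosen for illustration:

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=40):
    """Load one audio file and return a fixed-length feature vector
    built from MFCC, chroma, and mel-spectrogram statistics."""
    y, sr = librosa.load(path, sr=sr)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # cepstral features
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # pitch-class energy
    mel = librosa.feature.melspectrogram(y=y, sr=sr)         # mel-scaled spectrogram

    # Average each feature over time to obtain one vector per file
    return np.hstack([mfcc.mean(axis=1), chroma.mean(axis=1), mel.mean(axis=1)])
```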
• Model Training & Classification Module
A CNN-based deep learning model is built to classify emotions using extracted features. It consists of convolutional layers, pooling layers, batch
normalization, fully connected layers, and a softmax activation function. The model is trained for 50 epochs using the Adam optimizer and evaluated using
accuracy, confusion matrix, and classification reports.
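A hedged Keras sketch of the kind of CNN described in this module is shown below; the exact layer sizes, dropout rate, and the use of integer-encoded labels (hence sparse categorical cross-entropy) are assumptions, not the final architecture:

```python
from tensorflow.keras import layers, models

def build_cnn(input_dim, n_classes):
    # 1-D CNN over the feature vector: conv -> batch norm -> pooling, then dense + softmax
    model = models.Sequential([
        layers.Reshape((input_dim, 1), input_shape=(input_dim,)),
        layers.Conv1D(64, kernel_size=5, activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Example usage with the splits from the preprocessing sketch:
# model = build_cnn(input_dim=X_train.shape[1], n_classes=8)
# model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))
```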
• Emotion Detection & Prediction Module
Emotion Detection in speech involves analyzing audio signals to classify emotions such as happiness, sadness, anger, or neutrality. This is achieved by
passing extracted features (MFCCs, spectrograms, or chroma features) through a deep learning model, typically a Convolutional Neural Network (CNN). The
model learns patterns in speech features and maps them to corresponding emotional states. After training, the model predicts emotions from new speech
inputs with a certain level of accuracy.
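Prediction on a new recording can be sketched by reusing the extraction and scaling steps from the earlier snippets; the emotion label ordering below is a hypothetical mapping and must match whatever encoding was used during training:

```python
import numpy as np

# Hypothetical label ordering; the actual mapping depends on how the RAVDESS labels were encoded.
EMOTIONS = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

def predict_emotion(model, scaler, audio_path):
    # Reuse the feature extraction and scaling steps defined in the earlier sketches
    features = extract_features(audio_path).reshape(1, -1)
    features = scaler.transform(features)
    probs = model.predict(features)[0]
    return EMOTIONS[int(np.argmax(probs))], float(np.max(probs))
```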
• Deployment & Application Module
The trained model is used to make predictions on the test dataset, determining the emotions in speech samples based on learned patterns. To evaluate its
performance, various metrics such as accuracy, classification report, and confusion matrix are calculated, providing insights into how well the model
distinguishes between different emotions.
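Continuing the earlier sketches, the evaluation step could be computed with scikit-learn as follows (model, X_test, y_test, and EMOTIONS refer to the illustrative names introduced above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_prob = model.predict(X_test)        # class probabilities from the trained CNN
y_pred = np.argmax(y_prob, axis=1)    # predicted emotion indices

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=EMOTIONS))
print(confusion_matrix(y_test, y_pred))
```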
Software & Hardware Requirements
Software Requirements
• Programming Language: Python (with libraries such as TensorFlow and Keras)
• Audio Processing Library: librosa
• Other Libraries: Scikit-learn, Matplotlib, Seaborn, Pandas, NumPy
• Dataset: RAVDESS
Hardware Requirements
• Processor: Intel Core i5 (11th Gen)
• RAM: 16GB
Module 1-Outcome
• Data Collection & pre-processing is responsible for gathering audio data, cleaning
it, and preparing it for feature extraction.
• The audio files from the RAVDESS dataset have been successfully loaded.
• Converted all files to a standard sampling rate of 16 kHz.
• Stored the dataset in a structured format with labels.
• Removed background noise and adjusted volume levels to maintain consistency.
• Removed silent segments from the audio and applied normalization (a minimal sketch of these steps follows this list).
• Represented the pre-processed audio as waveform plots.
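A minimal sketch of the resampling, silence-trimming, and normalization steps listed above, assuming librosa for loading and soundfile for saving; the function name and thresholds are illustrative, and the noise-removal and volume-adjustment steps are not shown:

```python
import librosa
import soundfile as sf

def preprocess(path, out_path, target_sr=16000, top_db=30):
    """Resample to 16 kHz, trim leading/trailing silence, and peak-normalize one file."""
    y, sr = librosa.load(path, sr=target_sr)        # load and resample to 16 kHz
    y, _ = librosa.effects.trim(y, top_db=top_db)   # drop silent segments at the edges
    y = y / max(abs(y).max(), 1e-9)                 # peak normalization
    sf.write(out_path, y, target_sr)                # save in a standard format
    return y, target_sr
```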
References
• El Ayadi, M., Kamel, M. S., & Karray, F. (2011). "Speech emotion recognition
using hidden Markov models." Speech Communication, 53(5), 720-737.
• Schuller, B., et al. (2011). "Speech emotion recognition: Two-level classification
approach." Speech Communication, 53(9), 1062-1070.
• Zeng, Z., Pantic, M., Roisman, G. I., & Huang, T. S. (2009). "A survey of affect
recognition methods: Audio, visual, and spontaneous expressions." IEEE
Transactions on Pattern Analysis and Machine Intelligence, 31(1), 39-58.
• Batliner, A., et al. (2003). "The first detection of emotion in speech using a large
database." Proceedings of the 15th International Conference on Artificial
Intelligence (ICAI), 402-407.
• Hughes, D. L., & Gish, H. (1992). "Speech recognition using speech emotion
features." Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), 537-540.
• Haque, M. M., Islam, S., & Sadat, A. J. M. (2023). "Capturing Spectral and Long-Term Contextual Information for Speech Emotion Recognition Using Deep Learning Techniques." arXiv preprint.
References
• Li, Y., Wang, Y., Yang, X., & Im, S.-K. (2023). "Speech Emotion Recognition Based on Graph-LSTM Neural Network." EURASIP Journal on Audio, Speech, and Music Processing, 2023(1), Article 18.
• Hossain, M. I., Islam, M. M., Nahrin, T., Rashed, M., & Rahman, M. A. (2024). "Speech Emotion Recognition and Classification Using Hybrid Deep CNN and LSTM Models." International Journal of Research Publication and Reviews, 5(2), 105-113.
• "Graph Neural Network-Based Speech Emotion Recognition: A Fusion of Skip Graph Convolutional and Graph Attention Networks." Electronics, 13(3), 456-468.