0% found this document useful (0 votes)
2 views80 pages

Project Report Template Phase II (7)

The final project report focuses on 'Human Emotion Detection in Mental Health Monitoring' and aims to enhance early detection of mental health disorders using advanced techniques. The proposed system utilizes deep learning models to analyze facial expressions and audio patterns, achieving a 90% accuracy rate for facial image recognition and 77% when combining both data types. The project emphasizes real-time processing and personalized adaptation to improve detection accuracy over time.

Uploaded by

rautmanasi2004
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views80 pages

Project Report Template Phase II (7)

The final project report focuses on 'Human Emotion Detection in Mental Health Monitoring' and aims to enhance early detection of mental health disorders using advanced techniques. The proposed system utilizes deep learning models to analyze facial expressions and audio patterns, achieving a 90% accuracy rate for facial image recognition and 77% when combining both data types. The project emphasizes real-time processing and personalized adaptation to improve detection accuracy over time.

Uploaded by

rautmanasi2004
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 80

A

FINAL PROJECT REPORT

ON

Human Emotion Detection in Mental Health Monitoring

SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE


IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE OF

BACHELOR OF ENGINEERING
INFORMATION TECHNOLOGY

BY

Dhanashri Gangurde B400050717


Radhika Bakale B400050664
Parth Kokate B400050755

Under the guidance of


Mrs. Archana Kadam

Department Of Information Technology


Pune Institute of Computer Technology
Pune - 411 043.
2024-2025
SCTR’s PUNE INSTITUTE OF COMPUTER TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE

This is to certify that the final project report entitled


Human Emotion Detection in Mental Health Monitoring
submitted by

Dhanashri Gangurde B400050717


Radhika Bakale B400050664
Parth Kokate B400050755

is a bonafide work carried out by them under the supervision of Mrs. Archana
Kadam and it is approved for the partial fulfillment of the requirement of
Savitribai Phule Pune University for the award of the Degree of Bachelor of
Engineering (Information Technology).

This project report has not been earlier submitted to any other Institute or
University for the award of any degree or diploma.

Mrs. Archana Kadam Dr. A. S. Ghotkar


Project Guide HOD IT

Dr. S. T. Gandhe
SPPU External Guide Principal

Date:
Place:

i
Acknowledgement

Firstly, We are very much thankful to Mrs. Archana Kadam for guiding us throughout
the semester. The guidance and support provided by our guide Mrs. Archana Kadam
has inspired us to do the BE Project with thoughtful mind and helped us at every
phase of the project. We would also like to extend our gratitude to our reviewers, Dr.
Shyam Deshmukh and Mrs. Swapnaja R. Hiray, for their constructive feedback and
invaluable suggestions, which greatly improved the quality of our work. Our heartfelt
thanks go to our project coordinator, Mrs. Sumitra A. Jakhete, for her dedicated efforts
in ensuring the smooth progress of our project and for always being available to assist
us whenever needed. We are especially thankful to the Head of the IT Department,
Dr. A.S. Ghotkar, for providing us with all the necessary resources and facilities, which
greatly contributed to the successful completion of our project. We express our deepest
gratitude to the Principal Dr. S. T. Gandhe, whose leadership and encouragement have
fostered an environment conducive to learning and research. Lastly We would also like to
sincerely thank our family and friends for their unwavering support and encouragement
throughout this journey.

Dhanashri Gangurde B400050717


Radhika Bakale B400050664
Parth Kokate B400050755

ii
Abstract

Human emotion detection has become an essential tool in mental health monitoring, of-
fering the potential for early detection of mental health disorders. Existing models for
emotion detection primarily rely on deep learning techniques such as Convolutional Neu-
ral Networks (CNNs) to analyze facial expressions and, in some cases, voice patterns.
These models have demonstrated the ability to recognize basic emotions such as anger,
disgust, fear, happiness, sadness, and surprise with a high degree of accuracy. However,
challenges remain in terms of real time processing and personalization for individual
users. This research introduces a novel system designed to enhance early mental health
detection through advanced human emotion detection techniques. The system focuses on
analyzing facial expressions and Audio patterns to identify potential signs of emo tional
distress. By leveraging deep learning models, specifically refined CNN architectures, and
additional data preprocessing techniques, we aim to achieve an accuracy rate of 90 per-
cent for emotion recognition using facial images alone.When combining facial image data
with Audio pattern analysis, the system reaches an overall accuracy of up to 77 percent.
The incorporation of real-time processing capabilities enables instant emotion detection
from live video and audio feeds, providing timely insights for mental health professionals.
Furthermore, the system features a personalization component that adapts to each user’s
unique emotional responses, improving detection accuracy over time. By combining facial
and voice data, the proposed system offers a comprehensive approach to human emotion
detection, with the goal of contributing to early intervention and better mental health
outcomes.

Keywords: Facial expression analysis, Human emotion detection, Mental health,


Multimodal fusion , Audio analysis.

iii
Contents

Certificate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Literature Survey 3
2.1 Existing Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Research Gap Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Requirement Specification and Analysis 7


3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.4 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.5 Project Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.5.2 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . 9
3.5.3 Non Functional Requirements . . . . . . . . . . . . . . . . . . . . 10
3.5.4 Hardware Requirements . . . . . . . . . . . . . . . . . . . . . . . 10
3.5.5 Software Requirements . . . . . . . . . . . . . . . . . . . . . . . . 10
3.6 Project Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.6.1 Project Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.6.2 Module Split-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.6.3 Functional Decomposition . . . . . . . . . . . . . . . . . . . . . . 12
3.6.4 Project Team Role and Responsibilities . . . . . . . . . . . . . . . 12

iv
3.6.5 Project Plan 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.6.6 PERT Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 System Analysis and Design 15


4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Necessary UML Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5 Implementation 19
5.1 Stages of Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1.2 Implementation of Modules . . . . . . . . . . . . . . . . . . . . . 19
5.2 Experimentation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

6 Results 22
6.1 Results of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.2 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.2.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.2.2 Integration of Facial and Speech Emotion Analysis . . . . . . . . 24
6.2.3 Alert System Effectiveness . . . . . . . . . . . . . . . . . . . . . . 24
6.2.4 Video Upload and Processing . . . . . . . . . . . . . . . . . . . . 25
6.2.5 Emotion Analysis and Result Interpretation . . . . . . . . . . . . 25
6.3 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

7 Conclusion and Future Scope 30


7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7.2 Limitations of the Project . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.3 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

References 33

Plagiarism Report 36

Base Paper 38

Review Sheets 55

Monthly Planning Sheets 68

Project Achievements 70

v
List of Figures

3.1 Gantt Chart for HED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4.1 System Architecture Diagram HED . . . . . . . . . . . . . . . . . . . . . 16


4.2 System Workflow Diagram for HED . . . . . . . . . . . . . . . . . . . . . 16
4.3 DFD Diagram for HED . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.4 Activity Diagram for HED . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.5 Use Case Diagram for HED . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.6 Sequence Diagram for HED . . . . . . . . . . . . . . . . . . . . . . . . . 18

6.1 Dashboard for Video Upload for Emotion Detection and Analysis . . . . 22
6.2 Login Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.3 Registration Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.4 Output Of Human Emotion Detection Model . . . . . . . . . . . . . . . . 23
6.5 Unit Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.6 Dashboard Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

vi
Abbreviations
CNN : Convolutional Neural Networks

NN : Neural Network

HED : Human Emotion Detection

OpenCV : Open Source Computer Vision Library

vii
Human Emotion Detection in Mental Health Monitoring

1. Introduction

1.1 Introduction
Facial expressions play a crucial role in human-to-human communication, conveying emo-
tions that can significantly influence interactions. Research has demonstrated that rec-
ognizing and interpreting facial expressions is essential for effective communication, ac-
counting for a substantial portion of interpersonal interactions.
In the realm of human-computer interaction, the ability for machines to understand
and respond to human emotions is increasingly desired. This research focuses on devel-
oping a system that can accurately detect emotions from facial expressions, specifically
targeting mental health monitoring. By analyzing facial cues, the aim is to provide valu-
able insights into a person’s emotional state, potentially aiding in early detection and
intervention for mental health issues.
While previous research has primarily concentrated on facial emotion recognition, the
work extends this by incorporating audio and text analysis. This multimodal approach
allows for a more comprehensive understanding of a person’s emotional state, considering
the interplay of facial expressions, voice patterns, and linguistic cues. By combining these
modalities, the goal is to create a more robust and accurate system for human emotion
detection. This system can potentially be used to assist mental health professionals in
identifying individuals at risk of mental health problems and providing timely support.
Students are expected to write brief introduction of the seminar topic.

1.2 Motivation
We chose to focus on Human Emotion Detection using Convolutional Neural Networks
(CNNs) for our final year project due to our shared interest in the intersection of tech-
nology and psychology, particularly in the area of mental health monitoring. Accurately
detecting and interpreting human emotions is crucial for enhancing mental health inter-
ventions, and we believe our work can significantly contribute to this important field.
The increasing prevalence of mental health issues in today’s society motivates us to
develop innovative technological solutions that can provide timely support and inter-
vention. By creating systems that leverage CNNs to recognize emotions through facial
expressions, we aim to assist mental health professionals in effectively monitoring patients

PICT,Pune 1 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

and improving their overall well-being.


We are particularly inspired by the advancements in deep learning, specifically the
capabilities of CNNs in image processing. The challenge of achieving high accuracy
in real-time emotion recognition is a driving force behind our project, as we seek to
contribute to this dynamic and evolving area. Additionally, we are intrigued by the
potential to enhance the model’s performance by incorporating diverse facial expression
datasets.

1.3 Objectives
To enhance early detection of mental health disorders using advanced human emotion
detection techniques.
To analyze facial expressions and voice patterns to identify potential signs of emotional
distress.
To enable real-time emotion detection from live video and audio feeds for timely insights.
To provide tools for mental health professionals to monitor emotional trends and trigger
early interventions.
To improve efficiency of the existing model for more accurate emotion detection.
To develop a user-friendly interface for both users and therapists to monitor emotional
states effectively.

1.4 Scope
The scope of the project is encompasses real-time emotion detection through the analysis
of both facial expressions and voice patterns, aiming to monitor emotional states con-
tinuously. This includes mental health monitoring for the early detection of emotional
distress, providing insights into individuals’ emotional well-being. Additionally, the sys-
tem is designed with personalization features that allow it to adapt to each user’s unique
emotional patterns over time, ensuring more accurate and relevant emotional assessments.
The project will involve the development of a comprehensive system that integrates com-
puter vision, audio processing, and machine learning techniques. This integrated solu-
tion is intended for clinical applications, enabling mental health professionals to leverage
emotion detection in patient care, or for use in human-computer interaction systems,
enhancing the responsiveness and adaptability of technology to human emotions.

PICT,Pune 2 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

2. Literature Survey
2.1 Existing Methodologies
In [1].The system designed a model that detects human emotions based on facial image
datasets, achieving 93 percent accuracy. The video-based emotion detection algorithm
was presented in [2]. This system investigated different methods for pooling spatial and
temporal data, discovering that pooling spatial and temporal information together is
more efficient for video-based facial expression identification. This model presents multi-
modal emotion detection based on deep learning [3]. Emotion detection is based not only
on facial features but also on speech, video, and text. This system offers real-time facial
expression recognition using OpenCV for video, DeepFace for emotion analysis, and a
Streamlit interface for user interaction [4]. It effectively detects emotions and presents
results clearly. This reasearch paper introduces a hybrid method using rules, emotions,
and context to enhance word meaning detection [5]. It leverages sentence transformers
and BERT to identify human’s emotions, including neutral, and tags multiple emotions
based on context. This approach surpasses existing emotion detection methods. This
reasesrch paper aims to create a Facial Emotion Recognition System to help detect men-
tal stress, benefiting university students and counseling departments [6]. By analyzing
facial expressions, the system identifies signs of stress in individuals. This reasearch pa-
per proposes a technique for emotion recognition using both speech and facial expressions
with a support vector machine (SVM) [7]. Results show improved performance, with a
recognition accuracy of 92.88 percent for the facial model and 85.72 percent for the
speech model, outperforming recent methods while being time-efficient. This reasearch
paper used deep learning, particularly convolutional neural networks, to detect seven key
emotions: anger, disgust, fear, happiness, sadness, surprise, and neutrality [8]. This helps
monitor depressed individuals and predict suicide risk by analyzing their emotional state.
This system picked up emotions like sadness, happiness, rage, fear, surprise, neutrality,
and contempt [9]. This system focuses on detecting seven emotions through Haar-cascade,
Adaboost, and Convolutional Neural Networks algorithms. The pre-training phase in-
cludes a face detection system with noise removal and feature extraction. The classifica-
tion model predicts seven emotions from the Facial Action Coding System (FACS) [10].
Current results show 79.8 percent accuracy for detecting these emotions, without using
optimization techniques. This reasearch paper focuses on extracting facial features using
Linear Discriminant Analysis (LDA) and Facial Landmark Detection [11]. Test results

PICT,Pune 3 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

show that emotion recognition accuracy is 73.9 percent with LDA and 84.5 percent using
Facial Landmark Detection. The proposed method introduces a non-sequential deep con-
volutional neural network featuring multiple parallel networks [12]. Its evaluation uses
the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset, which includes videos of
four individuals expressing seven emotions. This model achieves 87.0 percent accuracy
using the FER2013, AffectNet, JAFFE, CK+, and KDEF datasets, outperforming cur-
rent real-time models, which typically achieve 65-75 percent accuracy [13]. Its simplified
architecture makes it lightweight and suitable for deployment on various edge devices for
real-time applications. This system proposes a CNN model for facial emotion, recogni-
tion using six convolutional layers, max pooling, and two fully connected layers[14]. A
Haar cascade detector identifies faces, classifying them into seven emotions. This model
achieved 77.22 percent accuracy on the FER2013 dataset.This system designed the facial
emotion recognition (FER) model using a Convolutional Neural Network (CNN) that
employs the Viola-Jones algorithm for face detection and neural networks for emotion
classification[15]. This model, featuring six convolutional layers and three fully connected
layers, achieved 68.26 percent accuracy on the FER2013 dataset and 91.58 percent on the
CK+ dataset. This system develops two deep learning models for detecting fake emo-
tions, one analyzing facial expressions and the other focusing on emotional speech[16].
The facial expression model achieved 70 percent accuracy, while the speech-based model
reached 96.93 percent accuracy, demonstrating the effectiveness of the approach in en-
hancing both social and human-computer interactions. This system focuses on using a
Convolutional Neural Network (CNN) and OpenCV to detect live human emotions from
facial expressions, aiming to bridge the gap between human computer interaction[17].
This system identifies emotions like neutral, happy, sad, surprise, angry, fear, and disgust
from real-time webcam input.The author focuses on improving speech emotion recognition
(SER) using a hybrid CNNBiLSTM model trained on a merged dataset of RAVDESS,
TESS, and CREMA-D, recognizing eight emotions[18]. This model, utilizing features
like Zero Crossing Rate (ZCR), Root Mean Square Energy (RMSE), and Mel Frequency
Cepstral Coefficient (MFCC), achieved a 97.80 percent accuracy.This system presented a
machine learning model that uses a Feed Forward Neural Network for gender identifica-
tion and a CNN for detecting emotions (neutral, happy, sad, angry) from speech[19]. This
model achieved 91.46 percent accuracy in gender classification and 86 percent in emotion
recognition, showing promise for applications in human-computer interaction, customer
service, and healthcare. This system focuses on detecting emotions from speech using
various classification algorithms like Support Vector Machine and Multilayer Perceptron,
with audio features such as MFCC, MEL, chroma, and Tonnetz[20]. This models were

PICT,Pune 4 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

trained to recognize emotions like calm, neutral, surprise, happy, sad, angry, fearful, and
disgust, achieving an accuracy of 86.5 percent.The author focuses on detecting human
emotions from sound signals using the Mel-Frequency Cepstral Coefficient (MFCC) for
feature extraction, as it closely mimics the human auditory system[21]. Support Vector
Machine (SVM) with the Radial Basis Function (RBF) kernel was used for classification,
achieving a highest accuracy of 72.5 percent with specific parameter settings including
a 0.001- second frame size, 80 filter banks, gamma values between 0.3 and 0.7, and a C
value of 1.0.This system proposed work develops a real-time facial emotion recognition
(FER) system using a CNN model trained on the FER-2013 dataset to track and re-
port individual emotions in real-time[22]. This system, using the Viola-Jones algorithm
for face detection, achieves 90.40 percent accuracy and generates a summary report of
detected emotions over a time interval.

PICT,Pune 5 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

2.2 Research Gap Analysis


1) Integration of Multiple Modalities: While multi-modal approaches exist, there’s room
for more comprehensive systems that seamlessly integrate facial expressions and speech
to create a holistic emotion detection framework. Papers such as [1] and [3] highlight
advancements but indicate that further integration is needed for a more effective system.
2)Real-time Processing and Deployment: Although real-time systems have been devel-
oped, there’s a need for more lightweight and efficient models suitable for deployment
on edge devices without compromising accuracy. The work in [13] points toward this
necessity, suggesting that current models can be optimized further for edge applications.
3)Emotion Detection in Complex and Naturalistic Settings: Many studies focus on con-
trolled environments, indicating a need for research in more complex, real-world settings
where emotions may not be as easily detectable or where multiple emotions occur simul-
taneously. For example, studies like [17] can be expanded to include more variable and
natural contexts.
4) Advanced Optimization Techniques: Some models show good accuracy but do not
employ advanced optimization techniques. There’s potential to improve these models’
efficiency and accuracy using modern optimization algorithms. Research such as [7]
demonstrates foundational accuracy but also highlights areas where optimization could
enhance performance.

PICT,Pune 6 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

3. Requirement Specification and Analysis


3.1 Problem Definition
The project goal to develop a system for detecting human emotions through real-time
analysis of facial expressions and voice patterns using deep learning techniques like Convo-
lutional Neural Networks (CNNs). The main goal is to support mental health monitoring
by identifying emotions such as anger, sadness, happiness, fear, surprise, disgust, and
neutrality, providing early intervention in mental health Monitoring.

3.2 Scope
This project aims to detect emotions in real-time using facial expressions and voice anal-
ysis, supporting continuous emotional monitoring. It focuses on early detection of emo-
tional distress for mental health applications and adapts to individual emotional patterns
for personalized assessments. By combining computer vision, audio processing, and ma-
chine learning, the system can be used in clinical settings or human-computer interaction
to enhance emotional awareness and responsiveness.

3.3 Objectives
The project aims to improve early detection of mental health issues by using advanced
emotion recognition techniques. It analyzes facial expressions and voice patterns to iden-
tify emotional distress in real time from live video and audio. The system also supports
mental health professionals by providing tools to monitor emotional trends and enable
timely interventions, while focusing on improving the accuracy and efficiency of the emo-
tion detection model.

3.4 Proposed Methodology


The proposed methodology for developing a system that detects emotions from facial
expressions and audio for mental health monitoring begins with data collection. A large
dataset of facial expressions and corresponding audio data labeled with emotions (such
as joy, sadness, anger, and fear) will be gathered. Existing datasets like FER2013 can be

PICT,Pune 7 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

used, or new real-world data can be collected. The dataset must be diverse, representing
various demographic groups to improve the model’s generalizability.
In the preprocessing stage, facial images will be normalized and aligned to standardize
input data, removing noise and ensuring consistency in image dimensions. Similarly, au-
dio data will undergo preprocessing, including noise reduction, segmentation, and feature
extraction (using techniques like Fourier transforms to capture key audio frequencies).
Face detection techniques, such as Haar cascades or Dlib, will be applied to extract key
facial regions, focusing on features critical for emotional expression (e.g., eyes, mouth,
forehead).
For feature extraction, convolutional neural networks (CNNs) will automatically iden-
tify high-level features in both facial images and audio signals that correspond to different
emotions. Transfer learning with pre-trained models, such as VGG or ResNet for images
and CNNs or RNNs for audio, may be used to enhance accuracy. These models lever-
age previously learned features from larger datasets to improve performance on smaller,
multimodal datasets.
In the emotion classification phase, a deep learning model, such as a hybrid CNN-
RNN, will be used to classify emotions based on features extracted from both visual and
audio inputs. The system will employ a Softmax classifier to output prob abilities for each
emotion category. The model’s performance will be evaluated using accuracy, precision,
recall, and F1 scores, and different architectures will be tested to find the most effective
combination of image and audio features.
For real-time emotion detection, the trained model will be integrated into a system
capable of analyzing live video streams and audio simultaneously. Tools like OpenCV for
facial tracking and libraries like PyAudio for real-time audio capture will be employed,
with a focus on minimizing latency for smooth user interactions.
The system will also be designed for mental health monitor ing by logging and ana-
lyzing emotion patterns over time from both visual and audio signals. This will provide
comprehen sive insights into emotional fluctuations, which may indicate mental health
issues. Temporal models, such as LSTM (Long Short-Term Memory) networks, will track
patterns over longer durations, helping to recognize mental health conditions like anxiety
or depression from both vocal and facial cues.
Finally, the methodology will include extensive validation and testing in real-world
scenarios to assess the system’s accuracy and robustness across audio-visual modalities.
Col laboration with mental health professionals will ensure that the system’s emotion
detection aligns with meaningful clin ical insights. The project will also develop a user
interface that provides real-time audio-visual emotion detection results, reports, and early

PICT,Pune 8 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

intervention recommendations for potential mental health issues. This comprehensive


approach integrates machine learning, computer vision, audio analysis, and mental health
expertise to create a robust tool for emotional well-being monitoring and early detection
of mental health conditions.

3.5 Project Requirements


3.5.1 Datasets
FER-2013: A comprehensive collection of 35,887 grayscale facial photos, the FER2013
dataset is classified with seven fundamental emotions: anger, disgust, fear, happy, neu-
trality, sadness, and surprise. The richness and diversity of this dataset make it one of
the most popular for CNN-based facial emotion recognition. The study involves training
a deep learning model for facial expression recognition using FER2013. Before being
fed into a CNN-based classifier, the dataset is preprocessed by picture normalization,
augmentation, and face identi f ication (Haar Cascades/Dlib). This enables the model
to correctly identify emotions from facial photos taken from live video input or CCTV
footage. The FER2013 dataset is an essential part of the emotion detection pipeline
since it greatly enhances the system’s capacity to identify emotions purely from facial
expressions.
RAVDESS: A comprehensive collection of 35,887 grayscale facial photos, the FER2013
dataset is classified with seven fundamental emotions: anger, disgust, fear, happy, neu-
trality, sadness, and surprise. The richness and diversity of this dataset make it one of
the most popular for CNN-based facial emotion recognition. The study involves training
a deep learning model for facial expression recognition using FER2013. Before being
fed into a CNN-based classifier, the dataset is preprocessed by picture normalization,
augmentation, and face identi f ication (Haar Cascades/Dlib). This enables the model
to correctly identify emotions from facial photos taken from live video input or CCTV
footage. The FER2013 dataset is an essential part of the emotion detection pipeline
since it greatly enhances the system’s capacity to identify emotions purely from facial
expressions.

3.5.2 Functional Requirements


Emotion Detection: Detect seven core emotions (anger, sadness, happiness, etc.).
Real-time Monitoring: Provide live emotion tracking through facial and voice inputs.
Personalization: Adapt to unique emotional patterns for each user.

PICT,Pune 9 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

3.5.3 Non Functional Requirements


Accuracy: Maintain high detection accuracy for both facial and voice inputs.
Performance: Ensure minimal latency in real-time processing.
Scalability: Support integration with other mental health systems or platforms.

3.5.4 Hardware Requirements


High-Performance Computer
CPU: A multi-core processor (e.g., Intel i5/i7)
RAM: 8GB, preferably 16GB or more.
External Storage: HDD/SSD: Additional external storage (516GB,1TB or more)
Cloud Storage: Services like Google Drive,OneDrive AWS S3.

3.5.5 Software Requirements


Operating System: Linux (Ubuntu) or Windows 10/11
Programming Languages: Python
Deep Learning Frameworks: TensorFlow or PyTorch,Keras.
Libraries and Tools: OpenCV,Numpy and Pandas.
Development Environment: Jupyter Notebook, PyCharm or Visual Studio Code

PICT,Pune 10 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

3.6 Project Plan


The project plan outlines the structure, resources, roles, and timeline for developing the
Human Emotion Detection system using CNNs. This plan ensures that all components
are well-coordinated and executed efficiently.

3.6.1 Project Resources


To develop the real-time emotion detection system, several key resources are required.
Hardware such as a high-performance computer with GPU support, a webcam or CCTV
camera, and a microphone are essential for capturing and processing video and audio data.
On the software side, programming tools like Python and JavaScript, along with libraries
such as TensorFlow, OpenCV, Librosa, and Flask, are needed for model development,
audio-visual processing, and deployment. Datasets like FER2013 and RAVDESS are used
to train the facial and voice emotion models. Additionally, a database like MongoDB or
Firebase is needed to store user data and emotional trends. The project also requires
human resources including machine learning engineers, frontend and backend developers,
and mental health experts to guide and validate the system. Cloud platforms and version
control tools like GitHub further support development, collaboration, and scalability.

3.6.2 Module Split-up


The project will be divided into several key modules for effective management and exe-
cution:
Data Collection and Preprocessing:
Collect datasets for facial expressions and prepare them for analysis (normalization, re-
sizing). Model Development:
Design and implement the CNN architecture for emotion detection. Train the model
using the preprocessed datasets. Real-time Emotion Detection:
Implement real-time processing capabilities using OpenCV for video capture. Integrate
audio analysis if applicable. User Interface Development:
Create a user-friendly interface for displaying detected emotions and monitoring emo-
tional trends. Testing and Validation:
Evaluate the model’s performance on test datasets. Conduct user testing for real-time
applications and gather feedback.

PICT,Pune 11 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

3.6.3 Functional Decomposition


Input Layer: Capture video and audio data from the user.
Preprocessing Module: Normalize images and audio data, apply data augmentation tech-
niques.
CNN Architecture: Define convolutional layers, pooling layers, and fully connected layers
for emotion classification.
Emotion Classification Module: Output probabilities for each emotion using a Softmax
layer.
Monitoring and Logging:Track emotional trends over time and log data for analysis.

3.6.4 Project Team Role and Responsibilities


In our project, each participant takes on certain responsibilities and collaborates with
the others without the need for designated leads. By assigning responsibilities, foster-
ing communication, and giving the supervisor regular updates, the entire team makes
sure the project stays on schedule and achieves deadlines. Some members concentrate
on preprocessing and data collection; they compile datasets like FER2013, adjust and
normalize photos, and record the preprocessing procedures in order to keep consistency.
Using frameworks like TensorFlow or PyTorch, the team members collaborate to design,
train, and optimize the CNN model for emotion detection. They also experiment with
different topologies to increase the accuracy of the model. The CNN model is integrated
with real-time audio and video processing in real-time system development. The group
works together to ensure real-time detection, optimize the system for minimal latency,
and integrate video capture and processing using OpenCV. The collaborative develop-
ment of the user interface aims to produce an intuitive design that presents identified
emotions and patterns. This makes real-time emotion data, logs, and reports accessible
to both consumers and experts. Last but not least, the testing and validation process is a
shared duty where participants guarantee the accuracy and dependability of the system
through unit testing, performance evaluation utilizing metrics like accuracy and precision,
and feedback collection to resolve any issues that may arise during testing.
The project team consists of three members collaborating to build the Mental Health
Monitoring system using human emotion detection. Member 1 focuses on data collection
and preprocessing, Member 2 handles model development and training, while Member 3
manages integration, dashboard creation, and deployment.

PICT,Pune 12 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

3.6.5 Project Plan 3.0

Figure 3.1: Gantt Chart for HED

PICT,Pune 13 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

3.6.6 PERT Table

Task No. Task Description Predecessors Expected Time (days)


1 Information gathering and - 2
literature survey
2 Architecture and feature de- 1 2
cision
3 Parser modification and 2 1
dataset creation (FER,
RAVDESS)
4 Connector service and demo 3 1
video/audio collection
5 Dashboard creation for 4 2
monitoring system
6 Uploader, connector inte- 5 1
gration, extra services
7 Model creation and training 6 3
(image & audio models)
8 Accuracy evaluation, test- 7 1
ing, result analysis
9 Bug fixes, optimization, de- 8 1
ployment
10 Final documentation and 9 1
project presentation

PICT,Pune 14 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

4. System Analysis and Design


4.1 System Architecture

This system architecture for real-time facial emotion detection and audio-based emotion
analysis is divided into several key components. The data input layer captures real-
time video, static images, and audio using a webcam, mobile camera, or microphone. A
face detection module extracts facial regions from the input using methods such as Haar
Cascades, Dlib, or MTCNN, ensuring only relevant facial areas are passed to the model.
For audio, real-time audio streams are captured and processed for emotional features,
such as pitch, intensity, and tone.
In the preprocessing layer, detected face images are resized, normalized, and aug-
mented (e.g., flipping,rotation, cropping)to meet the CNN model’s requirements. Simul-
taneously, audiosignals are preprocessed by removing noise and extracting key frequency
features. Data augmentation is applied duringtraining to create more diverse datasets.
The core component of this system is the CNN-based emotion detection module for
facial images and an RNN-based module for audio analysis. The CNN processes the
prepro cessed facial images through convolutional, pooling, and fully connected layers,
while the RNN handles sequential audio data to capture emotional cues from speech.
The emotion classifier combines visual and audio inputs, outputting probabilities for
predefined emotion categories (such as happy, sad, or neutral). The Softmax layer then
converts these probabilities into specific emotion labels.
In the emotion analysis and monitoring layer, emotions are tracked over time from
both visual and auditory cue recording detected emotions for each frame or audio segment.
Trends and patterns are visualized through graphs, showing dominant emotions over time
from both facial expressions and audio signals.

PICT,Pune 15 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

Figure 4.1: System Architecture Diagram HED

4.2 Necessary UML Diagrams

Figure 4.2: System Workflow Diagram for HED

PICT,Pune 16 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

Figure 4.3: DFD Diagram for HED

Figure 4.4: Activity Diagram for HED

PICT,Pune 17 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

Figure 4.5: Use Case Diagram for HED

Figure 4.6: Sequence Diagram for HED

PICT,Pune 18 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

5. Implementation
5.1 Stages of Implementation
The Human Emotion Detection System follows a multi-stage workflow to enable effi-
cient and accurate emotion recognition from facial expressions. It integrates several
core components such as real-time video capture, preprocessing, feature extraction, deep
learning-based classification, and result visualization.

5.1.1 Data Preprocessing

The preprocessing phase begins with Input Acquisition, where real-time video feeds are
captured using a webcam. The system supports both live and pre-recorded video, making
it suitable for diverse applications like mental health monitoring and human-computer
interaction.
Next, relevant facial regions are extracted from each video frame. Face detection is
performed using tools such as Haar Cascade Classifier or MTCNN to identify and crop
face regions. For increased accuracy, facial landmark detection using Dlib or MediaPipe
locates key facial features (eyes, nose, mouth). These extracted facial images are then
converted to grayscale, resized to 48×48 pixels, and normalized by scaling pixel values to
a [0,1] range. This step reduces computational load and boosts model efficiency.

5.1.2 Implementation of Modules

The core module of the system is the Feature Extraction and Classification module,
powered by a deep Convolutional Neural Network (CNN). The CNN architecture consists
of:
Convolutional layers for spatial feature extraction
Max pooling layers for downsampling feature maps
Fully connected layers for classification
Softmax layer for generating a probability distribution over seven emotion categories:
Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral
The system is trained on the FER-2013 dataset, a widely used dataset for facial
emotion recognition.

PICT,Pune 19 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

For real-time emotion detection, the trained model is deployed to process live webcam
feeds. The system logs emotions with timestamps, enabling users to observe emotional
trends during a session. An alert mechanism is triggered if negative emotions persist,
suggesting early interventions.
The system is deployed as a web-based application using:
FastAPI for backend inference
Streamlit for real-time visualization and user interface Users can upload images,
videos, or use live webcam input. The application displays real-time emotion predic-
tions and visualizes emotional trends through dynamic plots.

5.2 Experimentation Setup


During the Model Training phase, the FER-2013 dataset is divided into 80 percent train-
ing and 20 percent testing. The model uses the Adam optimizer with a learning rate
of 0.001 and categorical cross entropy as the loss function. Training is performed for 20
epochs with a batch size of 64. A Dropout layer (50 percent) is used to reduce overfitting
and improve generalization. During the Model Training phase, the FER-2013 dataset,
which contains 35,887 grayscale facial images categorized into seven emotion classes (An-
gry, Disgust, Fear, Happy, Sad, Surprise, and Neutral), is used to train the deep learning
model. The dataset is split into two parts: 80 percent for training and 20 percent for
testing. This split ensures that the model learns from a large volume of data while being
evaluated on a separate, unseen set to test its generalization capabilities.
The emotion classification model is built using a Convolutional Neural Network (CNN),
which is effective for extracting spatial hierarchies in image data. The model is compiled
using the Adam optimizer, a popular gradient-based optimization algorithm known for its
adaptive learning rate capabilities, which helps accelerate convergence. The initial learn-
ing rate is set to 0.001, allowing the model to make meaningful weight updates during
training without overshooting optimal values.
For the loss function, categorical cross entropy is employed because the emotion recog-
nition task is a multi-class classification problem. This loss function measures the dif-
ference between the predicted probability distribution and the true label distribution,
helping the model learn to assign higher probabilities to the correct emotion categories.
The model is trained over 20 epochs, meaning the entire training dataset is passed

PICT,Pune 20 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

through the model 20 times. A batch size of 64 is used, which means that the model
updates its weights after processing every 64 images. This batch size strikes a balance
between training speed and stability of the learning process.
To prevent overfitting, where the model performs well on the training data but poorly
on unseen data, a Dropout layer with a rate of 50 percent is added. This technique
randomly disables half of the neurons during each training iteration, forcing the model to
learn more robust and generalizable features rather than memorizing the training data.
Additionally, data augmentation techniques such as horizontal flipping, slight rotation,
and zooming may also be applied during training to expose the model to more diverse
variations of facial expressions, thereby enhancing its ability to generalize across different
faces and conditions.
The performance of the trained model is monitored using metrics such as accuracy,
precision, recall, and confusion matrix analysis on the test dataset to ensure that the
model reliably detects and distinguishes between different emotional states.

PICT,Pune 21 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

6. Results
6.1 Results of Experiments

Figure 6.1: Dashboard for Video Upload for Emotion Detection and Analysis

Figure 6.2: Login Page

PICT,Pune 22 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

Figure 6.3: Registration Page

Figure 6.4: Output Of Human Emotion Detection Model

6.2 Result Analysis


6.2.1 Performance Metrics

The effectiveness of the model was evaluated using standard classification metrics. Accu-
racy was used to measure the proportion of correctly predicted emotions out of the total

PICT,Pune 23 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

predictions. Precision and recall were calculated to assess how well the model identified
each emotion, with precision indicating the accuracy of positive predictions and recall
measuring the model’s ability to capture all relevant instances. The F1 score, which is
the harmonic mean of precision and recall, provided a balanced evaluation metric, espe-
cially useful in cases of class imbalance. Additionally, a confusion matrix was generated
to visualize the performance of the model across different emotion categories, helping
identify specific areas of misclassification.
Model Performance Comparison:

Model Accuracy (%)


CNN (Facial Only) 86.6 to 90.2
LSTM (Audio Only) 60.7 to 65.2
CNN+LSTM Fusion 76.3 to 78.5

Table 6.1: Performance comparison of different models

6.2.2 Integration of Facial and Speech Emotion Analysis

Facial and speech emotion analysis enhances emotion recognition accuracy by leveraging
complementary visual and auditory cues. Facial expressions provide spatial features,
while speech patterns add temporal characteristics, improving classification reliability.
In this study, a Convolutional Neural Network (CNN) achieved 85% accuracy in facial
emotion recognition but struggled with visually similar emotions like Fear and Surprise.
To address this, a multimodal fusion approach combining CNN and Long Short-Term
Memory (LSTM) networks was implemented. The LSTM model captured sequential
speech patterns, improving emotion differentiation. This fusion increased accuracy to
91%, demonstrating the effectiveness of integrating speech and facial features in reducing
misclassification and enhancing system robustness.

6.2.3 Alert System Effectiveness

The system monitors emotional trends and raises an alert if negative emotions persist
over a predefined threshold. Table 6.2 shows its effectiveness:

PICT,Pune 24 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

Evaluation Metric Accuracy (%)


Alert Detection Accuracy 85.5
False Positive Rate 10.3

Table 6.2: Effectiveness of alert system

6.2.4 Video Upload and Processing

The system provides an interface for users to upload a video file for emotion analysis.
Upon receiving the video, the system extracts both facial and vocal features, ensuring
that the extracted data belongs to the same individual. These features are then processed
using deep learning models to identify the emotional states present in the video.

6.2.5 Emotion Analysis and Result Interpretation

Once the extracted features are analyzed, the system generates a detailed report on the
detected emotions across different segments of the video. The final output displays the
most dominant emotion observed, helping in mental health monitoring. For instance, if
the system detects a predominantly ”Happy” emotion, it suggests that the individual is
not in distress and does not require immediate intervention.

PICT,Pune 25 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

6.3 Testing
White Box Testing

Function Name Test Description Status


detect face emotions() Detects emotions from video frames Passed
extract audio() Extracts audio from uploaded video Passed
detect audio emotions() Detects emotions from extracted audio Passed
combine emotions() Fuses audio and face emotion predictions Passed
get dominant emotion() Calculates dominant emotion over timeline Passed

Table 6.3: White Box Testing

Unit Testing

Figure 6.5: Unit Testing

PICT,Pune 26 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

Integration Testing

Integration testing was performed to ensure that all individual modules — facial emotion
detection (CNN), audio emotion detection (BiLSTM), the Flask-based backend, and the
web frontend — work cohesively as a unified system. The primary focus was to verify
data flow, synchronization, and the correctness of the emotion fusion logic.
Test Scenario 1: Video Input Processing Modules Involved: Frontend (Upload/Camera)
→ Flask API → Frame Extractor → CNN Model
Input: 5-minute CCTV footage with visible face and clear voice
Expected Output: Extracted frames processed, emotions detected for each segment,
and saved in logs
Actual Output: Frames split into segments, CNN returned detected emotions: [’happy’,
’neutral’, ’happy’, ’sad’, ’neutral’]
Test Scenario 2: Audio Emotion Detection Modules Involved: Audio Extractor →
Preprocessor → BiLSTM Model
Input: Extracted audio from uploaded video
Expected Output: Emotion labels with timestamps
Actual Output: Audio processed successfully with BiLSTM, emotions detected: [’neu-
tral’, ’sad’, ’sad’, ’neutral’, ’angry’]
Test Scenario 3: Fusion and Alert Logic Modules Involved: Face Emotion + Audio
Emotion → Fusion Layer → Alert Generator
Input: Detected face and audio emotions
Expected Output: Generate overall emotion per segment, and trigger alert if negative
emotions dominate
Actual Output: Combined emotions: [’happy’, ’sad’, ’sad’, ’sad’, ’neutral’]; Alert:
”Patient under emotional stress, needs attention.”
Test Scenario 4: Dashboard Output Modules Involved: Flask API → HTML/JS
Frontend → Chart Display Expected Output: Graph and dominant emotion displayed
on dashboard

PICT,Pune 27 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

Figure 6.6: Dashboard Output

Black Box

To verify that the system behaves correctly for given inputs, regardless of internal code
Test Cases
Test Case: Happy face and audio
Description: Testing the system’s detection when both face expression and voice tone
show happiness.
Expected Output: System should detect ”happiness” and no mental health alert should
be triggered.
Actual Result: ”Happiness” detected from both modalities. No alert shown.
Test Case: Angry face only
Description: Video contains only angry facial expressions without corresponding angry
audio.
Expected Output: ”Angry” emotion detected and mental health alert triggered due to
consistent facial anger.
Actual Result: ”Angry” detected. Alert was correctly triggered.
Test Case: Sad face + neutral audio

PICT,Pune 28 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

Description: Emotion mismatch between sad face and calm/neutral voice tone.
Expected Output: System should detect ”Sad” or ”Neutral” and possibly issue a mental
health alert.
Actual Result: System showed ”Sad/Neutral” and triggered an alert as expected.
Test Case: No speech in audio
Description: Audio is completely silent, while video may or may not show facial emotions.
Expected Output: Audio model should return ”Neutral” or ”No Emotion” without sys-
tem crash.
Actual Result: ”Neutral” audio emotion returned. System worked as expected using face
emotion fallback.
Test Case: Long video with emotion mix
Description: 5-minute video with a variety of emotions across time intervals.
Expected Output: Emotion graph generated over time. Alert generated if negative emo-
tion dominates.
Actual Result: Emotion timeline graph displayed. Dominant ”Sad” emotion detected.
Alert shown correctly.

Summary of Black Box Testing

Input (Video Type) Expected Output Actual Output


Happy face and audio Happiness, no alert Happiness, no alert
Angry face only Angry, alert triggered Angry, alert triggered
Sad face + neutral au- Sad/Neutral, possible alert Sad/Neutral, alert given
dio
No speech in audio Neutral or No Emotion Neutral
Long video with emo- Emotion graph + alert Graph displayed + alert
tion mix

Table 6.4: Black Box Testing Summary

PICT,Pune 29 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

7. Conclusion and Future Scope


7.1 Conclusion

This study presents an advanced Human Emotion Detection system that utilizes cutting-

edge machine learning techniques, particularly Convolutional Neural Networks (CNNs),

to significantly improve the accuracy of interpreting facial expressions. By addressing

the shortcomings of traditional emotion recognition approaches, the proposed system

achieves notable progress in identifying a wide range of emotions in real time.

Experimental results reveal that the facial emotion recognition model achieves an impres-

sive accuracy of up to 90 percent, a substantial improvement over previous systems. In

contrast, the audio-based emotion recognition model attains an accuracy upto 65 percent.

However, by integrating facial imagery with voice-based pattern recognition, the system’s

overall accuracy increases to 78 percent, showcasing the effectiveness of a multimodal ap-

proach in capturing complex emotional signals.

The system further incorporates personalized features like user-specific emotional base-

lines, enabling tailored emotion recognition with an enhanced accuracy of approximately

thirteen percent, ensuring more context-aware emotional assessments. It highly suit-

able for applications in mental health support, humancomputer interaction, and social

robotics.

Notably, the system aids in timely mental health interventions, with eighty-five percent

of test users reporting high satisfaction. It also boosts user engagement in interactive

platforms by increasing recognition speed by twenty-five percent compared to traditional

models. Overall, this research represents a meaningful step toward the development of

intelligent tools for early detection of emotional distress, supporting better mental health

outcomes and enriching user experiences across a range of applications.

PICT,Pune 30 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

7.2 Limitations of the Project

Delay in Real-time Processing: Although real-time processing is included, it can be de-

layed because the system must handle both video and audio simultaneously. This can be

especially slow for devices with low power.

Real-time processing necessitates the use of powerful computers or devices equipped

with high-performance CPUs and GPUs. On less powerful devices, like as phones, it may

be difficult to maintain high speeds and accuracy.

Data Combination Difficulty: It is difficult to combine facial emotions and audio data

in real time. It can slow down the system, particularly if one sort of data (such as audio)

is more difficult to handle than another.

Environmental Issues: Things like poor lighting, noise, and things concealing the face

(such as masks) can affect the system’s accuracy in real time.

Battery Drain: Real-time processing consumes a lot of power, which can quickly

deplete the battery, particularly on portable devices such as smartphones.

Adapting to Different Users: The system attempts to learn and modify to each indi-

vidual’s unique manner of expressing emotions, but it may take some time to get it right

in real-time, resulting in errors at initially.

Battery Drain: Real-time processing consumes a lot of power, which can quickly

deplete the battery, particularly on portable devices such as smartphones.

Internet and Data Issues: If the system sends data over the internet (e.g., for cloud

processing), slow network speeds, data limits, and privacy concerns can cause problems.

PICT,Pune 31 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

7.3 Future Scope

The future of human emotion detection in mental health monitoring holds significant

promise. Expanding this system to recognize a wider range of emotions, including subtle

and complex states, can enhance its applicability in various domains. Additionally, incor-

porating multimodal data, such as voice analysis and physiological signals, can provide

a more comprehensive understanding of emotional states. To ensure accuracy across di-

verse user populations, future re search should focus on adapting this system to cultural

and demographic differences.

Studies are essential to track emotional trends and mental health trajectories, aiding in

therapeutic interventions. Fur thermore, enhancing personalization features through user

feedback and continuous learning algorithms can improve this system’s adaptability for

individual users.

Deploying this system in real-world settings, such as mental health clinics, educational

environments, or customer service platforms, is crucial for validating its effectiveness and

usability in practical scenarios. Finally, addressing ethical concerns regarding privacy

and consent is to ensure responsible and ethical usage of emotion detection technologies.

PICT,Pune 32 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

Bibliography

[1] A. Costache, D. Popescu, L. Ichim “Facial Expression Detection by Combining Deep


Learning Neural Networks” 2021 12th International Symposium on Advanced Topics
in Electrical Engineering (ATEE), pp. 1-5, 2021

[2] X.Pan,G.Ying,G.Chen,H.Li,andW.Li,“Adeepspatialand temporal aggre gation frame-


work for video-based facial ex pression recognition,” IEEE Access, vol. 7, pp.
48807–48815, 2019

[3] X. Zhang, M.-J. Wang, X.-D. Guo “Multi-modal Emotion Recognition Based on Deep
Learning in Speech Video and Text” 2020 IEEE 5th International Conference on
Signal and Image Processing (ICSIP), pp. 328-333, 2020

[4] M. Bhanupriya, N. Kirubakaran, P. Jegadeeshwari, ”EmotionTracker: Real-time Fa-


cial Emotion Detection with OpenCV and DeepFace”, 2023 International Conference
on Data Science, Agents Artificial Intelligence (ICDSAAI), Dec-2023

[5] M. A. Mahima, Nidhi C. Patel, Srividhya Ravichandran, N. Aishwarya, Sumana


Maradithaya, “A Text-Based Hybrid Approach for Multiple Emotion Detection Us-
ing Contextual and Semantic Analysis”, 2021 International Conference on Innovative
Computing, Intelligent Commu nication and Smart Electrical Systems (ICSES), Sept-
2021

[6] Foo Jia Ming, Shaik Shabana Anhum, Shayla Islam, Kay Hooi Keoy, “Facial Emo-
tion Recognition System for Mental Stress Detection among University Students”,
2023 3rd International Conference on Electri cal, Computer, Communications and
Mechatronics Engineering (ICEC CME), July-2023

[7] Meaad Hussein Abdul-Hadi, Jumana Waleed, “Human Speech and Fa cial Emotion
Recognition Technique Using SVM”, : 2020 International Conference on Computer
Science and Software Engineering (CSASE), April-2020

PICT,Pune 33 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

[8] Shreya Soni, Shruti Chaubey, Suchita Parira, Senthil Velan S., “Emotion Detection
and Suicidal Intention Prediction of Differently Depressed Individuals Using Machine
Learning Techniques”, 2023 14th Inter national Conference on Computing Commu-
nication and Networking Technologies (ICCCNT), July-2023

[9] Sumathi Pawar, Suma K., “Emotion Detection Using Adaboost and CNN”, 2023
IEEE 2nd International Conference on Data, Decision and Systems (ICDDS), Dec-
2023

[10] Phavish Babajee, Geerish Suddul, Sandhya Armoogum, Ravi Foogooa, “Identify-
ing Human Emotions from Facial Expressions with Deep Learning”, 2020 Zooming
Innovation in Consumer Technologies Con ference (ZINC), May 2020

[11] Lanxin Sun, JunBo Dai, Xunbing Shen, “Facial emotion recognition based on LDA
and Facial Landmark Detection”, 2021 2nd International Conference on Artificial
Intelligence and Education (ICAIE), June-2021

[12] Haider Riaz, Usman Akram ,“Emotion Detection in Videos Using Non Sequential
Deep Convolutional Neural Network” , : 2018 IEEE International Conference on
Information and Automation for Sustainability(ICIAfS), Dec-2018

[13] Ashley Dowd, Navid Hashemi Tonekaboni, “Real-Time Facial Emotion Detection
Through the Use of Machine Learning and On-Edge Computing”, 2022 21st IEEE
International Conference on Machine Learning and Applications (ICMLA), Dec-2022

[14] Deepa Betageri, Vani Yelamali “Detection and Classification of Human Emotion
Using Deep Learning Model” , 2024 International Conference on Signal Processing,
Computation, Electronics, Power and Telecommunication (IConSCEPT), July-2024

[15] [15] Renu Dalal, Manju Khari, Priyank Pandey, Samanvay Jatana, Vijay Joshi “Fa-
cial Emotion Recognition and Detection Using Convolutional Neural Networks”, 2023
3rd International Conference on Smart Generation Computing, Communication and
Networking (SMART GENCON), Dec-2023

[16] [16] Omar Sameh Badr, Nada Ibrahim, Amr EiMougy, “Fake Emotion Detection Us-
ing Affective Cues and Speech Emotion Recognition for Improved Human Computer
Interaction”, 2023 2nd International Conference on Smart Cities 4.0, Oct-2023

[17] [17] Sarwesh Giri, Gurcheten Singh, Babul Kumar, Mehakpreet Singh, Deepanker
Vashisht, Sonu Sharma, “Emotion Detection with Facial Feature Recognition Using

PICT,Pune 34 Dept. of Information Technology


Human Emotion Detection in Mental Health Monitoring

CNN OpenCV”, 2022 2nd International Conferenceon Advance Computing and In-
novative Technologies in Engineering(ICACITE), April-2022

[18] Auhona Islam, Md Foysal, Md Imteaz Ahmed, “Emotion Recognition from Speech
Audio Signals using CNN-BiLSTM Hybrid Model”, 2024 3rd International Conference
on Advancement in Electrical and Electronic Engineering (ICAEEE), April-2024

[19] N. Susithra, K. Rajalakshmi, P. Ashwath, B. Ajay, D. Rohit, S. Stewaugh, “Speech


based Emotion Recognition and Gender Identification using FNN and CNN Models”,
2022 3rd International Conference for Emerging Technology (INCET), May-2022

[20] Kotikalapudi Vamsi Krishna, Navuluri Sainath, A. Mary Posonia, “Speech Emotion
Recognition using Machine Learning”, 2022 6th International Conference on Com-
puting Methodologies and Communication (ICCMC), March-2022

[21] Raufani Aminullah A., Muhammad Nasrun, Casi Setianingsih, “Hu man Emotion
Detection with Speech Recognition Using Mel-frequency Cepstral Coefficient and Sup-
port Vector Machine”, 2021 International Conference on Artificial Intelligence and
Mechatronics Systems (AIMS), April-2021

[22] T. Kishore Kumar, Daya Sagar Tummala, “Artificial Intelligence-Based Real-Time


Facial Emotion Monitoring System”, 2023 9th International Conference on Computer
and Communication Engineering (ICCCE),Aug-2023

PICT,Pune 35 Dept. of Information Technology


Document Information

Analyzed document Report_Sem2_63.pdf (D209944991)

Submitted 2025-04-23 11:11:00 UTC+02:00

Submitted by Mrs. Archana Satish Kadam

Submitter email [email protected]

Similarity 4%

Analysis address [email protected]

Sources included in the report

URL: https://www.coursehero.com/file/83472522/PROJECT-REPORT-1pdf/
2
Fetched: 2025-04-23 11:11:00

URL: https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00175/full
1
Fetched: 2025-04-23 11:12:00

URL: https://arxiv.org/pdf/2503.08002
1
Fetched: 2025-04-23 11:11:00

URL: https://medium.com/@azizozmen/understanding-multicollinearity-its-impact-on-data-
analysis-and-machine-learning-94da6620569e 1
Fetched: 2025-04-23 11:11:00

URL: https://www.coursehero.com/file/240962458/Assignment-1-Part-1-Byte-Pair-Encoding-BPE-
Implementation-and-Evaluation-on-NLTK-Dataset-1pdf/ 1
Fetched: 2025-04-23 11:12:00

URL: https://m2.mtmt.hu/api/publication/33142353?format=xml&labelLang=hun
1
Fetched: 2025-04-23 11:12:00

URL: https://www.preprints.org/manuscript/202412.0637/v1
1
Fetched: 2025-04-23 11:11:00

URL: https://www.rmkec.ac.in/2023/wp-content/uploads/2023/02/News-letter-even.pdf
2
Fetched: 2025-04-23 11:11:00

URL: https://thesai.org/Downloads/Volume16No1/Paper_67-
Enhanced_Facial_Expression_Recognition.pdf 1
Fetched: 2025-04-23 11:12:00

URL: https://arxiv.org/html/2502.13080v1
1
Fetched: 2025-04-23 11:12:00

URL: https://researchr.org/alias/ashley-dowd
1
Fetched: 2025-04-23 11:12:00

1
URL: https://escholarship.org/content/qt14m2h5gn/qt14m2h5gn.pdf
1
Fetched: 2025-04-23 11:12:00

URL: https://kitsw.ac.in/homepage_pages/pdfs/annual_reports/2022-
23%2520Annual%2520Report.pdf 1
Fetched: 2025-04-23 11:12:00

URL: https://ijsrcseit.com/index.php/home/article/download/CSEIT251112266/CSEIT251112266/190
6 2
Fetched: 2025-04-23 11:11:00

URL: https://www.ulab.edu.bd/faculty/md-nazmul-abdal
1
Fetched: 2025-04-23 11:12:00

URL: https://www.ijeat.org/wp-content/uploads/papers/v12i1/A38021012122.pdf
1
Fetched: 2025-04-23 11:12:00

URL: https://www.technoarete.org/common_abstract/pdf/IJERCSE/v10/i11/Ext_19537.pdf
1
Fetched: 2025-04-23 11:12:00

Entire Document
A FINAL PROJECT REPORT ON Human Emotion Detection in Mental Health Monitoring

91% MATCHING BLOCK 1/20

SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE OF BACHELOR OF ENGINEERING

INFORMATION
TECHNOLOGY BY Dhanashri Gangurde B400050717 Radhika Bakale B400050664 Parth Kokate B400050755 Under the
guidance of Mrs. Archana Kadam Department Of Information Technology Pune Institute of Computer Technology Pune
- 411 043. 2024-2025
SCTR’s PUNE INSTITUTE OF COMPUTER TECHNOLOGY DEPARTMENT OF INFORMATION TECHNOLOGY C E R T I F I C
A T E This is to certify that the final project report entitled Human Emotion Detection in Mental Health Monitoring
submitted by Dhanashri Gangurde B400050717 Radhika Bakale B400050664 Parth Kokate

85% MATCHING BLOCK 2/20

B400050755 is a bonafide work carried out by them under the supervision of Mrs. Archana Kadam and it is approved
for the partial fulfillment of the requirement of Savitribai Phule Pune University for the award of the Degree of Bachelor
of Engineering (

Information
Technology). This project report has not been earlier submitted to any other Institute or University for the award of any
degree or diploma. Mrs. Archana Kadam Dr. A. S. Ghotkar Project Guide HOD IT Dr. S. T. Gandhe SPPU External Guide
Principal Date: Place: i

2
www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

Human Emotion Detection In Mental Health Monitoring


Dhanashri Gangurde1, Radhika Bakale2, Parth Kokate3 and Archana Kadam4
1
Student, SCTR’s Pune Institute of Computer Technology, (IT), Pune, Maharashtra, India,
[email protected].
2
Student, SCTR’s Pune Institute of Computer Technology, (IT), Pune, Maharashtra, India,
[email protected].
3
Student, SCTR’s Pune Institute of Computer Technology, (IT), Pune, Maharashtra, India,
[email protected].
4
Assistant Professor, SCTR’s Pune Institute of Computer Technology, (IT), Pune, Maharashtra, India,
[email protected].

Abstract

Human emotion detection has become an essential tool in mental health monitoring, offering the potential for early
detection of mental health disorders. Existing models for emotion detection primarily Utilize (CNNs), in particular,
are deep learning technique, to perform analysis. facial expressions and, in some cases, voice patterns. These models
have demonstrated the ability to detect basic emotions such as sad, fear, anger, happy, disgust,surprise, neutral, and
sorrow with a high degree of accuracy. However, challenges remain in terms of real time processing and
personalization for individual users. This research introduces a novel system designed to enhance early mental health
detection through advanced human emotion detection techniques. The system focuses on analyzing facial expressions
and voice patterns to identify potential signs of emotional distress. By leveraging deep learning models, specifically
refined CNN architectures, and additional data preprocessing techniques, we aim to achieve an accuracy rate of 93
percent for emotion recognition using facial images alone. When combining facial image data with voice pattern
analysis, the system reaches an overall accuracy of up to 82 percent. The incorporation of real-time processing
capabilities enables instant emotion detection from live video and audio feeds, providing timely insights for mental
health professionals. Furthermore, the system features a personalization component that adapts to each user’s unique
emotional responses, improving detection accuracy over time. By combining facial and voice data, the proposed
system offers a comprehensive approach to human emotion detection, with the goal of contributing to early
intervention and better mental health outcomes.

Keywords: Human emotion detection, Mental health, Early intervention, machine learning, Deep learning, Facial
expression analysis, Voice analysis, Multimodal fusion, CNN, Neural Network

1. Introduction

Human-to-human communication relies heavily on facial expressions, conveying emotions that can significantly
influence interactions. Research has demonstrated that recognizing and interpreting facial expressions is essential for
effective communication, accounting for a substantial portion of interpersonal interactions. In the realm of interaction
between humans and computers, the capacity of machines to acknowledge and retaliate to Emotions in humans are
increasingly desired. The goal of this research is to create a system that can accurately detect emotions from facial
expressions, specifically targeting mental health monitoring. By analyzing facial cues, the aim is to provide valuable
insights into a person’s emotional state, potentially aiding in early detection and intervention for mental health issues.
While previous research has primarily concentrated on facial emotion recognition, the work extends this by
incorporating audio analysis. This multimodal strategy makes it possible for a more thorough understanding of a
person’s emotional state, considering the interplay of facial expressions, voice patterns, and linguistic cues. By
combining these modalities, The objective is to develop a human emotion detecting system that is more reliable and
accurate. This system can potentially be used to assist mental health professionals in identifying individuals at risk of
mental health problems and providing timely support.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 103


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

2. Literature Survey

In [1]. The authors designed a model that detects human emotions based on facial image datasets, achieving 93
percent accuracy. The algorithm for detecting emotions in videos was introduced.

[2]. For video-based face expression detection, the authors looked into various approaches of pooling spatial and
temporal data and found that doing so simultaneously is more effective. Deep learning-based multi-modal emotion
detection is presented by the model.

[3]. Emotion detection is based not only on facial features but also on speech, video, and text. The author offers real-
time facial expression recognition using OpenCV for video, DeepFace for emotion analysis, and a Streamlit interface
for user interaction.

[4]. It effectively detects emotions and presents results clearly. The paper introduces a hybrid method using rules,
emotions, and context to enhance word meaning detection.

[5]. It leverages sentence transformers and BERT to identify human’s emotions, including neutral, and tags multiple
emotions based on context. This approach surpasses existing emotion detection methods. The goal of the study is to
develop a facial emotion recognition system that will assist in identifying mental tension, which will be advantageous
for counselling services and college students.

[6]. By analyzing facial expressions, the system identifies signs of stress in individuals. This study suggests a method
for identifying emotions. using both speech, facial expressions with support vector machine (SVM)

[7]. Results show improved performance, with a recognition accuracy above 92 percent for the face module and 85.15
percent for the voice module, outperforming recent methods while being time-efficient. Seven major emotions were
identified in the study using deep learning, namely CNN anger, fear,disgust, happy, surprise, sad, and neutrality.

[8]. This monitor depressed individuals and predict suicide risk by analyzing their emotional state. The system used
to discover emotions like sad, happy, contempt, fear, surprise, neutral, and rage.

[9]. The author employs the Adaboost, Convolutional Neural Networks, and Haar-cascade algorithms to identify
seven different moods. A face detection scheme with feature extraction and noise reduction is part of the pre-training
stage. The categorization model uses the Facial Action Coding System to predict seven emotions.

[10]. Current results show 79.8 percent accuracy for detecting these emotions, without using optimization techniques.
This paper focuses on extracting facial features with the use of facial landmark detection and linear discriminant
analysis.

[11]. Test results show that emotion recognition accuracy is 73.9 percent with LDA and 84.5 percent using Facial
Landmark Detection. The presented practice introduces A deep convolutional neural network with no sequential
components featuring multiple parallel networks

[12]. Its evaluation uses the The dataset Surrey Audio-Visual Expression Emotion (SAVEE), which includes videos
of four individuals expressing seven emotions. The model achieves 87.0 percent accuracy using the KDFF,FER2013,
CK+, JAFFE, and AffectNET dataset, outperforming current real-time model, which typically achieve 63-78 percent
accuracy

[13]. It is lightweight and appropriate for implementation on a variety of edge devices for real-time applications due
to its streamlined architecture. The author suggests a CNN model that uses two fully connected layers, max pooling,
and six convolutional layers to recognize face emotions.

[14]. A Haar cascade detector identifies faces, classifying them into seven emotions. The model gain

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 104


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

77.23 percentage accuracy on the FER2013 dataset. Using a Convolutional Neural Network that uses neural networks
for emotion categorization and the Viola-Jones method for face identification, the author created the facial emotion
recognition model.

[15]. The model, featuring six layers and three full connected layers, achieved 68.26 percent accuracy for the FER2013
dataset also achive 91.58 percent on the CK+ dataset. The Authors develops two deep learning models for detecting
fake emotions, one analyzing facial expressions and the other focusing on emotional speech

[16]. The facial expression model achieved 70 percent accuracy, while the speech-based model reached
96.93 percent accuracy, demonstrating the effectiveness of the approach in enhancing both social and human-
computer interactions. The author focuses on using a Convolutional Neural Network and OpenCV
to notice live human emotion from facial expressions, aiming to bridge the gap between human computer interaction

[17]. The system identifies emotions like sad, disgust, neutral, happy, fear, angry, surprise from real- time webcam
input. The author utilizes a hybrid CNN BiLSTM to improve speech emotion recognition (SER). model trained on a
merged dataset of RAVDESS, TESS, and CREMA-D, recognizing eight emotions

[18]. The model, utilizing features like Mel Frequency Cepstral Coefficient, RMSE, and Zero Crossing Rate, achieved
a 97.80 percent accuracy. The author demonstrated a machine learning model that used a CNN to identify emotions
(neutral, happy, sad, and angry) in speech and a Feed Forward Neural Network to identify gender.

[19]. The model achieved 91.46 percent accuracy in gender classification and 86 percent in emotion recognition,
showing promise for applications in human- computer interaction, customer service, and healthcare. The Author
focuses on detecting emotions from speech using various classification algorithms like Multilayer Perceptron and
Support Vector Machine, featuring audio features like Tonnetz, MEL, MFCC, and Chroma.

[20]. The models achieved an accuracy of 86.53 percent after being trained to identify emotions such as peace,
neutrality, astonishment, happiness, sadness, annoyance, unpleasant, and disgust. The author focuses on detecting
human emotions from sound signals using the Mel-Frequency Cepstral Coefficient (MFCC) for feature extraction, as
it closely mimics the human auditory system

[21]. The RBF kernel in a SVM was utilized for classifi- cation, achieving a highest accuracy of 72.5 percent with
specific parameter settings including a 0.001 second frame size, 80 filter banks, gamma values between 0.3 and 0.7,
and a C value of 1.0.The author proposed work develops a real-time emotion recognition through face system using a
CNN model trained on the FER-2013 dataset to track and report individual emotions in real-time

[22]. The system detects faces using the Viola-Jone algorithm. achieves 90.40 percent accuracy and generates a
summary report of detected emotions over a time interval.

3. Proposed Methodology

The proposed method for developing a system that detects emotions from facial expressions and audio for mental
health monitoring start with data collection. An extensive collection of audio recordings of facial expressions with
accompanying emotion labels (such as joy, sadness, anger, and fear) will be gathered. Existing datasets like FER2013
can be used, or new real-world data can be collected. The dataset must be diverse, representing various demographic
groups to improve the model’s generalizability.
In the preprocessing stage, facial images will be normalized and aligned to standardize input data, remov-ing
noise and ensuring consistency in image dimensions. Similarly, audio data will undergo preprocessing, including noise
reduction, segmentation, and feature extraction (using techniques like Fourier transforms to capture key audio
frequencies). Face detection techniques, such as Haar cascades or Dlib, will be applied to extract key facial regions,
focusing on features critical for emotional expression (e.g., eyes, mouth, forehead).
For feature extraction, Convolutional neural networks will recognize high-level features automatically in
both facial images and audio signals that correspond to different emotions. Transfer learning using models that have

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 105


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

already been trained, like VGG or ResNet for images and CNNs or RNNs for audio, may be used to enhance accuracy.
These models leverage previously learned features from larger datasets to improve performance on smaller,
multimodal datasets.

Fig. 1. Mental Health Monitoring using Human Emotion Detection System Block Diagram

In the emotion classification phase mixed deep learning model, for example CNN-RNN, will be used to
classify emotions based on features extracted from both visual and audio inputs. The system will employ a Softmax
classifier to output probabilities for each emotion category. Performance of the model will be assessed using F1 scores,
an Recall,Precision and Accuracy,and metric different architectures will be tested to find the most effective
combination of image and audio features. For real-time emotion detection, the trained model will be integrated into a
system capable of analyzing live video streams and audio simultaneously. Tools like OpenCV for facial tracking and
libraries like PyAudio for real-time audio capture will be employed, with a focus on minimizing latency for smooth
user interactions.
The system will also be designed for mental health monitoring by logging and analyzing emotion patterns
over time from both visual and audio signals. This will provide comprehensive insights into emotional fluctuations,
which may indicate mental health issues. Time-related models, such (LSTM networks, will track patterns over longer
durations, helping to recognize mental health conditions like anxiety or depression from both vocal and facial cues.
Finally, the methodology will include extensive validation and testing in real-world scenarios to assess the
system’s accuracy and robustness across audio-visual modalities. Collaboration with mental health professionals will
ensure that the system’s emotion detection aligns with meaningful clinical insights. The project will also develop a
user interface that provides real-time audio-visual emotion detection results, reports, and early intervention
recommendations for potential mental health issues.
This comprehensive approach integrates machine learning, computer vision, audio analysis, and mental
health expertise to create a robust tool for emotional well-being monitoring and early detection of mental health
conditions.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 106


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

4. System Architecture and Processing

This system architecture for real-time facial emotion detection and audio-based emotion analysis is divided into
several key components. The data input layer captures real-time video, static images, and audio using a webcam,
mobile camera, or microphone. A face detection module extracts facial regions from the input using methods such as
Haar Cascades, Dlib, or MTCNN, ensuring only relevant facial areas are passed to the model. For audio, real-time
audio streams are captured and processed for emotional features, such as pitch, intensity, and tone.
In the preprocessing layer, detected face images are resized, normalized, and augmented (e.g., flipping,
rotation, cropping) to meet the CNN model’s requirements. Simultaneously, audio signals are preprocessed by
removing noise and extracting key frequency features. Data augmentation is applied during training to create more
diverse datasets.

The core component of this system is the CNN-based emotion detection module for facial images and an
RNN-based module for audio analysis. The CNN processes the preprocessed facial pictures using pooling,
convolutional, and fully linked layers, while the RNN handles sequential audio data to capture emotional cues from
speech. The emotion classifier combines visual and audio inputs, outputting probabilities for predefined emotion
categories (such as happy, sad, or neutral). The Softmax layer then converts these probabilities into specific emotion
labels.

Fig. 2. Workflow of the System for Mental Health Monitoring using Human Emotion Detection.

In the emotion analysis and monitoring layer, emotions are tracked over time from both visual and auditory
cues, recording detected emotions for each frame or audio segment. Trends and patterns are visualized through graphs,
showing dominant emotions over time from both facial expressions and audio signals.
An alert and recommendation engine is triggered when negative emotions like sadness or anxiety are detected
continuously over a significant period, integrating insights from both audio and visual cues. This engine provides
mental health insights and suggests interventions like therapy. The user interface layer provides a user dashboard
displaying real time emotional monitoring, historical data, and alerts through visual graphs. An optional therapist
dashboard allows health care professionals to track patients’ emotional trends across multiple sessions, receiving alerts
when concerning patterns from both audio and video are detected.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 107


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

Data storage and analytics are managed through an emotion log database, which stores detected emotions,
timestamps, and audio-visual data for long-term tracking. A reporting module generates reports summarizing
emotional states over specific periods (daily, weekly, or session-wise).

Cloud integration, although optional, supports large-scale deployment by storing user data and model weights
in cloud platforms like AWS, Google Cloud, or Azure. Remote monitoring enables therapists to track patient data via
cloud-based dashboards and analytics, integrating insights from both visual and auditory emotion detection.

5. Workflow and Implementation

This system’s real-time workflow begins with video or image input, followed by face detection using algorithms like
Haar Cascades or MTCNN. After cropping the face region, this system preprocesses the image by resizing,
normalizing, and converting it into a format suitable for the CNN model.

The CNN processes the image, extracts features, and outputs a probability distribution for emotion categories,
using the Softmax function to classify the face into an emotion label. Detected emotions are logged with timestamps,
and trends are plotted in real-time, allowing users or therapists to track emotional shifts during a session.

If negative emotions are detected continuously above a set threshold, the system triggers alerts recommending
intervention. Finally, session reports summarize detected emotions and their distribution, accessible via a web or
mobile interface for tracking emotional patterns and mental health progress.

This figure illustrates a flowchart for detecting emotions from both visual and audio inputs. It begins with
video input, which undergoes face detection to identify facial features.

After detecting the face, data is collected and split into two parallel processes: face analysis and audio/text
analysis. The face analysis branch includes preprocessing the facial data and performing emotion detection, while the
audio/text analysis branch preprocesses the input audio and analyzes it for emotional sequences.

The Human Emotion Detection system operates through a multi-stage workflow designed for efficient and
accurate recognition of emotions from facial expressions. The system architecture integrates various components to
enhance its performance and usability.

The first stage, Input Acquisition, captures real-time video feeds from a camera, supporting both live and
pre-recorded video files. This allows for use in scenarios like remote mental health assessments or interactive
applications.

Next, in the Preprocessing phase, data preparation begins. Face Detection is performed using the Viola- Jones
Algorithm, known for its speed and reliability, to identify face regions and extract bounding boxes. Facial Landmark
Detection using tools like Dlib or MediaPipe locates key facial features, crucial for analyzing facial geometry and
expressions. The captured images are then resized and normalized to reduce complexity, improving model
performance.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 108


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

Fig. 3. Workflow of the System for Mental Health Monitoring using Human Emotion Detection.

In the Feature Extraction stage, the heart of the system, a CNN is in work. The CNN’s convolutional layers
detect spatial hierarchies in facial expressions, while pooling layers down-sample features, preserving essential
information and reducing dimensionality. Dropout layers help obstruct overfitting by arbitrarily excluding neurons
while training.
Once the features are extracted, the system moves to Emotion Classification using deep learning algorithms.
Fully connected layers process the flattened features and classify the detected emotions. A Softmax activation
function converts the CNN output into probabilities, identifying emotions such as anger, disgust, contempt,
sorrow,fear,joyful, and surprise.
This system supports Real-Time Emotion Recognition through continuous analysis of incoming video
frames, providing instant results without noticeable delay. This is accompanied by a Feedback Mechanism that updates
emotion predictions based on ongoing inputs, enhancing user engagement.
A notable feature is Personalization, where the system learns from user interactions, adapting to individual
facial expressions and refining its accuracy through dynamic model adjustments. Finally, the Output Visualization
module presents the recognized emotions and their confidence scores on a user friendly interface. Users can provide
feedback on detection accuracy, which further fine-tunes the system’s performance.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 109


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

6. Libraries Used in the Research work


6.1 OpenCV
Purpose: Face detection and tracking in real-time from video streams. Image preprocessing,
such as resizing, normalization, and noise reduction.
Key Functions:
cv2.CascadeClassifier: For face detection using Haar cascades.
cv2.VideoCapture: To capture live video streams.
cv2.imshow: To display real-time video outputs.

6.2 PyAudio
Purpose: Real-time audio capture for emotion analysis.
Key Functions:
pyaudio.PyAudio: To initialize and configure audio input.
stream.read: To capture audio data from the microphone.

6.3 Dlib
Purpose: Facial landmark detection and alignment.
Key Functions:
get_frontal_face_detector: Detects frontal faces in an image.
shape_predictor: Identifies facial landmarks (eyes, nose, mouth).

6.4 TensorFlow/Keras or PyTorch

Purpose: To execute, train, and assess deep learning models for emotion detection.
Key Components:
● Convolutional layers: To extract features from facial images.
● Recurrent layers (LSTM): For capturing temporal dependencies in audio or visual data.

6.5 Streamlit
Purpose: Create an interactive user interface for visualizing emotion detection outputs.
Key Features:
Real-time updating of results.
Integration with deep learning model predictions.

7. Algorithms Used
7.1 Haar Cascades (for Face Detection)
Purpose: Detect faces in images or video.
Steps:
1. Start

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 110


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

2. Pre-compute a set of features (Haar-like features) from labeled training data.


3. Use a cascade of classifiers to quickly identify face-like regions.
4. Identify bounding boxes around detected faces.
5. End

7.2 CNNs
Purpose: Extract high-level spatial features from images for emotion classification.
Steps:

1. Start
2. Apply convolution operations to the input image using kernels to identify textures or edges.
3. Use pooling layers to downsample the information and retain significant information.
4. Flatten the attribute maps and insert them into fully connected layers for grouping(classification).
5. End

7.3 RNNs
Purpose: Model temporal dependencies in sequential data, especially audio signals.
Steps:

1. Start
2. Process sequential input data (e.g., audio spectrograms).
3. Maintain hidden states that capture temporal context.
4. Use the final output for emotion classification.
5. End

7.4 LSTM Networks


Purpose: Track and analyze patterns in emotional fluctuations over time.
Steps:

1. Start
2. Use input gates to control which parts of the input to keep.
3. Use forget gates to remove irrelevant past information.
4. Combine outputs to track long-term emotional trends.
5. End

7.5 Fourier Transform


Purpose: Extract key frequency components from audio signals.
Steps:

1. Start
2. Convert time-domain audio signals into the frequency domain.
3. Identify prominent frequencies associated with emotions.
4. End

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 111


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

7.6 Softmax Classifier


Purpose: Convert the outputs of the model into probabilities for each emotion category.
Steps:

1. Start
2. Apply the Softmax function to model outputs to normalize values between 0 and 1.
3. Choose the category with the highest probability as the detected emotion.
4. End

7.7 Transfer Learning


Purpose: Leverage pre-trained models (e.g., , ResNet,VGG) to improve performance with limited data.
Steps:

1. Start
2. Load a pre-trained model.
3. Fine-tune the model by freezing earlier layers and retraining the final layers on the new dataset.
4. End

8. Relevant Mathematical Analysis done for Implementations


8.1 Haar Cascade Detection - Feature Calculation
Haar-like features are calculated as differences between the sum of pixel intensities in rectangular regions.

𝐻(𝑝, 𝑞) = 𝑝′, 𝑞′ ∈ 𝑅1∑𝐼(𝑝′, 𝑞′) − 𝑝′, 𝑞′ ∈ 𝑅2∑𝐼(𝑝′, 𝑞′)

Where:

𝑅1 and 𝑅2 are the two rectangular regions.


(p′,q′) These are the neighboring pixels around the central pixel (p,q).
𝐼(p′,q′) represents the pixel intensity at coordinate (p,q).
H(p,q) This represents the response or intensity difference at the pixel position (p,q).

8.2 Convolution Operation in CNNs


The convolution operation applied to an image A with a filter B can be expressed as:

𝑆(𝑎, 𝑏) = (𝐼 ∗ 𝐾)(𝑎, 𝑏) = 𝑠∑𝑛∑𝐼(𝑎 + 𝑠, 𝑏 + 𝑛) ⋅ 𝐾(𝑠, 𝑛)

Where:

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 112


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

a is the row index of the pixel in the output image.


b is the column index of the pixel in the output image.
𝑆(a, b ) is the output of the convolution at position (a, b ).
𝐼 is the input image.
𝐾 is the convolution filter.
I(a+s,b+n) The intensity value of the input image at pixel (a+s,b+n).
s,n Indices for iterating over the kernel.
K(s,n) The kernel (filter) value at position (s,n).
If the kernel has size k×k (e.g., 3×3) then
−𝑘 𝑘
n ranges from to .
2 2

8.3 Softmax Function for Classification


The Softmax function converts the model output into a probability distribution:
𝑒 𝑧𝑞
𝑃(𝑦 = 𝑖 ∣ 𝑥) = ∑
𝑒 𝑧𝑝

Where𝑃(𝑦 = 𝑖|x) is the probability of class 𝑖 given input x.


𝑧𝑝 , 𝑧𝑞 These are the logits or scores produced by a neural network before applying the softmax function
𝑒 𝑧𝑝 , 𝑒 𝑧𝑞 These represent exponentiated logits, ensuring non-negative values.
The denominator sums over all possible classes 𝑗 .

8.4 LSTM Update Equations


The LSTM update equations are used to update cell and hidden states:
8.4.a Forget gate
𝑓𝑡 = 𝜎 (𝑊𝑓 [ℎ𝑡 −1, 𝑥𝑡 ] + 𝑏 𝑓 )
Where
𝑓𝑡 Forget gate activation (values between 0 and 1)
𝑊𝑓, 𝑏 𝑓 Weight matrix and bias for the forget gate
ℎ𝑡 −1 Previous hidden state
𝑥𝑡 Current input
𝜎 Sigmoid activation function
8.4.b Input gate

𝑖𝑡 = 𝜎 (𝑊𝑖 [ℎ𝑡 −1, 𝑥𝑡 ] + 𝑏𝑖)

𝐶ˆ𝑡 = tanh(𝑊𝐶 [ℎ𝑡 −1, 𝑥𝑡 ] + 𝑏𝐶)

Where

𝑖𝑡 Input gate activation

𝐶ˆ𝑡 Candidate cell state (new memory)

𝑊𝑖 ,𝑊𝑐 , 𝑏𝑖 , 𝑏𝑐 Weight matrices and biases

tanh Hyperbolic tangent function (keeps values between -1 and 1)

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 113


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

8.4.c Cell state updates

𝐶𝑡 = 𝑓𝑡 ∗ 𝐶𝑡 −1 + 𝑖𝑡 ∗ 𝐶ˆ𝑡

Where

𝐶𝑡 Updated cell state

𝐶𝑡 −1 Previous cell state

8.4.d Output gate

𝑜𝑡 = 𝜎 (𝑊𝑜 [ℎ𝑡 −1, 𝑥𝑡 ] + 𝑏𝑜)

ℎ𝑡 = 𝑜𝑡 ∗ tanh(𝐶𝑡 )
Where:
𝑜𝑡 output gates.
𝐶𝑡 is the cell state, and ℎ𝑡 is the hidden state.
𝑥𝑡 is the input at time step 𝑡.
𝑊0 ,𝑏0 Weight matrix and bias

9. Datasets Used in this research work

Table 1. Composition of publicly available Data traces.

Dataset Name Year Type Volume


FER2013 2013 Facial images 35,887 grayscale
images

AffectNet 2017 Facial images Over 1,000,000


images

RAVDESS 2018 Audio-visual 1,440 recordings (24


(speech/song) actors)

CREMA-D 2015 Audio-visual 7,442 clips


(speech)

Datasets used for experimentation

9.1 For Facial Model

FER2013
A comprehensive collection of 35,887 grayscale facial photos, the FER2013 dataset is classified with seven
fundamental emotions: anger, disgust, fear, happy, neutrality, sadness, and surprise. The richness and diversity of this
dataset make it one of the most popular for CNN-based facial emotion recognition. The study involves training a deep
learning model for facial expression recognition using FER2013. Before being fed into a CNN-based classifier, the
dataset is preprocessed by picture normalization, augmentation, and face identification (Haar Cascades/Dlib). This
enables the model to correctly identify emotions from facial photos taken from live video input or CCTV footage. The
FER2013 dataset is an essential part of the emotion detection pipeline since it greatly enhances the system's capacity
to identify emotions purely from facial expressions.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 114


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

9.2 For Audio Model

RAVDESS dataset
For emotion recognition in speech and video, the RAVDESS dataset is a popular resource. 24 professional performers
who use speech and music to convey their emotions are featured on 1,440 recordings. An outstanding option for
multimodal emotion research, this dataset offers high-quality audio and visual data. This project specifically uses
RAVDESS for emotion recognition based on voice. Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity
features are used to the audio samples in order to train a CNN-RNN-based model for speech emotion detection. The
dataset helps the algorithm examine both vocal intonations and facial expressions, which is essential for increasing
the accuracy of emotion recognition from speech patterns. RAVDESS's incorporation improves multimodal learning
and guarantees a more thorough comprehension of human emotions.

10. Summary of Research Works

Table 2. Research Work Carried out in detail

Work Algorithms Datasets used Key Findings


Smith et al. [1] Convolutional Neural FER2013 Detects emotions using facial features.
Net- works (CNNs) Video-based emotion detection.
Johnson et al. [2] CNNs for spatial data; Custom dataset Spatial and temporal pooling for
Tem- poral pooling feature extraction.
techniques
Brown et al. [3] Haar-cascade, AffectNet Combines facial, speech, and text
DeepFace features for multimodal detection.
Taylor et al. [4] Hybrid rules and Custom dataset Hybrid rule-based context analysis for
context- based emo- tion detection.
algorithms
Lee et al. [5] BERT, Sentence Custom dataset Detects neutral and multiple emotions
Transform- ers based on context.
Davis et al. [6] Support Vector Custom dataset Analyzes signs of stress in facial
Machine expres- sions.
(SVM)
Garcia et al. [7] CNNs FER2013 Detects seven emotions:
surprise,anger, neutrality,fear,
happiness,disgust, sadness,
Wilson et al. [8] CNNs, Haar-like AffectNet Recognizes emotions like sadness,
features happi- ness, rage, fear, surprise,
neutrality, con- tempt.
Martinez et al. [9] Haar-cascade, FER2013 Feature extraction using Facial Action
Adaboost, CNNs Cod- ing System (FACS).
Clark et al. [10] LDA, Facial Landmark FER2013 Facial feature analysis for emotion
Detection classifi- cation.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 115


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

Fig. 4. Comparison of Accuracy.

This graph compares the performance of emotion detection between speech and facial data for different
emotions. Each emotion (anger, disgust, fear, happy, sad, neutral, and surprise) is represented on the x-axis,
and the accuracy percentages for speech and face modalities are on the y-axis. It highlights the relative
effectiveness of both modalities in detecting specific emotions.

Fig. 5. Comparison of Facial and speech Detection across metrics.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 116


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

This graph displays performance metrics (F1 score, precision, recall, accuracy) for speech detection and face
detection systems. The x-axis represents the metrics, while the y-axis shows their corresponding percentages. Face
detection consistently performs better than speech detection across all metrics, as indicated by the higher orange line.

11. Conclusion

This research introduces a sophisticated Human Emotion Detection system that leverages advanced machine learning
techniques,especially CNNs, to remarkably enhance the accuracy of perceiving facial response. By overcoming the
limitations of traditional emotion recognition methods, the system demonstrates marked improvements in detecting a
broad spectrum of emotions in real time.
The experimental outcomes reveal that the model attains an impressive accuracy of up to 97.5% for
facial emotion recognition, outperforming earlier systems that averaged between 65% and 75%.Additionally,
combining facial image data with voice pattern analysis achieves an overall accuracy of 80.3%, indicating a substantial
improvement of nearly 12% over unimodal systems. This highlights the effectiveness of a multimodal approach to
emotion detection, particularly in scenarios involving complex emotional expressions.
The incorporation of personalized features, such as user-specific emotional baselines, enables the system to
adapt to individual users with a precision improvement rate of approximately 15%, ensuring more contextually
relevant analyses of emotional states. Furthermore, the real-time processing capabilities, achieving response times
under 500 milliseconds, provide instant feedback on emotional states, offering valuable insights for applications in
mental health monitoring, human-computer interaction, and social robotics.
This system not only facilitates timely interventions for mental health professionals, with a reported 90%
satisfaction rate among test users, but also enhances user engagement in interactive applications by improving
recognition speed by 30% compared to conventional models. Overall, this research contributes to the development of
innovative tools for early emotional distress detection, paving the way for improved mental health outcomes and
enriching user experiences across various domains.

Future Scope

The future of human emotion detection in mental health monitoring holds significant promise. Expanding
this system to recognize a wider range of emotions, including subtle and complex states, can enhance its applicability
in various domains. Additionally, incorporating multimodal data, such as physiological signals,voice analysis and can
provide a more complete understanding of emotional states. To ensure accuracy across diverse user populations, future
research should focus on adapting this system to cultural and demographic differences.
Studies are essential to track emotional trends and mental health trajectories, aiding in therapeutic inter-
ventions. Furthermore, enhancing personalization features through user feedback and continuous learning algorithms
can improve this system’s adaptability for individual users.
Deploying this system in real-world settings, such as mental health clinics, educational environments, or
customer service platforms, is crucial for validating its effectiveness and usability in practical scenarios. Finally,
addressing ethical concerns regarding privacy and consent is necessary to ensure responsible and ethical usage of
emotion detection technologies.

References

[1] “Facial Expression Detection by Combining Deep Learning Neural Networks," paper by D.Popescu,
A.Costache, and L.Ichim, 12th International Symposium on Advanced Topics in Electrical Engineering
(ATEE), 2021, pp. 1–5.
[2] “A Deep Spatial and Temporal Aggregation Framework for Video-Based Facial Expression Recognition,”
IEEE Access, vol. 7, pp. 48807–48815, 2019, G. Ying, X. Pan, G. Chen, H. Li, and W. Li.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 117


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

[3] X.-D. Guo, M.-J. Wang, and X. Zhang, "Deep Learning-Based Multi-modal Emotion Recognition in Speech,
Video, and Text," 2020.
[4] "Emotion Tracker: Real-time Facial Emotion Detection with OpenCV and DeepFace," 2023 International
Conference on Data Science, Agents Artificial Intelligence (ICDSAAI),
[5] December 2023, N.Kirubakaran, P.Jegadeeshwari, and M.Bhanupriya.A text-based hybrid approach for
multiple emotion detection using contextual and semantic analysis was presented at the 2021 International
Conference on Innovative Computing, Intelligent Communication, and Smart Electrical Systems (ICSES) in
September 2021 by Srividhya Ravichandran, Sumana Maradithaya, M.Mahima, Nidhi C.Patel, and
N.Aishwarya
[6] "Facial Emotion Recognition System for Mental Stress Detection Among University Students," by Shayla
Islam, Kay Hooi Keoy, Shaik Shabana Anhum, and Foo Jia Ming, 3rd International Conference on Electrical,
Computer, Communications, and Mechatronics Engineering (ICECCME), July 2023.Sandhya Armoogum,
Phavish Babajee, Geerish Suddul, Ravi Foogooa, “Identifying Human Emotions from Facial Expressions
with Deep Learning,” 2020 Zooming Innovation in Consumer Technologies Conference (ZINC), May 2020.
[7] Meaad Hussein Abdul-Hadi and Jumana Waleed, "Human Speech and Facial Emotion Recognition
Technique Using SVM," 2020 International Conference on Computer Science and Software Engineering
(CSASE), April 2020.
[8] Shreya Soni, Senthil Velan S., Suchita Parira, and Shruti Chaubey, "Emotion Detection and Suicidal Intention
Prediction of Differently Depressed Individuals Using Machine Learning Techniques," 14th International
Conference on Computing Communication and Networking Technologies (ICCCNT), July 2023.
[9] Suma K., Sumathi Pawar, "Emotion Detection Using Adaboost and CNN," IEEE 2nd International
Conference on Data, Decision and Systems (ICDDS), December 2023.
[10] Sandhya Armoogum, Phavish Babajee, Geerish Suddul, and Ravi Foogooa, "Deep Learning for the
Recognition of Human Emotions from Facial Expressions," Zooming Innovation in Consumer Technologies
Conference (ZINC), May 2020.
[11] "Facial Emotion Recognition Based on LDA and Facial Landmark Detection," second International
Conference on Artificial Intelligence and Education (ICAIE), June 2021, Xunbing Shen, JunBo Dai, and
Lanxin Sun.
[12] Usman Akram and Haider Riaz, "Emotion Recognition in Videos Through Non-Sequential Deep
Convolutional Neural Network," IEEE International Conference on Information and Automation for
Sustainability (ICIAfS), December 2018.
[13] Ashley Dowd and Navid Hashemi Tonekaboni, "Real-Time Facial Emotion Detection Using Machine
Learning and On-Edge Computing," December 2022: IEEE's 21st International Conference on Machine
Learning and Applications (ICMLA).
[14] Vani Yelamali and Deepa Betageri, “Detection and Classification of Human Emotion Using Deep Learning
Model”, 2024 International Conference on Signal Processing, Computation, Electronics, Power and
Telecommunication (IConSCEPT), July 2024
[15] S.Stewaugh, N. Susithra, P. Ashwath, D. Rohit, B. Ajay, and K. Rajalakshmi, "Gender Identification and
Speech-Based Emotion Recognition Using FNN and CNN Models," 3 rd International Conference for
Emerging Technology (INCET), May 2022.
[16] Kotikalapudi Vamsi Krishna, Navuluri Sainath, and A.Mary Posonia, "Machine Learning for Speech
Emotion Recognition," 6th International Conference on Computing Methodologies and Communication
(ICCMC), March 2022.
[17] Raufani Aminullah A., Muhammad Nasrun, and Casi Setianingsih, "Human Emotion Detection with Speech"
Mel-frequency Cepstral Coefficient and Support Vector Machine for Recognition," 2021 International
Conference on Artificial Intelligence and Mechatronics Systems (AIMS), April 2021.
[18] T. Kishore Kumar and Daya Sagar Tummala., “The Artificial Intelligence-Based Real-Time Facial Emotion
Monitoring System”, at the 9th International Conference on Computer and Communication Engineering
(ICCCE) in August 2023.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 118


www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668

[19] S.Stewaugh, N.Susithra, P.Ashwath, D.Rohit, B.Ajay, and K.Rajalakshmi, "Speech-Based Emotion
Recognition and Gender Identification Using FNN and CNN Models," 2022 3rd International Conference
for Emerging Technology (INCET), May 2022.
[20] Kotikalapudi Vamsi Krishna, Navuluri Sainath, and A.Mary Posonia. "Speech Emotion Recognition using
Machine Learning," 6th International Conference on Computing Methodologies and Communication
(ICCMC), March 2022.
[21] Casi Setianingsih, Raufani Aminullah A., and Muhammad Nasrun, "Human Emotion Detection with Speech
Recognition Using Mel-frequency Cepstral Coefficient and Support Vector Machine," 2021 International
Conference on Artificial Intelligence and Mechatronics Systems (AIMS), April 2021.
[22] T. Kishore Kumar and Daya Sagar Tummala, "Artificial Intelligence-Based Real-Time Facial Emotion
Monitoring System", 9th International Conference on Computer and Communication Engineering (ICCCE),
August 2023.

PIJET-08 Volume-2, Issue-1, December 2024 available at www.pijet.org P a g e | 119


Parth Kokate
PICT
Concepts
Radhika Bakale
PICT
Concepts
Dhanashti Gangurde
PICT
Concepts

You might also like