A FINAL PROJECT REPORT
ON
Human Emotion Detection in Mental Health Monitoring
SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF
BACHELOR OF ENGINEERING (INFORMATION TECHNOLOGY)
BY
Dhanashri Gangurde B400050717
Radhika Bakale B400050664
Parth Kokate B400050755
Under the guidance of
Mrs. Archana Kadam
Department of Information Technology
Pune Institute of Computer Technology, Pune - 411 043
2024-2025
SCTR’s PUNE INSTITUTE OF COMPUTER TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY
CERTIFICATE
This is to certify that the final project report entitled Human Emotion Detection in
Mental Health Monitoring submitted by
Dhanashri Gangurde B400050717
Radhika Bakale B400050664
Parth Kokate B400050755
is a bonafide work carried out by them under the supervision of Mrs. Archana
Kadam and it is approved for the partial fulfillment of the requirement of
Savitribai Phule Pune University for the award of the Degree of Bachelor of
Engineering (Information Technology).
This project report has not been earlier submitted to any other Institute or
University for the award of any degree or diploma.

Mrs. Archana Kadam          Dr. A. S. Ghotkar
Project Guide               HOD IT

                            Dr. S. T. Gandhe
SPPU External Guide         Principal
Date:
Place:
i
Acknowledgement
First, we are sincerely thankful to Mrs. Archana Kadam for guiding us throughout
the semester. The guidance and support provided by our guide, Mrs. Archana Kadam,
have inspired us to approach the BE Project thoughtfully and helped us at every
phase of the project. We would also like to extend our gratitude to our reviewers, Dr.
Shyam Deshmukh and Mrs. Swapnaja R. Hiray, for their constructive feedback and
invaluable suggestions, which greatly improved the quality of our work. Our heartfelt
thanks go to our project coordinator, Mrs. Sumitra A. Jakhete, for her dedicated efforts
in ensuring the smooth progress of our project and for always being available to assist
us whenever needed. We are especially thankful to the Head of the IT Department,
Dr. A.S. Ghotkar, for providing us with all the necessary resources and facilities, which
greatly contributed to the successful completion of our project. We express our deepest
gratitude to the Principal Dr. S. T. Gandhe, whose leadership and encouragement have
fostered an environment conducive to learning and research. Lastly, we would like to
sincerely thank our family and friends for their unwavering support and encouragement
throughout this journey.
ii
Abstract
Human emotion detection has become an essential tool in mental health monitoring, of-
fering the potential for early detection of mental health disorders. Existing models for
emotion detection primarily rely on deep learning techniques such as Convolutional Neu-
ral Networks (CNNs) to analyze facial expressions and, in some cases, voice patterns.
These models have demonstrated the ability to recognize basic emotions such as anger,
disgust, fear, happiness, sadness, and surprise with a high degree of accuracy. However,
challenges remain in terms of real-time processing and personalization for individual
users. This research introduces a novel system designed to enhance early mental health
detection through advanced human emotion detection techniques. The system focuses on
analyzing facial expressions and audio patterns to identify potential signs of emotional
distress. By leveraging deep learning models, specifically refined CNN architectures, and
additional data preprocessing techniques, we aim to achieve an accuracy rate of 90 percent
for emotion recognition using facial images alone. When combining facial image data
with audio pattern analysis, the system reaches an overall accuracy of up to 77 percent.
The incorporation of real-time processing capabilities enables instant emotion detection
from live video and audio feeds, providing timely insights for mental health professionals.
Furthermore, the system features a personalization component that adapts to each user’s
unique emotional responses, improving detection accuracy over time. By combining facial
and voice data, the proposed system offers a comprehensive approach to human emotion
detection, with the goal of contributing to early intervention and better mental health
outcomes.
iii
Contents
Certificate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Literature Survey 3
2.1 Existing Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Research Gap Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
iv
3.6.5 Project Plan 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.6.6 PERT Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5 Implementation 19
5.1 Stages of Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1.2 Implementation of Modules . . . . . . . . . . . . . . . . . . . . . 19
5.2 Experimentation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6 Results 22
6.1 Results of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.2 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.2.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.2.2 Integration of Facial and Speech Emotion Analysis . . . . . . . . 24
6.2.3 Alert System Effectiveness . . . . . . . . . . . . . . . . . . . . . . 24
6.2.4 Video Upload and Processing . . . . . . . . . . . . . . . . . . . . 25
6.2.5 Emotion Analysis and Result Interpretation . . . . . . . . . . . . 25
6.3 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
References 33
Plagiarism Report 36
Base Paper 38
Review Sheets 55
Project Achievements 70
v
List of Figures
6.1 Dashboard for Video Upload for Emotion Detection and Analysis . . . . 22
6.2 Login Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.3 Registration Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.4 Output Of Human Emotion Detection Model . . . . . . . . . . . . . . . . 23
6.5 Unit Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.6 Dashboard Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
vi
Abbreviations
CNN : Convolutional Neural Networks
NN : Neural Network
vii
Human Emotion Detection in Mental Health Monitoring
1. Introduction
1.1 Introduction
Facial expressions play a crucial role in human-to-human communication, conveying emo-
tions that can significantly influence interactions. Research has demonstrated that rec-
ognizing and interpreting facial expressions is essential for effective communication, ac-
counting for a substantial portion of interpersonal interactions.
In the realm of human-computer interaction, the ability of machines to understand
and respond to human emotions is increasingly desired. This research focuses on devel-
oping a system that can accurately detect emotions from facial expressions, specifically
targeting mental health monitoring. By analyzing facial cues, the aim is to provide valu-
able insights into a person’s emotional state, potentially aiding in early detection and
intervention for mental health issues.
While previous research has primarily concentrated on facial emotion recognition, this
work extends it by incorporating audio and text analysis. This multimodal approach
allows for a more comprehensive understanding of a person’s emotional state, considering
the interplay of facial expressions, voice patterns, and linguistic cues. By combining these
modalities, the goal is to create a more robust and accurate system for human emotion
detection. This system can potentially be used to assist mental health professionals in
identifying individuals at risk of mental health problems and providing timely support.
1.2 Motivation
We chose to focus on Human Emotion Detection using Convolutional Neural Networks
(CNNs) for our final year project due to our shared interest in the intersection of tech-
nology and psychology, particularly in the area of mental health monitoring. Accurately
detecting and interpreting human emotions is crucial for enhancing mental health inter-
ventions, and we believe our work can significantly contribute to this important field.
The increasing prevalence of mental health issues in today’s society motivates us to
develop innovative technological solutions that can provide timely support and inter-
vention. By creating systems that leverage CNNs to recognize emotions through facial
expressions, we aim to assist mental health professionals in effectively monitoring patients.
1.3 Objectives
To enhance early detection of mental health disorders using advanced human emotion
detection techniques.
To analyze facial expressions and voice patterns to identify potential signs of emotional
distress.
To enable real-time emotion detection from live video and audio feeds for timely insights.
To provide tools for mental health professionals to monitor emotional trends and trigger
early interventions.
To improve the efficiency of the existing model for more accurate emotion detection.
To develop a user-friendly interface for both users and therapists to monitor emotional
states effectively.
1.4 Scope
The scope of the project encompasses real-time emotion detection through the analysis
of both facial expressions and voice patterns, aiming to monitor emotional states con-
tinuously. This includes mental health monitoring for the early detection of emotional
distress, providing insights into individuals’ emotional well-being. Additionally, the sys-
tem is designed with personalization features that allow it to adapt to each user’s unique
emotional patterns over time, ensuring more accurate and relevant emotional assessments.
The project will involve the development of a comprehensive system that integrates com-
puter vision, audio processing, and machine learning techniques. This integrated solu-
tion is intended for clinical applications, enabling mental health professionals to leverage
emotion detection in patient care, or for use in human-computer interaction systems,
enhancing the responsiveness and adaptability of technology to human emotions.
2. Literature Survey
2.1 Existing Methodologies
The system in [1] designed a model that detects human emotions based on facial image
datasets, achieving 93 percent accuracy. A video-based emotion detection algorithm was
presented in [2]. That system investigated different methods for pooling spatial and
temporal data, discovering that pooling spatial and temporal information together is
more efficient for video-based facial expression identification. A multi-modal emotion
detection model based on deep learning is presented in [3]; emotion detection is based
not only on facial features but also on speech, video, and text. The system in [4] offers
real-time facial expression recognition using OpenCV for video, DeepFace for emotion
analysis, and a Streamlit interface for user interaction. It effectively detects emotions and
presents results clearly. The research paper [5] introduces a hybrid method using rules,
emotions, and context to enhance word meaning detection. It leverages sentence
transformers and BERT to identify human emotions, including neutral, and tags multiple
emotions based on context. This approach surpasses existing emotion detection methods.
The research paper [6] aims to create a Facial Emotion Recognition System to help detect
mental stress, benefiting university students and counseling departments. By analyzing
facial expressions, the system identifies signs of stress in individuals. The research paper
[7] proposes a technique for emotion recognition using both speech and facial expressions
with a support vector machine (SVM). Results show improved performance, with a
recognition accuracy of 92.88 percent for the facial model and 85.72 percent for the
speech model, outperforming recent methods while being time-efficient.

The research paper [8] used deep learning, particularly convolutional neural networks, to
detect seven key emotions: anger, disgust, fear, happiness, sadness, surprise, and
neutrality. This helps monitor depressed individuals and predict suicide risk by analyzing
their emotional state. The system in [9] recognized emotions such as sadness, happiness,
rage, fear, surprise, neutrality, and contempt, detecting these seven emotions through
Haar-cascade, Adaboost, and Convolutional Neural Network algorithms. The pre-training
phase includes a face detection system with noise removal and feature extraction, and the
classification model predicts seven emotions from the Facial Action Coding System
(FACS) [10]. Current results show 79.8 percent accuracy for detecting these emotions,
without using optimization techniques. The research paper [11] focuses on extracting
facial features using Linear Discriminant Analysis (LDA) and Facial Landmark Detection.
Test results show that emotion recognition accuracy is 73.9 percent with LDA and 84.5
percent using Facial Landmark Detection. The method proposed in [12] introduces a
non-sequential deep convolutional neural network featuring multiple parallel networks.
Its evaluation uses the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset, which
includes videos of four individuals expressing seven emotions. The model in [13] achieves
87.0 percent accuracy using the FER2013, AffectNet, JAFFE, CK+, and KDEF datasets,
outperforming current real-time models, which typically achieve 65-75 percent accuracy.
Its simplified architecture makes it lightweight and suitable for deployment on various
edge devices for real-time applications. The system in [14] proposes a CNN model for
facial emotion recognition using six convolutional layers, max pooling, and two fully
connected layers. A Haar cascade detector identifies faces, classifying them into seven
emotions, and this model achieved 77.22 percent accuracy on the FER2013 dataset. The
system in [15] designed a facial emotion recognition (FER) model using a Convolutional
Neural Network (CNN) that employs the Viola-Jones algorithm for face detection and
neural networks for emotion classification. This model, featuring six convolutional layers
and three fully connected layers, achieved 68.26 percent accuracy on the FER2013 dataset
and 91.58 percent on the CK+ dataset. The system in [16] develops two deep learning
models for detecting fake emotions, one analyzing facial expressions and the other
focusing on emotional speech. The facial expression model achieved 70 percent accuracy,
while the speech-based model reached 96.93 percent accuracy, demonstrating the
effectiveness of the approach in enhancing both social and human-computer interactions.
The system in [17] uses a Convolutional Neural Network (CNN) and OpenCV to detect
live human emotions from facial expressions, aiming to bridge the gap in human-computer
interaction. It identifies emotions like neutral, happy, sad, surprise, angry, fear, and
disgust from real-time webcam input.

The work in [18] focuses on improving speech emotion recognition (SER) using a hybrid
CNN-BiLSTM model trained on a merged dataset of RAVDESS, TESS, and CREMA-D,
recognizing eight emotions. This model, utilizing features like Zero Crossing Rate (ZCR),
Root Mean Square Energy (RMSE), and Mel Frequency Cepstral Coefficient (MFCC),
achieved 97.80 percent accuracy. The system in [19] presented a machine learning model
that uses a Feed Forward Neural Network for gender identification and a CNN for
detecting emotions (neutral, happy, sad, angry) from speech. This model achieved 91.46
percent accuracy in gender classification and 86 percent in emotion recognition, showing
promise for applications in human-computer interaction, customer service, and
healthcare. The system in [20] focuses on detecting emotions from speech using various
classification algorithms like Support Vector Machine and Multilayer Perceptron, with
audio features such as MFCC, MEL, chroma, and Tonnetz. These models were trained to
recognize emotions like calm, neutral, surprise, happy, sad, angry, fearful, and disgust,
achieving an accuracy of 86.5 percent. The author of [21] focuses on detecting human
emotions from sound signals using the Mel-Frequency Cepstral Coefficient (MFCC) for
feature extraction, as it closely mimics the human auditory system. A Support Vector
Machine (SVM) with the Radial Basis Function (RBF) kernel was used for classification,
achieving a highest accuracy of 72.5 percent with specific parameter settings, including a
0.001-second frame size, 80 filter banks, gamma values between 0.3 and 0.7, and a C
value of 1.0. Finally, the work in [22] develops a real-time facial emotion recognition
(FER) system using a CNN model trained on the FER-2013 dataset to track and report
individual emotions in real time. This system, using the Viola-Jones algorithm for face
detection, achieves 90.40 percent accuracy and generates a summary report of detected
emotions over a time interval.
3.2 Scope
This project aims to detect emotions in real-time using facial expressions and voice anal-
ysis, supporting continuous emotional monitoring. It focuses on early detection of emo-
tional distress for mental health applications and adapts to individual emotional patterns
for personalized assessments. By combining computer vision, audio processing, and ma-
chine learning, the system can be used in clinical settings or human-computer interaction
to enhance emotional awareness and responsiveness.
3.3 Objectives
The project aims to improve early detection of mental health issues by using advanced
emotion recognition techniques. It analyzes facial expressions and voice patterns to iden-
tify emotional distress in real time from live video and audio. The system also supports
mental health professionals by providing tools to monitor emotional trends and enable
timely interventions, while focusing on improving the accuracy and efficiency of the emo-
tion detection model.
The proposed methodology begins with data collection: an extensive collection of facial
expression images and audio recordings with accompanying emotion labels (such as joy,
sadness, anger, and fear) will be gathered. Existing datasets like FER2013 can be
used, or new real-world data can be collected. The dataset must be diverse, representing
various demographic groups to improve the model’s generalizability.
In the preprocessing stage, facial images will be normalized and aligned to standardize
input data, removing noise and ensuring consistency in image dimensions. Similarly, au-
dio data will undergo preprocessing, including noise reduction, segmentation, and feature
extraction (using techniques like Fourier transforms to capture key audio frequencies).
Face detection techniques, such as Haar cascades or Dlib, will be applied to extract key
facial regions, focusing on features critical for emotional expression (e.g., eyes, mouth,
forehead).
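To make the audio preprocessing step more concrete, the following minimal sketch assumes the librosa library, a hypothetical input file, and MFCC features as one common Fourier-based representation; it trims silence as a simple form of noise reduction and returns a fixed-length feature vector per clip.

# Illustrative sketch only: audio preprocessing with librosa (an assumed dependency).
# The file name and parameter values are hypothetical placeholders.
import librosa
import numpy as np

def extract_audio_features(path, sr=16000, n_mfcc=40):
    signal, sr = librosa.load(path, sr=sr)                        # load and resample
    signal, _ = librosa.effects.trim(signal, top_db=25)           # trim silence (simple noise reduction)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)   # mel-scaled spectral features
    return np.mean(mfcc, axis=1)                                  # fixed-length vector per clip

features = extract_audio_features("sample_clip.wav")              # hypothetical file
print(features.shape)                                             # (40,)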
For feature extraction, convolutional neural networks (CNNs) will automatically iden-
tify high-level features in both facial images and audio signals that correspond to different
emotions. Transfer learning with pre-trained models, such as VGG or ResNet for images
and CNNs or RNNs for audio, may be used to enhance accuracy. These models lever-
age previously learned features from larger datasets to improve performance on smaller,
multimodal datasets.
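As an illustration of the transfer-learning idea (not the project’s final architecture), a pre-trained VGG16 backbone can be frozen and reused as a feature extractor for the seven emotion classes; the input shape and head sizes below are assumptions.

# Hedged sketch: VGG16 (pre-trained on ImageNet) reused as a frozen feature extractor.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(48, 48, 3))
base.trainable = False  # keep the pre-learned convolutional filters fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),  # seven emotion categories
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])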
In the emotion classification phase, a deep learning model, such as a hybrid CNN-
RNN, will be used to classify emotions based on features extracted from both visual and
audio inputs. The system will employ a Softmax classifier to output probabilities for each
emotion category. The model’s performance will be evaluated using accuracy, precision,
recall, and F1 scores, and different architectures will be tested to find the most effective
combination of image and audio features.
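A minimal sketch of such a hybrid model is shown below, with an LSTM standing in for the RNN branch; 48x48 grayscale face crops, 40-dimensional MFCC frames, and all layer sizes are illustrative assumptions rather than the evaluated configuration.

# Hedged sketch of a two-branch CNN (image) + LSTM (audio) classifier with a Softmax head.
from tensorflow.keras import layers, models

img_in = layers.Input(shape=(48, 48, 1))               # grayscale face crop
x = layers.Conv2D(32, (3, 3), activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, (3, 3), activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

aud_in = layers.Input(shape=(None, 40))                # variable-length sequence of MFCC frames
y = layers.LSTM(64)(aud_in)

z = layers.concatenate([x, y])                         # fuse visual and audio features
z = layers.Dense(128, activation="relu")(z)
out = layers.Dense(7, activation="softmax")(z)         # probability per emotion category

model = models.Model(inputs=[img_in, aud_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])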
For real-time emotion detection, the trained model will be integrated into a system
capable of analyzing live video streams and audio simultaneously. Tools like OpenCV for
facial tracking and libraries like PyAudio for real-time audio capture will be employed,
with a focus on minimizing latency for smooth user interactions.
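A simplified version of this real-time loop might look as follows; the trained model and a preprocessing helper are assumed to exist, OpenCV’s bundled Haar cascade is used for face tracking, and the parallel PyAudio audio capture is omitted for brevity.

# Hedged sketch of real-time facial emotion detection from a webcam feed.
# `model` (a trained classifier) and `preprocess` (face crop -> network input) are assumed helpers.
import cv2

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
labels = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.3, 5):
        face = preprocess(gray[y:y + h, x:x + w])             # assumed helper
        probs = model.predict(face, verbose=0)[0]             # assumed trained model
        cv2.putText(frame, labels[int(probs.argmax())], (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("emotion", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):                     # press q to quit
        break
cap.release()
cv2.destroyAllWindows()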
The system will also be designed for mental health monitoring by logging and analyzing
emotion patterns over time from both visual and audio signals. This will provide
comprehensive insights into emotional fluctuations, which may indicate mental health
issues. Temporal models, such as LSTM (Long Short-Term Memory) networks, will track
patterns over longer durations, helping to recognize mental health conditions like anxiety
or depression from both vocal and facial cues.
Finally, the methodology will include extensive validation and testing in real-world
scenarios to assess the system’s accuracy and robustness across audio-visual modalities.
Collaboration with mental health professionals will ensure that the system’s emotion
detection aligns with meaningful clinical insights. The project will also develop a user
interface that provides real-time audio-visual emotion detection results, reports, and early
intervention recommendations for potential mental health issues.
This system architecture for real-time facial emotion detection and audio-based emotion
analysis is divided into several key components. The data input layer captures real-
time video, static images, and audio using a webcam, mobile camera, or microphone. A
face detection module extracts facial regions from the input using methods such as Haar
Cascades, Dlib, or MTCNN, ensuring only relevant facial areas are passed to the model.
For audio, real-time audio streams are captured and processed for emotional features,
such as pitch, intensity, and tone.
In the preprocessing layer, detected face images are resized, normalized, and augmented
(e.g., flipping, rotation, cropping) to meet the CNN model’s requirements. Simultaneously,
audio signals are preprocessed by removing noise and extracting key frequency features.
Data augmentation is applied during training to create more diverse datasets.
The core component of this system is the CNN-based emotion detection module for
facial images and an RNN-based module for audio analysis. The CNN processes the
preprocessed facial images through convolutional, pooling, and fully connected layers,
while the RNN handles sequential audio data to capture emotional cues from speech.
The emotion classifier combines visual and audio inputs, outputting probabilities for
predefined emotion categories (such as happy, sad, or neutral). The Softmax layer then
converts these probabilities into specific emotion labels.
In the emotion analysis and monitoring layer, emotions are tracked over time from
both visual and auditory cues, recording detected emotions for each frame or audio segment.
Trends and patterns are visualized through graphs, showing dominant emotions over time
from both facial expressions and audio signals.
5. Implementation
5.1 Stages of Implementation
The Human Emotion Detection System follows a multi-stage workflow to enable effi-
cient and accurate emotion recognition from facial expressions. It integrates several
core components such as real-time video capture, preprocessing, feature extraction, deep
learning-based classification, and result visualization.
The preprocessing phase begins with Input Acquisition, where real-time video feeds are
captured using a webcam. The system supports both live and pre-recorded video, making
it suitable for diverse applications like mental health monitoring and human-computer
interaction.
Next, relevant facial regions are extracted from each video frame. Face detection is
performed using tools such as Haar Cascade Classifier or MTCNN to identify and crop
face regions. For increased accuracy, facial landmark detection using Dlib or MediaPipe
locates key facial features (eyes, nose, mouth). These extracted facial images are then
converted to grayscale, resized to 48×48 pixels, and normalized by scaling pixel values to
a [0,1] range. This step reduces computational load and boosts model efficiency.
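A minimal sketch of this preprocessing step, assuming OpenCV and its bundled Haar cascade, is given below; it detects the first face, converts it to grayscale, resizes it to 48x48, and scales pixel values to [0, 1].

# Illustrative preprocessing helper for the pipeline described above.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_frame(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None                                    # no face found in this frame
    x, y, w, h = faces[0]                              # take the first detected face
    face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
    face = face.astype("float32") / 255.0              # normalize pixel values to [0, 1]
    return face.reshape(1, 48, 48, 1)                  # add batch and channel dimensions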
The core module of the system is the Feature Extraction and Classification module,
powered by a deep Convolutional Neural Network (CNN). The CNN architecture consists
of:
Convolutional layers for spatial feature extraction
Max pooling layers for downsampling feature maps
Fully connected layers for classification
Softmax layer for generating a probability distribution over seven emotion categories:
Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral
The system is trained on the FER-2013 dataset, a widely used dataset for facial
emotion recognition.
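A compact Keras sketch of this kind of network is shown below; the number of layers and filters is an illustrative assumption rather than the exact trained architecture.

# Hedged sketch of a CNN classifier over the seven FER-2013 emotion categories.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),                           # grayscale 48x48 face crops
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                                       # 50 percent dropout
    layers.Dense(7, activation="softmax"),                     # Angry ... Neutral
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])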
For real-time emotion detection, the trained model is deployed to process live webcam
feeds. The system logs emotions with timestamps, enabling users to observe emotional
trends during a session. An alert mechanism is triggered if negative emotions persist,
suggesting early interventions.
The system is deployed as a web-based application using:
FastAPI for backend inference
Streamlit for real-time visualization and the user interface
Users can upload images, videos, or use live webcam input. The application displays
real-time emotion predictions and visualizes emotional trends through dynamic plots.
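For illustration, a FastAPI inference endpoint of this kind could be sketched as follows; the route name, the preprocess_frame helper, and the loaded model are assumptions, and a Streamlit front end would call this endpoint and plot the returned predictions over time.

# Hedged sketch of a FastAPI endpoint for image-based emotion inference.
import numpy as np
import cv2
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
labels = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

@app.post("/predict")                                   # hypothetical route name
async def predict(file: UploadFile = File(...)):
    data = np.frombuffer(await file.read(), dtype=np.uint8)
    frame = cv2.imdecode(data, cv2.IMREAD_COLOR)
    face = preprocess_frame(frame)                      # assumed helper (see preprocessing sketch)
    if face is None:
        return {"emotion": None, "detail": "no face detected"}
    probs = model.predict(face, verbose=0)[0]           # assumed trained CNN loaded at startup
    return {"emotion": labels[int(probs.argmax())], "confidence": float(probs.max())}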
The model is trained for 20 epochs, meaning that the entire training dataset is passed
through the model 20 times. A batch size of 64 is used, which means that the model
updates its weights after processing every 64 images. This batch size strikes a balance
between training speed and stability of the learning process.
To prevent overfitting, where the model performs well on the training data but poorly
on unseen data, a Dropout layer with a rate of 50 percent is added. This technique
randomly disables half of the neurons during each training iteration, forcing the model to
learn more robust and generalizable features rather than memorizing the training data.
Additionally, data augmentation techniques such as horizontal flipping, slight rotation,
and zooming may also be applied during training to expose the model to more diverse
variations of facial expressions, thereby enhancing its ability to generalize across different
faces and conditions.
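Under these stated settings (20 epochs, batch size 64, and light augmentation), the training call might be sketched as below; the training and validation arrays are assumed to be prepared FER-2013 images and one-hot labels.

# Hedged sketch of the training configuration described above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    horizontal_flip=True,    # mirror faces
    rotation_range=10,       # slight rotation, in degrees
    zoom_range=0.1,          # mild zooming
)

history = model.fit(
    augmenter.flow(x_train, y_train, batch_size=64),    # weights update after every 64 images
    validation_data=(x_val, y_val),                     # assumed held-out split
    epochs=20,                                          # 20 full passes over the training set
)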
The performance of the trained model is monitored using metrics such as accuracy,
precision, recall, and confusion matrix analysis on the test dataset to ensure that the
model reliably detects and distinguishes between different emotional states.
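These metrics can be computed with scikit-learn as in the short sketch below, where the trained model, the test arrays, and the label names are assumed to be available.

# Hedged sketch of the evaluation step: per-class precision/recall/F1 and a confusion matrix.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

labels = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]
y_prob = model.predict(x_test, verbose=0)               # assumed trained model and test images
y_pred = np.argmax(y_prob, axis=1)
y_true = np.argmax(y_test, axis=1)                      # one-hot labels back to class indices

print(classification_report(y_true, y_pred, target_names=labels))
print(confusion_matrix(y_true, y_pred))                 # rows: true emotion, columns: predicted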
6. Results
6.1 Results of Experiments
Figure 6.1: Dashboard for Video Upload for Emotion Detection and Analysis
The effectiveness of the model was evaluated using standard classification metrics. Accu-
racy was used to measure the proportion of correctly predicted emotions out of the total
predictions. Precision and recall were calculated to assess how well the model identified
each emotion, with precision indicating the accuracy of positive predictions and recall
measuring the model’s ability to capture all relevant instances. The F1 score, which is
the harmonic mean of precision and recall, provided a balanced evaluation metric, espe-
cially useful in cases of class imbalance. Additionally, a confusion matrix was generated
to visualize the performance of the model across different emotion categories, helping
identify specific areas of misclassification.
Model Performance Comparison:
Facial and speech emotion analysis enhances emotion recognition accuracy by leveraging
complementary visual and auditory cues. Facial expressions provide spatial features,
while speech patterns add temporal characteristics, improving classification reliability.
In this study, a Convolutional Neural Network (CNN) achieved 85% accuracy in facial
emotion recognition but struggled with visually similar emotions like Fear and Surprise.
To address this, a multimodal fusion approach combining CNN and Long Short-Term
Memory (LSTM) networks was implemented. The LSTM model captured sequential
speech patterns, improving emotion differentiation. This fusion increased accuracy to
91%, demonstrating the effectiveness of integrating speech and facial features in reducing
misclassification and enhancing system robustness.
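One simple way to realize such decision-level fusion is a weighted average of the per-class probabilities from the two models, as in the sketch below; the 0.6/0.4 weights are placeholders rather than values reported in this study.

# Hedged sketch of decision-level fusion of facial (CNN) and speech (LSTM) probabilities.
import numpy as np

def fuse_predictions(face_probs, speech_probs, w_face=0.6, w_speech=0.4):
    combined = w_face * np.asarray(face_probs) + w_speech * np.asarray(speech_probs)
    return int(np.argmax(combined))                     # index of the fused emotion label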
The system monitors emotional trends and raises an alert if negative emotions persist
over a predefined threshold. Table 6.2 shows its effectiveness:
The system provides an interface for users to upload a video file for emotion analysis.
Upon receiving the video, the system extracts both facial and vocal features, ensuring
that the extracted data belongs to the same individual. These features are then processed
using deep learning models to identify the emotional states present in the video.
Once the extracted features are analyzed, the system generates a detailed report on the
detected emotions across different segments of the video. The final output displays the
most dominant emotion observed, helping in mental health monitoring. For instance, if
the system detects a predominantly ”Happy” emotion, it suggests that the individual is
not in distress and does not require immediate intervention.
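The per-segment summary and alert logic described here can be sketched as follows; the negative-emotion set and the 50 percent threshold are illustrative assumptions.

# Hedged sketch of the report/alert logic: dominant emotion plus a persistence-based alert.
from collections import Counter

NEGATIVE = {"Sad", "Angry", "Fear", "Disgust"}          # assumed set of negative emotions

def summarize(segment_emotions, alert_threshold=0.5):
    counts = Counter(segment_emotions)
    dominant, _ = counts.most_common(1)[0]
    negative_share = sum(counts[e] for e in NEGATIVE) / max(len(segment_emotions), 1)
    return {"dominant_emotion": dominant, "alert": negative_share >= alert_threshold}

print(summarize(["Happy", "Sad", "Sad", "Sad", "Neutral"]))
# {'dominant_emotion': 'Sad', 'alert': True}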
6.3 Testing
White Box Testing
Unit Testing
Integration Testing
Integration testing was performed to ensure that all individual modules — facial emotion
detection (CNN), audio emotion detection (BiLSTM), the Flask-based backend, and the
web frontend — work cohesively as a unified system. The primary focus was to verify
data flow, synchronization, and the correctness of the emotion fusion logic.
Test Scenario 1: Video Input Processing
Modules Involved: Frontend (Upload/Camera) → Flask API → Frame Extractor → CNN Model
Input: 5-minute CCTV footage with visible face and clear voice
Expected Output: Extracted frames processed, emotions detected for each segment, and saved in logs
Actual Output: Frames split into segments, CNN returned detected emotions: [’happy’, ’neutral’, ’happy’, ’sad’, ’neutral’]

Test Scenario 2: Audio Emotion Detection
Modules Involved: Audio Extractor → Preprocessor → BiLSTM Model
Input: Extracted audio from uploaded video
Expected Output: Emotion labels with timestamps
Actual Output: Audio processed successfully with BiLSTM, emotions detected: [’neutral’, ’sad’, ’sad’, ’neutral’, ’angry’]

Test Scenario 3: Fusion and Alert Logic
Modules Involved: Face Emotion + Audio Emotion → Fusion Layer → Alert Generator
Input: Detected face and audio emotions
Expected Output: Generate overall emotion per segment, and trigger an alert if negative emotions dominate
Actual Output: Combined emotions: [’happy’, ’sad’, ’sad’, ’sad’, ’neutral’]; Alert: ”Patient under emotional stress, needs attention.”

Test Scenario 4: Dashboard Output
Modules Involved: Flask API → HTML/JS Frontend → Chart Display
Expected Output: Graph and dominant emotion displayed on dashboard
Black Box Testing
Black box testing was performed to verify that the system behaves correctly for given inputs, regardless of internal code structure.
Test Cases
Test Case: Happy face and audio
Description: Testing the system’s detection when both facial expression and voice tone
show happiness.
Expected Output: System should detect ”happiness” and no mental health alert should
be triggered.
Actual Result: ”Happiness” detected from both modalities. No alert shown.
Test Case: Angry face only
Description: Video contains only angry facial expressions without corresponding angry
audio.
Expected Output: ”Angry” emotion detected and mental health alert triggered due to
consistent facial anger.
Actual Result: ”Angry” detected. Alert was correctly triggered.
Test Case: Sad face + neutral audio
Description: Emotion mismatch between sad face and calm/neutral voice tone.
Expected Output: System should detect ”Sad” or ”Neutral” and possibly issue a mental
health alert.
Actual Result: System showed ”Sad/Neutral” and triggered an alert as expected.
Test Case: No speech in audio
Description: Audio is completely silent, while video may or may not show facial emotions.
Expected Output: Audio model should return ”Neutral” or ”No Emotion” without sys-
tem crash.
Actual Result: ”Neutral” audio emotion returned. System worked as expected using face
emotion fallback.
Test Case: Long video with emotion mix
Description: 5-minute video with a variety of emotions across time intervals.
Expected Output: Emotion graph generated over time. Alert generated if negative emo-
tion dominates.
Actual Result: Emotion timeline graph displayed. Dominant ”Sad” emotion detected.
Alert shown correctly.
This study presents an advanced Human Emotion Detection system that utilizes cutting-
Experimental results reveal that the facial emotion recognition model achieves an impres-
contrast, the audio-based emotion recognition model attains an accuracy of up to 65 percent.
However, by integrating facial imagery with voice-based pattern recognition, the system’s
The system further incorporates personalized features like user-specific emotional base-
able for applications in mental health support, human-computer interaction, and social
robotics.
Notably, the system aids in timely mental health interventions, with eighty-five percent
of test users reporting high satisfaction. It also boosts user engagement in interactive
models. Overall, this research represents a meaningful step toward the development of
intelligent tools for early detection of emotional distress, supporting better mental health
layed because the system must handle both video and audio simultaneously. This can be
with high-performance CPUs and GPUs. On less powerful devices, such as phones, it may
Data Combination Difficulty: It is difficult to combine facial emotions and audio data
in real time. It can slow down the system, particularly if one sort of data (such as audio)
Environmental Issues: Things like poor lighting, noise, and things concealing the face
Battery Drain: Real-time processing consumes a lot of power, which can quickly
Adapting to Different Users: The system attempts to learn and adapt to each individual’s
unique manner of expressing emotions, but it may take some time to get it right.
Internet and Data Issues: If the system sends data over the internet (e.g., for cloud
processing), slow network speeds, data limits, and privacy concerns can cause problems.
The future of human emotion detection in mental health monitoring holds significant
promise. Expanding this system to recognize a wider range of emotions, including subtle
and complex states, can enhance its applicability in various domains. Additionally, incor-
porating multimodal data, such as voice analysis and physiological signals, can provide
verse user populations, future research should focus on adapting this system to cultural
Studies are essential to track emotional trends and mental health trajectories, aiding in
feedback and continuous learning algorithms can improve this system’s adaptability for
individual users.
Deploying this system in real-world settings, such as mental health clinics, educational
environments, or customer service platforms, is crucial for validating its effectiveness and
and consent is to ensure responsible and ethical usage of emotion detection technologies.
Bibliography
[3] X. Zhang, M.-J. Wang, X.-D. Guo, “Multi-modal Emotion Recognition Based on Deep
Learning in Speech Video and Text”, 2020 IEEE 5th International Conference on Signal
and Image Processing (ICSIP), pp. 328-333, 2020.
[6] Foo Jia Ming, Shaik Shabana Anhum, Shayla Islam, Kay Hooi Keoy, “Facial Emotion
Recognition System for Mental Stress Detection among University Students”, 2023 3rd
International Conference on Electrical, Computer, Communications and Mechatronics
Engineering (ICECCME), July 2023.
[7] Meaad Hussein Abdul-Hadi, Jumana Waleed, “Human Speech and Facial Emotion
Recognition Technique Using SVM”, 2020 International Conference on Computer Science
and Software Engineering (CSASE), April 2020.
[8] Shreya Soni, Shruti Chaubey, Suchita Parira, Senthil Velan S., “Emotion Detection
and Suicidal Intention Prediction of Differently Depressed Individuals Using Machine
Learning Techniques”, 2023 14th International Conference on Computing Communication
and Networking Technologies (ICCCNT), July 2023.
[9] Sumathi Pawar, Suma K., “Emotion Detection Using Adaboost and CNN”, 2023 IEEE
2nd International Conference on Data, Decision and Systems (ICDDS), Dec 2023.
[10] Phavish Babajee, Geerish Suddul, Sandhya Armoogum, Ravi Foogooa, “Identifying
Human Emotions from Facial Expressions with Deep Learning”, 2020 Zooming Innovation
in Consumer Technologies Conference (ZINC), May 2020.
[11] Lanxin Sun, JunBo Dai, Xunbing Shen, “Facial emotion recognition based on LDA
and Facial Landmark Detection”, 2021 2nd International Conference on Artificial
Intelligence and Education (ICAIE), June 2021.
[12] Haider Riaz, Usman Akram, “Emotion Detection in Videos Using Non Sequential
Deep Convolutional Neural Network”, 2018 IEEE International Conference on
Information and Automation for Sustainability (ICIAfS), Dec 2018.
[13] Ashley Dowd, Navid Hashemi Tonekaboni, “Real-Time Facial Emotion Detection
Through the Use of Machine Learning and On-Edge Computing”, 2022 21st IEEE
International Conference on Machine Learning and Applications (ICMLA), Dec 2022.
[14] Deepa Betageri, Vani Yelamali, “Detection and Classification of Human Emotion
Using Deep Learning Model”, 2024 International Conference on Signal Processing,
Computation, Electronics, Power and Telecommunication (IConSCEPT), July 2024.
[15] Renu Dalal, Manju Khari, Priyank Pandey, Samanvay Jatana, Vijay Joshi, “Facial
Emotion Recognition and Detection Using Convolutional Neural Networks”, 2023 3rd
International Conference on Smart Generation Computing, Communication and
Networking (SMART GENCON), Dec 2023.
[16] Omar Sameh Badr, Nada Ibrahim, Amr EiMougy, “Fake Emotion Detection Using
Affective Cues and Speech Emotion Recognition for Improved Human Computer
Interaction”, 2023 2nd International Conference on Smart Cities 4.0, Oct 2023.
[17] Sarwesh Giri, Gurcheten Singh, Babul Kumar, Mehakpreet Singh, Deepanker
Vashisht, Sonu Sharma, “Emotion Detection with Facial Feature Recognition Using CNN
OpenCV”, 2022 2nd International Conference on Advance Computing and Innovative
Technologies in Engineering (ICACITE), April 2022.
[18] Auhona Islam, Md Foysal, Md Imteaz Ahmed, “Emotion Recognition from Speech
Audio Signals using CNN-BiLSTM Hybrid Model”, 2024 3rd International Conference on
Advancement in Electrical and Electronic Engineering (ICAEEE), April 2024.
[20] Kotikalapudi Vamsi Krishna, Navuluri Sainath, A. Mary Posonia, “Speech Emotion
Recognition using Machine Learning”, 2022 6th International Conference on Computing
Methodologies and Communication (ICCMC), March 2022.
[21] Raufani Aminullah A., Muhammad Nasrun, Casi Setianingsih, “Human Emotion
Detection with Speech Recognition Using Mel-frequency Cepstral Coefficient and Support
Vector Machine”, 2021 International Conference on Artificial Intelligence and
Mechatronics Systems (AIMS), April 2021.
Similarity 4%
URL: https://www.coursehero.com/file/83472522/PROJECT-REPORT-1pdf/
2
Fetched: 2025-04-23 11:11:00
URL: https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00175/full
1
Fetched: 2025-04-23 11:12:00
URL: https://arxiv.org/pdf/2503.08002
1
Fetched: 2025-04-23 11:11:00
URL: https://medium.com/@azizozmen/understanding-multicollinearity-its-impact-on-data-analysis-and-machine-learning-94da6620569e
1
Fetched: 2025-04-23 11:11:00
URL: https://www.coursehero.com/file/240962458/Assignment-1-Part-1-Byte-Pair-Encoding-BPE-Implementation-and-Evaluation-on-NLTK-Dataset-1pdf/
1
Fetched: 2025-04-23 11:12:00
URL: https://m2.mtmt.hu/api/publication/33142353?format=xml&labelLang=hun
1
Fetched: 2025-04-23 11:12:00
URL: https://www.preprints.org/manuscript/202412.0637/v1
1
Fetched: 2025-04-23 11:11:00
URL: https://www.rmkec.ac.in/2023/wp-content/uploads/2023/02/News-letter-even.pdf
2
Fetched: 2025-04-23 11:11:00
URL: https://thesai.org/Downloads/Volume16No1/Paper_67-Enhanced_Facial_Expression_Recognition.pdf
1
Fetched: 2025-04-23 11:12:00
URL: https://arxiv.org/html/2502.13080v1
1
Fetched: 2025-04-23 11:12:00
URL: https://researchr.org/alias/ashley-dowd
1
Fetched: 2025-04-23 11:12:00
1
URL: https://escholarship.org/content/qt14m2h5gn/qt14m2h5gn.pdf
1
Fetched: 2025-04-23 11:12:00
URL: https://kitsw.ac.in/homepage_pages/pdfs/annual_reports/2022-23%2520Annual%2520Report.pdf
1
Fetched: 2025-04-23 11:12:00
URL: https://ijsrcseit.com/index.php/home/article/download/CSEIT251112266/CSEIT251112266/1906
2
Fetched: 2025-04-23 11:11:00
URL: https://www.ulab.edu.bd/faculty/md-nazmul-abdal
1
Fetched: 2025-04-23 11:12:00
URL: https://www.ijeat.org/wp-content/uploads/papers/v12i1/A38021012122.pdf
1
Fetched: 2025-04-23 11:12:00
URL: https://www.technoarete.org/common_abstract/pdf/IJERCSE/v10/i11/Ext_19537.pdf
1
Fetched: 2025-04-23 11:12:00
Entire Document
A FINAL PROJECT REPORT ON Human Emotion Detection in Mental Health Monitoring
SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE OF BACHELOR OF ENGINEERING
INFORMATION
TECHNOLOGY BY Dhanashri Gangurde B400050717 Radhika Bakale B400050664 Parth Kokate B400050755 Under the
guidance of Mrs. Archana Kadam Department Of Information Technology Pune Institute of Computer Technology Pune
- 411 043. 2024-2025
SCTR’s PUNE INSTITUTE OF COMPUTER TECHNOLOGY DEPARTMENT OF INFORMATION TECHNOLOGY C E R T I F I C
A T E This is to certify that the final project report entitled Human Emotion Detection in Mental Health Monitoring
submitted by Dhanashri Gangurde B400050717 Radhika Bakale B400050664 Parth Kokate
B400050755 is a bonafide work carried out by them under the supervision of Mrs. Archana Kadam and it is approved
for the partial fulfillment of the requirement of Savitribai Phule Pune University for the award of the Degree of Bachelor
of Engineering (
Information
Technology). This project report has not been earlier submitted to any other Institute or University for the award of any
degree or diploma. Mrs. Archana Kadam Dr. A. S. Ghotkar Project Guide HOD IT Dr. S. T. Gandhe SPPU External Guide
Principal Date: Place: i
2
www.pijet.org PICT’s International Journal of Engineering and Technology (PIJET) ISSN: 2584-2668
Abstract
Human emotion detection has become an essential tool in mental health monitoring, offering the potential for early
detection of mental health disorders. Existing models for emotion detection primarily Utilize (CNNs), in particular,
are deep learning technique, to perform analysis. facial expressions and, in some cases, voice patterns. These models
have demonstrated the ability to detect basic emotions such as sad, fear, anger, happy, disgust,surprise, neutral, and
sorrow with a high degree of accuracy. However, challenges remain in terms of real time processing and
personalization for individual users. This research introduces a novel system designed to enhance early mental health
detection through advanced human emotion detection techniques. The system focuses on analyzing facial expressions
and voice patterns to identify potential signs of emotional distress. By leveraging deep learning models, specifically
refined CNN architectures, and additional data preprocessing techniques, we aim to achieve an accuracy rate of 93
percent for emotion recognition using facial images alone. When combining facial image data with voice pattern
analysis, the system reaches an overall accuracy of up to 82 percent. The incorporation of real-time processing
capabilities enables instant emotion detection from live video and audio feeds, providing timely insights for mental
health professionals. Furthermore, the system features a personalization component that adapts to each user’s unique
emotional responses, improving detection accuracy over time. By combining facial and voice data, the proposed
system offers a comprehensive approach to human emotion detection, with the goal of contributing to early
intervention and better mental health outcomes.
Keywords: Human emotion detection, Mental health, Early intervention, machine learning, Deep learning, Facial
expression analysis, Voice analysis, Multimodal fusion, CNN, Neural Network
1. Introduction
Human-to-human communication relies heavily on facial expressions, conveying emotions that can significantly
influence interactions. Research has demonstrated that recognizing and interpreting facial expressions is essential for
effective communication, accounting for a substantial portion of interpersonal interactions. In the realm of interaction
between humans and computers, the capacity of machines to acknowledge and retaliate to Emotions in humans are
increasingly desired. The goal of this research is to create a system that can accurately detect emotions from facial
expressions, specifically targeting mental health monitoring. By analyzing facial cues, the aim is to provide valuable
insights into a person’s emotional state, potentially aiding in early detection and intervention for mental health issues.
While previous research has primarily concentrated on facial emotion recognition, the work extends this by
incorporating audio analysis. This multimodal strategy makes it possible for a more thorough understanding of a
person’s emotional state, considering the interplay of facial expressions, voice patterns, and linguistic cues. By
combining these modalities, The objective is to develop a human emotion detecting system that is more reliable and
accurate. This system can potentially be used to assist mental health professionals in identifying individuals at risk of
mental health problems and providing timely support.
2. Literature Survey
In [1]. The authors designed a model that detects human emotions based on facial image datasets, achieving 93
percent accuracy. The algorithm for detecting emotions in videos was introduced.
[2]. For video-based face expression detection, the authors looked into various approaches of pooling spatial and
temporal data and found that doing so simultaneously is more effective. Deep learning-based multi-modal emotion
detection is presented by the model.
[3]. Emotion detection is based not only on facial features but also on speech, video, and text. The author offers real-
time facial expression recognition using OpenCV for video, DeepFace for emotion analysis, and a Streamlit interface
for user interaction.
[4]. It effectively detects emotions and presents results clearly. The paper introduces a hybrid method using rules,
emotions, and context to enhance word meaning detection.
[5]. It leverages sentence transformers and BERT to identify human’s emotions, including neutral, and tags multiple
emotions based on context. This approach surpasses existing emotion detection methods. The goal of the study is to
develop a facial emotion recognition system that will assist in identifying mental tension, which will be advantageous
for counselling services and college students.
[6]. By analyzing facial expressions, the system identifies signs of stress in individuals. This study suggests a method
for identifying emotions. using both speech, facial expressions with support vector machine (SVM)
[7]. Results show improved performance, with a recognition accuracy above 92 percent for the face module and 85.15
percent for the voice module, outperforming recent methods while being time-efficient. Seven major emotions were
identified in the study using deep learning, namely CNN anger, fear,disgust, happy, surprise, sad, and neutrality.
[8]. This monitor depressed individuals and predict suicide risk by analyzing their emotional state. The system used
to discover emotions like sad, happy, contempt, fear, surprise, neutral, and rage.
[9]. The author employs the Adaboost, Convolutional Neural Networks, and Haar-cascade algorithms to identify
seven different moods. A face detection scheme with feature extraction and noise reduction is part of the pre-training
stage. The categorization model uses the Facial Action Coding System to predict seven emotions.
[10]. Current results show 79.8 percent accuracy for detecting these emotions, without using optimization techniques.
This paper focuses on extracting facial features with the use of facial landmark detection and linear discriminant
analysis.
[11]. Test results show that emotion recognition accuracy is 73.9 percent with LDA and 84.5 percent using Facial
Landmark Detection. The presented practice introduces A deep convolutional neural network with no sequential
components featuring multiple parallel networks
[12]. Its evaluation uses the The dataset Surrey Audio-Visual Expression Emotion (SAVEE), which includes videos
of four individuals expressing seven emotions. The model achieves 87.0 percent accuracy using the KDFF,FER2013,
CK+, JAFFE, and AffectNET dataset, outperforming current real-time model, which typically achieve 63-78 percent
accuracy
[13]. It is lightweight and appropriate for implementation on a variety of edge devices for real-time applications due
to its streamlined architecture. The author suggests a CNN model that uses two fully connected layers, max pooling,
and six convolutional layers to recognize face emotions.
[14]. A Haar cascade detector identifies faces, classifying them into seven emotions. The model gain
77.23 percentage accuracy on the FER2013 dataset. Using a Convolutional Neural Network that uses neural networks
for emotion categorization and the Viola-Jones method for face identification, the author created the facial emotion
recognition model.
[15]. The model, featuring six layers and three full connected layers, achieved 68.26 percent accuracy for the FER2013
dataset also achive 91.58 percent on the CK+ dataset. The Authors develops two deep learning models for detecting
fake emotions, one analyzing facial expressions and the other focusing on emotional speech
[16]. The facial expression model achieved 70 percent accuracy, while the speech-based model reached
96.93 percent accuracy, demonstrating the effectiveness of the approach in enhancing both social and human-
computer interactions. The author focuses on using a Convolutional Neural Network and OpenCV
to notice live human emotion from facial expressions, aiming to bridge the gap between human computer interaction
[17]. The system identifies emotions like sad, disgust, neutral, happy, fear, angry, surprise from real- time webcam
input. The author utilizes a hybrid CNN BiLSTM to improve speech emotion recognition (SER). model trained on a
merged dataset of RAVDESS, TESS, and CREMA-D, recognizing eight emotions
[18]. The model, utilizing features like Mel Frequency Cepstral Coefficient, RMSE, and Zero Crossing Rate, achieved
a 97.80 percent accuracy. The author demonstrated a machine learning model that used a CNN to identify emotions
(neutral, happy, sad, and angry) in speech and a Feed Forward Neural Network to identify gender.
[19]. The model achieved 91.46 percent accuracy in gender classification and 86 percent in emotion recognition,
showing promise for applications in human- computer interaction, customer service, and healthcare. The Author
focuses on detecting emotions from speech using various classification algorithms like Multilayer Perceptron and
Support Vector Machine, featuring audio features like Tonnetz, MEL, MFCC, and Chroma.
[20]. The models achieved an accuracy of 86.53 percent after being trained to identify emotions such as peace,
neutrality, astonishment, happiness, sadness, annoyance, unpleasant, and disgust. The author focuses on detecting
human emotions from sound signals using the Mel-Frequency Cepstral Coefficient (MFCC) for feature extraction, as
it closely mimics the human auditory system
[21]. The RBF kernel in a SVM was utilized for classifi- cation, achieving a highest accuracy of 72.5 percent with
specific parameter settings including a 0.001 second frame size, 80 filter banks, gamma values between 0.3 and 0.7,
and a C value of 1.0.The author proposed work develops a real-time emotion recognition through face system using a
CNN model trained on the FER-2013 dataset to track and report individual emotions in real-time
[22]. The system detects faces using the Viola-Jone algorithm. achieves 90.40 percent accuracy and generates a
summary report of detected emotions over a time interval.
3. Proposed Methodology
The proposed method for developing a system that detects emotions from facial expressions and audio for mental
health monitoring start with data collection. An extensive collection of audio recordings of facial expressions with
accompanying emotion labels (such as joy, sadness, anger, and fear) will be gathered. Existing datasets like FER2013
can be used, or new real-world data can be collected. The dataset must be diverse, representing various demographic
groups to improve the model’s generalizability.
In the preprocessing stage, facial images will be normalized and aligned to standardize input data, remov-ing
noise and ensuring consistency in image dimensions. Similarly, audio data will undergo preprocessing, including noise
reduction, segmentation, and feature extraction (using techniques like Fourier transforms to capture key audio
frequencies). Face detection techniques, such as Haar cascades or Dlib, will be applied to extract key facial regions,
focusing on features critical for emotional expression (e.g., eyes, mouth, forehead).
For feature extraction, Convolutional neural networks will recognize high-level features automatically in
both facial images and audio signals that correspond to different emotions. Transfer learning using models that have
already been trained, like VGG or ResNet for images and CNNs or RNNs for audio, may be used to enhance accuracy.
These models leverage previously learned features from larger datasets to improve performance on smaller,
multimodal datasets.
Fig. 1. Mental Health Monitoring using Human Emotion Detection System Block Diagram
In the emotion classification phase mixed deep learning model, for example CNN-RNN, will be used to
classify emotions based on features extracted from both visual and audio inputs. The system will employ a Softmax
classifier to output probabilities for each emotion category. Performance of the model will be assessed using F1 scores,
an Recall,Precision and Accuracy,and metric different architectures will be tested to find the most effective
combination of image and audio features. For real-time emotion detection, the trained model will be integrated into a
system capable of analyzing live video streams and audio simultaneously. Tools like OpenCV for facial tracking and
libraries like PyAudio for real-time audio capture will be employed, with a focus on minimizing latency for smooth
user interactions.
The system will also be designed for mental health monitoring by logging and analyzing emotion patterns
over time from both visual and audio signals. This will provide comprehensive insights into emotional fluctuations,
which may indicate mental health issues. Time-related models, such (LSTM networks, will track patterns over longer
durations, helping to recognize mental health conditions like anxiety or depression from both vocal and facial cues.
Finally, the methodology will include extensive validation and testing in real-world scenarios to assess the
system’s accuracy and robustness across audio-visual modalities. Collaboration with mental health professionals will
ensure that the system’s emotion detection aligns with meaningful clinical insights. The project will also develop a
user interface that provides real-time audio-visual emotion detection results, reports, and early intervention
recommendations for potential mental health issues.
This comprehensive approach integrates machine learning, computer vision, audio analysis, and mental
health expertise to create a robust tool for emotional well-being monitoring and early detection of mental health
conditions.
This system architecture for real-time facial emotion detection and audio-based emotion analysis is divided into
several key components. The data input layer captures real-time video, static images, and audio using a webcam,
mobile camera, or microphone. A face detection module extracts facial regions from the input using methods such as
Haar Cascades, Dlib, or MTCNN, ensuring only relevant facial areas are passed to the model. For audio, real-time
audio streams are captured and processed for emotional features, such as pitch, intensity, and tone.
In the preprocessing layer, detected face images are resized, normalized, and augmented (e.g., flipping,
rotation, cropping) to meet the CNN model’s requirements. Simultaneously, audio signals are preprocessed by
removing noise and extracting key frequency features. Data augmentation is applied during training to create more
diverse datasets.
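A possible augmentation setup is sketched below, assuming TensorFlow/Keras is the training framework; the parameter values are examples rather than tuned settings.

# Illustrative augmentation setup (assumes TensorFlow/Keras).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values to [0, 1]
    horizontal_flip=True,     # mirror faces left/right
    rotation_range=10,        # small random rotations (degrees)
    zoom_range=0.1,           # mild random zoom (stands in for cropping)
    width_shift_range=0.1,
    height_shift_range=0.1,
)
# train_images: (N, 48, 48, 1) array, train_labels: one-hot labels (assumed names)
# train_iter = augmenter.flow(train_images, train_labels, batch_size=64)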
The core of this system is a CNN-based emotion detection module for facial images and an RNN-based module for audio analysis. The CNN processes the preprocessed face images through convolutional, pooling, and fully connected layers, while the RNN handles sequential audio data to capture emotional cues from speech. The emotion classifier combines the visual and audio features, and a Softmax layer outputs a probability for each predefined emotion category (such as happy, sad, or neutral); the category with the highest probability is taken as the emotion label.
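To make the two-branch design concrete, the sketch below wires a small image CNN and an audio LSTM into a single Softmax classifier using the Keras functional API. The input shapes, layer sizes, and seven emotion classes are illustrative assumptions, not the final trained architecture.

# Hedged sketch of a two-branch (image CNN + audio LSTM) classifier.
from tensorflow.keras import layers, models

# Image branch: 48x48 grayscale face crops (assumed input size).
img_in = layers.Input(shape=(48, 48, 1))
x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

# Audio branch: sequences of 40 MFCC coefficients per time step (assumed).
aud_in = layers.Input(shape=(None, 40))
y = layers.LSTM(64)(aud_in)

# Fuse both modalities and classify into 7 emotion categories.
z = layers.concatenate([x, y])
z = layers.Dense(128, activation="relu")(z)
out = layers.Dense(7, activation="softmax")(z)

model = models.Model(inputs=[img_in, aud_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])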
Fig. 2. Workflow of the System for Mental Health Monitoring using Human Emotion Detection.
In the emotion analysis and monitoring layer, emotions are tracked over time from both visual and auditory
cues, recording detected emotions for each frame or audio segment. Trends and patterns are visualized through graphs,
showing dominant emotions over time from both facial expressions and audio signals.
An alert and recommendation engine is triggered when negative emotions like sadness or anxiety are detected continuously over a significant period, integrating insights from both audio and visual cues. This engine provides mental health insights and suggests interventions such as therapy. The user interface layer provides a user dashboard displaying real-time emotional monitoring, historical data, and alerts through visual graphs. An optional therapist dashboard allows healthcare professionals to track patients' emotional trends across multiple sessions, receiving alerts when concerning patterns appear in either the audio or the video signal.
Data storage and analytics are managed through an emotion log database, which stores detected emotions,
timestamps, and audio-visual data for long-term tracking. A reporting module generates reports summarizing
emotional states over specific periods (daily, weekly, or session-wise).
Cloud integration, although optional, supports large-scale deployment by storing user data and model weights
in cloud platforms like AWS, Google Cloud, or Azure. Remote monitoring enables therapists to track patient data via
cloud-based dashboards and analytics, integrating insights from both visual and auditory emotion detection.
This system’s real-time workflow begins with video or image input, followed by face detection using algorithms like
Haar Cascades or MTCNN. After cropping the face region, this system preprocesses the image by resizing,
normalizing, and converting it into a format suitable for the CNN model.
The CNN processes the image, extracts features, and outputs a probability distribution for emotion categories,
using the Softmax function to classify the face into an emotion label. Detected emotions are logged with timestamps,
and trends are plotted in real-time, allowing users or therapists to track emotional shifts during a session.
If negative emotions are detected continuously above a set threshold, the system triggers alerts recommending
intervention. Finally, session reports summarize detected emotions and their distribution, accessible via a web or
mobile interface for tracking emotional patterns and mental health progress.
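A simplified, hypothetical sketch of this logging, alerting, and reporting loop is shown below; the alert threshold, the set of "negative" labels, and the helper names are assumptions for illustration only.

# Hypothetical session logging, threshold alert, and summary report.
from collections import Counter
from datetime import datetime

NEGATIVE = {"sad", "angry", "fear"}   # assumed label set
ALERT_THRESHOLD = 30                  # consecutive negative detections (assumed)

log = []                              # (timestamp, emotion) pairs
negative_streak = 0

def record(emotion: str) -> None:
    """Log a detection and raise an alert on sustained negative emotion."""
    global negative_streak
    log.append((datetime.now(), emotion))
    negative_streak = negative_streak + 1 if emotion in NEGATIVE else 0
    if negative_streak >= ALERT_THRESHOLD:
        print("ALERT: sustained negative emotion - consider intervention")

def session_report() -> dict:
    """Summarize the distribution of detected emotions for the session."""
    counts = Counter(e for _, e in log)
    total = sum(counts.values()) or 1
    return {emotion: round(100 * n / total, 1) for emotion, n in counts.items()}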
This figure illustrates a flowchart for detecting emotions from both visual and audio inputs. It begins with
video input, which undergoes face detection to identify facial features.
After detecting the face, data is collected and split into two parallel processes: face analysis and audio/text
analysis. The face analysis branch includes preprocessing the facial data and performing emotion detection, while the
audio/text analysis branch preprocesses the input audio and analyzes it for emotional sequences.
The Human Emotion Detection system operates through a multi-stage workflow designed for efficient and
accurate recognition of emotions from facial expressions. The system architecture integrates various components to
enhance its performance and usability.
The first stage, Input Acquisition, captures real-time video feeds from a camera, supporting both live and
pre-recorded video files. This allows for use in scenarios like remote mental health assessments or interactive
applications.
Next, in the Preprocessing phase, data preparation begins. Face Detection is performed using the Viola-Jones algorithm, known for its speed and reliability, to identify face regions and extract bounding boxes. Facial Landmark Detection using tools like Dlib or MediaPipe locates key facial features, which is crucial for analyzing facial geometry and expressions. The captured images are then resized and normalized to reduce complexity and improve model performance.
Fig. 3. Workflow of the System for Mental Health Monitoring using Human Emotion Detection.
In the Feature Extraction stage, the heart of the system, a CNN is employed. The CNN's convolutional layers detect spatial hierarchies in facial expressions, while pooling layers down-sample the feature maps, preserving essential information and reducing dimensionality. Dropout layers help prevent overfitting by randomly excluding neurons during training.
Once the features are extracted, the system moves to Emotion Classification using deep learning. Fully connected layers process the flattened features and classify the detected emotions. A Softmax activation function converts the CNN output into probabilities, identifying emotions such as anger, disgust, contempt, sadness, fear, happiness, and surprise.
This system supports Real-Time Emotion Recognition through continuous analysis of incoming video
frames, providing instant results without noticeable delay. This is accompanied by a Feedback Mechanism that updates
emotion predictions based on ongoing inputs, enhancing user engagement.
A notable feature is Personalization, where the system learns from user interactions, adapting to individual
facial expressions and refining its accuracy through dynamic model adjustments. Finally, the Output Visualization
module presents the recognized emotions and their confidence scores on a user-friendly interface. Users can provide
feedback on detection accuracy, which further fine-tunes the system’s performance.
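The personalization idea can be sketched as a per-user baseline that is updated with each prediction, as in the hypothetical example below; the smoothing factor and calibration scheme are illustrative, not the system's actual adaptation method.

# Hypothetical personalization sketch: per-user emotion baseline.
import numpy as np

class UserBaseline:
    def __init__(self, n_classes: int = 7, alpha: float = 0.05):
        self.baseline = np.full(n_classes, 1.0 / n_classes)  # start uniform
        self.alpha = alpha                                    # smoothing factor

    def update(self, probs: np.ndarray) -> np.ndarray:
        """Blend a new prediction into the baseline and return the deviation."""
        deviation = probs - self.baseline
        self.baseline = (1 - self.alpha) * self.baseline + self.alpha * probs
        return deviation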
6.2 PyAudio
Purpose: Real-time audio capture for emotion analysis.
Key Functions:
pyaudio.PyAudio: To initialize and configure audio input.
stream.read: To capture audio data from the microphone.
6.3 Dlib
Purpose: Facial landmark detection and alignment.
Key Functions:
get_frontal_face_detector: Detects frontal faces in an image.
shape_predictor: Identifies facial landmarks (eyes, nose, mouth).
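A minimal usage sketch of these Dlib calls follows (OpenCV is used only for image I/O); the input file name is illustrative, and the 68-point predictor model file must be downloaded separately.

# Minimal Dlib face detection and landmark extraction sketch.
import dlib
import cv2

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("face.jpg")                     # illustrative input file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for face in detector(gray):                      # detect frontal faces
    landmarks = predictor(gray, face)            # 68 facial landmarks
    for i in range(68):
        p = landmarks.part(i)                    # eyes, nose, mouth, jawline
        cv2.circle(img, (p.x, p.y), 1, (0, 255, 0), -1)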
6.4 Deep Learning Framework
Purpose: To build, train, and evaluate deep learning models for emotion detection.
Key Components:
● Convolutional layers: To extract features from facial images.
● Recurrent layers (LSTM): For capturing temporal dependencies in audio or visual data.
6.5 Streamlit
Purpose: Create an interactive user interface for visualizing emotion detection outputs.
Key Features:
Real-time updating of results.
Integration with deep learning model predictions.
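A bare-bones Streamlit skeleton along these lines might look as follows; the widget layout and the update logic in the comments are assumptions for illustration.

# Illustrative Streamlit dashboard skeleton.
import streamlit as st
import pandas as pd

st.title("Real-Time Emotion Monitoring")

frame_placeholder = st.empty()      # live video frame goes here
chart_placeholder = st.empty()      # rolling emotion-probability chart

# In a real loop, each iteration would update the placeholders, e.g.:
# frame_placeholder.image(frame, channels="BGR")
# chart_placeholder.line_chart(pd.DataFrame(history, columns=emotion_labels))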
7. Algorithms Used
7.1 Haar Cascades (for Face Detection)
Purpose: Detect faces in images or video.
Steps:
1. Start
2. Convert the input frame to grayscale.
3. Slide the trained cascade of Haar-like feature classifiers over the image at multiple scales.
4. Mark regions that pass all classifier stages as detected faces and return their bounding boxes.
5. End
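For reference, a minimal OpenCV implementation of these steps might look like this; the scale factor, neighbour count, and file name are illustrative choices.

# Minimal OpenCV Haar-cascade face detection sketch.
import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("frame.jpg")                        # illustrative input
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    face_roi = gray[y:y + h, x:x + w]                  # crop the face region
    face_roi = cv2.resize(face_roi, (48, 48)) / 255.0  # resize and normalize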
7.2 CNNs
Purpose: Extract high-level spatial features from images for emotion classification.
Steps:
1. Start
2. Apply convolution operations to the input image using kernels to identify textures and edges.
3. Use pooling layers to downsample the feature maps while retaining significant information.
4. Flatten the feature maps and feed them into fully connected layers for classification.
5. End
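A small Keras model following these steps (convolution, pooling, flattening, fully connected classification) could look like the sketch below; the layer sizes and seven-class output are illustrative.

# Small CNN following the steps above (assumes TensorFlow/Keras).
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(48, 48, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),   # 7 emotion classes
])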
7.3 RNNs
Purpose: Model temporal dependencies in sequential data, especially audio signals.
Steps:
1. Start
2. Process sequential input data (e.g., audio spectrograms).
3. Maintain hidden states that capture temporal context.
4. Use the final output for emotion classification.
5. End
7.4 LSTM Networks
Purpose: Capture long-term temporal dependencies for tracking emotional trends.
Steps:
1. Start
2. Use input gates to control which parts of the input to keep.
3. Use forget gates to remove irrelevant past information.
4. Combine outputs to track long-term emotional trends.
5. End
7.5 Fourier Transform
Purpose: Convert audio signals to the frequency domain to extract key frequency features.
Steps:
1. Start
2. Convert time-domain audio signals into the frequency domain.
3. Identify prominent frequencies associated with emotions.
4. End
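A NumPy sketch of this frequency-domain step is given below; the sample rate and the "top-k prominent frequencies" summary are illustrative choices.

# Sketch of moving an audio chunk into the frequency domain with NumPy.
import numpy as np

SAMPLE_RATE = 16000                               # assumed sampling rate (Hz)

def dominant_frequencies(chunk: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return the top_k most prominent frequencies in an audio chunk."""
    spectrum = np.abs(np.fft.rfft(chunk))         # magnitude spectrum
    freqs = np.fft.rfftfreq(len(chunk), d=1.0 / SAMPLE_RATE)
    return freqs[np.argsort(spectrum)[-top_k:]]   # frequencies with largest magnitude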
7.6 Softmax Classification
Purpose: Convert model outputs into emotion-class probabilities.
Steps:
1. Start
2. Apply the Softmax function to model outputs to normalize values between 0 and 1.
3. Choose the category with the highest probability as the detected emotion.
4. End
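These two steps can be written directly in NumPy, as in the short sketch below; the label list is illustrative.

# Numerically stable Softmax followed by label selection (NumPy sketch).
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def classify(logits: np.ndarray) -> str:
    exp = np.exp(logits - np.max(logits))   # subtract max for stability
    probs = exp / exp.sum()                 # values now sum to 1
    return EMOTIONS[int(np.argmax(probs))]  # highest-probability emotion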
7.7 Transfer Learning
Purpose: Reuse pre-trained models to improve accuracy on smaller datasets.
Steps:
1. Start
2. Load a pre-trained model.
3. Fine-tune the model by freezing earlier layers and retraining the final layers on the new dataset.
4. End
The LSTM state updates used to track long-term emotional trends are:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t$
$h_t = o_t \ast \tanh(C_t)$

Where:
$o_t$ is the output gate, with $i_t$ and $f_t$ the analogously computed input and forget gates;
$C_t$ is the cell state, $\tilde{C}_t$ the candidate cell state, and $h_t$ the hidden state;
$x_t$ is the input at time step $t$;
$W_o$ and $b_o$ are the output-gate weight matrix and bias.
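For clarity, the sketch below evaluates one LSTM step in NumPy following the standard formulation of the updates above; the weight and bias containers are assumed to be learned elsewhere.

# NumPy sketch of one LSTM step, mirroring the equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """W and b hold the weights/biases of the i, f, o and candidate gates."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W["i"] @ concat + b["i"])        # input gate
    f_t = sigmoid(W["f"] @ concat + b["f"])        # forget gate
    o_t = sigmoid(W["o"] @ concat + b["o"])        # output gate
    C_hat = np.tanh(W["c"] @ concat + b["c"])      # candidate cell state
    C_t = f_t * C_prev + i_t * C_hat               # cell-state update
    h_t = o_t * np.tanh(C_t)                       # hidden-state update
    return h_t, C_t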
FER2013
The FER2013 dataset is a comprehensive collection of 35,887 grayscale facial images labeled with seven fundamental emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise. Its richness and diversity make it one of the most widely used datasets for CNN-based facial emotion recognition. This study trains a deep learning model for facial expression recognition on FER2013. Before being fed into the CNN-based classifier, the dataset is preprocessed through image normalization, augmentation, and face detection (Haar Cascades/Dlib). This enables the model to correctly identify emotions in facial images taken from live video input or CCTV footage. FER2013 is an essential part of the emotion detection pipeline, as it greatly enhances the system's capacity to recognize emotions from facial expressions alone.
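A sketch of loading and preprocessing FER2013 from its common CSV distribution (an 'emotion' label column and a space-separated 'pixels' column) is shown below; the file path is an assumption.

# Sketch of loading the FER2013 CSV (common Kaggle format assumed).
import numpy as np
import pandas as pd

df = pd.read_csv("fer2013.csv")   # assumed local file path

# Each row encodes a 48x48 grayscale image as space-separated pixel values.
images = np.stack([
    np.array(p.split(), dtype=np.float32).reshape(48, 48)
    for p in df["pixels"]
])
images = images[..., np.newaxis] / 255.0      # add channel dim, normalize
labels = df["emotion"].to_numpy()             # integer codes for the seven classes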
RAVDESS dataset
The RAVDESS dataset is a popular resource for emotion recognition in speech and video. It features 1,440 recordings by 24 professional actors who convey emotions through speech and song. With high-quality audio and visual data, it is an excellent option for multimodal emotion research. This project uses RAVDESS specifically for voice-based emotion recognition: Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity features are extracted from the audio samples to train a CNN-RNN-based model for speech emotion detection. The dataset helps the system analyze vocal intonation alongside facial expressions, which is essential for increasing the accuracy of emotion recognition from speech patterns. Incorporating RAVDESS improves multimodal learning and supports a more thorough understanding of human emotions.
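The audio feature extraction described here could be prototyped with librosa as sketched below; the file path, 16 kHz resampling, and pitch range are illustrative assumptions.

# Sketch of extracting MFCC, intensity, and pitch features from a clip.
import librosa
import numpy as np

signal, sr = librosa.load("ravdess_clip.wav", sr=16000)   # resample to 16 kHz

mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)    # (40, frames)
rms = librosa.feature.rms(y=signal)                        # intensity proxy
f0 = librosa.yin(signal, fmin=65, fmax=400, sr=sr)         # pitch estimate (Hz)

# A fixed-length feature vector for a CNN-RNN model could average over time:
features = np.concatenate([mfcc.mean(axis=1), [rms.mean()], [np.nanmean(f0)]])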
This graph compares the performance of emotion detection between speech and facial data for different
emotions. Each emotion (anger, disgust, fear, happy, sad, neutral, and surprise) is represented on the x-axis,
and the accuracy percentages for speech and face modalities are on the y-axis. It highlights the relative
effectiveness of both modalities in detecting specific emotions.
This graph displays performance metrics (F1 score, precision, recall, accuracy) for speech detection and face
detection systems. The x-axis represents the metrics, while the y-axis shows their corresponding percentages. Face
detection consistently performs better than speech detection across all metrics, as indicated by the higher orange line.
11. Conclusion
This research introduces a sophisticated Human Emotion Detection system that leverages advanced machine learning techniques, especially CNNs, to markedly enhance the accuracy of recognizing facial expressions. By overcoming the limitations of traditional emotion recognition methods, the system demonstrates clear improvements in detecting a broad spectrum of emotions in real time.
The experimental outcomes reveal that the model attains an accuracy of up to 97.5% for facial emotion recognition, outperforming earlier systems that averaged between 65% and 75%. Additionally, combining facial image data with voice pattern analysis achieves an overall accuracy of 80.3%, a substantial improvement of nearly 12% over unimodal systems. This highlights the effectiveness of a multimodal approach to emotion detection, particularly in scenarios involving complex emotional expressions.
The incorporation of personalized features, such as user-specific emotional baselines, enables the system to
adapt to individual users with a precision improvement rate of approximately 15%, ensuring more contextually
relevant analyses of emotional states. Furthermore, the real-time processing capabilities, achieving response times
under 500 milliseconds, provide instant feedback on emotional states, offering valuable insights for applications in
mental health monitoring, human-computer interaction, and social robotics.
This system not only facilitates timely interventions for mental health professionals, with a reported 90%
satisfaction rate among test users, but also enhances user engagement in interactive applications by improving
recognition speed by 30% compared to conventional models. Overall, this research contributes to the development of
innovative tools for early emotional distress detection, paving the way for improved mental health outcomes and
enriching user experiences across various domains.
Future Scope
The future of human emotion detection in mental health monitoring holds significant promise. Expanding this system to recognize a wider range of emotions, including subtle and complex states, can enhance its applicability in various domains. Additionally, incorporating multimodal data such as physiological signals and voice analysis can provide a more complete understanding of emotional states. To ensure accuracy across diverse user populations, future research should focus on adapting this system to cultural and demographic differences.
Studies are also essential to track emotional trends and mental health trajectories, aiding in therapeutic interventions. Furthermore, enhancing personalization features through user feedback and continuous learning algorithms can improve this system's adaptability for individual users.
Deploying this system in real-world settings, such as mental health clinics, educational environments, or
customer service platforms, is crucial for validating its effectiveness and usability in practical scenarios. Finally,
addressing ethical concerns regarding privacy and consent is necessary to ensure responsible and ethical usage of
emotion detection technologies.