88 Submission-1
Technology
A. B. Desai, Dept. of Computer Science & Engineering, Graphic Era Hill University, Dehradun, India, [email protected]
Neeraj Kumar Pandey, Computer Science & Engg., Graphic Era University, Dehradun, India, [email protected]
Aman Negi, Computer Science and Engineering, Graphic Era Hill University, Dehradun, India, [email protected]
Surabhi Chaubey, Computer Science and Engineering, Graphic Era Hill University, Dehradun, India, [email protected]
Amit Kumar Mishra, Dept. of Computer Science & Engineering, Graphic Era Hill University, Dehradun, India, [email protected]
Anmol Gusain, Computer Science and Engineering, Graphic Era Hill University, Dehradun, India, [email protected]
Abstract- Mental illness has a significant impact on a person's accomplishments, subjective well-being, and physical health. Numerous studies have revealed that the physiological and behavioral signs of people suffering from mental illnesses differ from those of healthy people. Variations in brain activity, galvanic skin response, eye contact, voice, and facial movements are examples of these signs. Facial expressions are the most continuous and accessible nonverbal indicators of mental health. We propose a model to analyze facial expressions from still pictures and videos by collecting data based on a person's emotional state. Neural networks are used to extract features of facial expressions and identify seven distinct expressions: anger, disgust, fear, neutrality, happiness, surprise, and sadness. The accuracy and performance obtained are more satisfactory than those reported in earlier work.
Keywords- Emotion Recognition, Convolutional Neural networks, Depression, Mental health
I. Introduction
Mental disorders, characterized by extremely psychotic conditions, can cause individuals to think, act, and behave abnormally, leading to difficulties in maintaining a connection with reality and functioning effectively in daily life. Various mental disorders, including depression, schizophrenia, bipolar disorder, dementia, psychoses, and formative disorders, such as chemical imbalances, exemplify these conditions. Depression, a prevalent mental disorder and a leading cause of disability worldwide, poses significant challenges in diagnosis and treatment due to the current shortage of psychiatrists. Objective physiological markers for depression diagnosis are currently unavailable, and its underlying causes remain unclear.
The field of deep learning has garnered increasing interest among researchers, given its rapid development in recent years. Deep learning, a subset of machine learning, employs algorithms based on artificial neural networks to analyze data representations. It surpasses shallow models in feature extraction and model fitting, excelling at capturing abstract distributed function representations with excellent generalization capabilities. Deep learning has demonstrated the potential to address previously challenging problems [1].
In the domain of user behavior and emotional state classification, various approaches have been proposed. Kyogu Lee focused on utilizing facial muscle movements, while Minsu Cho aimed to recognize Action Units (AU) based on facial features [2]. Convolutional Neural Networks (CNNs) have gained popularity in emotion recognition, accompanied by evolving methodologies.
Numerous imaging studies have investigated the differences in brain activation between natural viewing (without explicit emotion regulation) and re-appraisal, as well as between natural viewing and suppression. Additional investigations have compared different emotion regulation strategies, specifically re-appraisal versus suppression, to gain insights into their impact on brain activity and subjective emotional experiences [3].
Emotions represent short-lived physical responses to an individual's mood or behavior, and they offer potential avenues for research. Kassam KS assessed the effectiveness of multiple approaches on the same data and found that DBN's integration of two-dimensional and three-dimensional features yielded globally applicable and efficient
results [4]. Several studies have explored the use of facial expressions for depression detection. Notably, Benyoussef Abdellaoui conducted a significant study [5][6], collecting a dataset of facial expressions from 56 depressed and 56 healthy participants and employing machine learning algorithms for classification. The results demonstrated high accuracy in distinguishing between depressed and healthy individuals based on facial expressions [7].
II. Methodology
The primary emphasis lies in the exploration of how convolutional neural networks (CNNs) can be effectively utilized in the realms of deep learning, machine learning, and emotion identification. In the context of machine learning, the objective is to enable machines to generate accurate predictions by leveraging various techniques and approaches.
Deep learning, as a technique, trains systems to perform tasks in a manner akin to human learning through experiential knowledge. It finds extensive application in categorization tasks encompassing images, text, and sound, often surpassing human accuracy levels. The training procedure involves employing labeled data and constructing multi-layered neural network architectures.
One prominent feature-extraction technique used in this context is the Histogram of Oriented Gradients (HOG), which aids object detection in computer vision and image processing. Like the Canny edge detector, HOG relies on image gradients: it analyzes the distribution of gradient orientations within localized image segments, generating histograms of edge orientation. By considering both gradient magnitude and orientation, the HOG descriptor effectively captures an object's shape and appearance, distinguishing itself from other edge descriptors.
By computing histograms based on the magnitude and orientation of the gradient, visual features are generated [8]. In the context of image processing, a region of interest (ROI) denotes a specific area that one may wish to modify or filter, often represented as a binary mask image where pixels inside the ROI are marked as 1, while those outside are marked as 0. ROIs play a significant role in the development of facial recognition systems, enabling the detection of 68 crucial facial landmarks. Using these detectors, facial attributes like the mouth, eyebrows, and eyes can be extracted, allowing for the measurement of distances and the identification of expressions such as smiles or surprise. The system can leverage FER+ measurements or its own heuristics to make predictions about various emotions, including happiness or sadness [8].
The Long Short-Term Memory Network (LSTM), a type of recurrent neural network (RNN), specifically addresses the challenges associated with handling long-term dependencies in sequential data. RNNs are constructed as a series of interconnected neural network modules designed for sequential input processing. Typically, each module comprises a single tanh layer, with the output of one module serving as input to the next. However, RNNs often encounter difficulties with long-term dependencies due to the vanishing gradient problem. To overcome this obstacle, the LSTM architecture incorporates specialized memory cells and gating mechanisms that enable the network to selectively retain and propagate information over extended time intervals [2].
Convolutional Neural Networks (CNNs) are a widely employed and powerful deep learning technique extensively utilized in image processing and various deep learning tasks [9]. These networks are composed of fundamental components such as convolution layers, pooling layers, and fully connected layers [9]. By leveraging the backpropagation algorithm, CNNs can autonomously learn hierarchical spatial features, allowing them to emulate human brain activities when analyzing images [10]. Given the computational demands and complexity associated with CNNs, optimizing the network for efficient computation becomes crucial. One notable application of CNNs is in emotion classification, where well-constructed CNN models can effectively classify different emotional states based on image inputs.
On the other hand, Fully Convolutional Networks (FCNs) represent a specific architecture primarily employed for semantic segmentation tasks [11]. FCNs exclusively utilize locally connected layers for pooling, convolution, and up-sampling, resulting in a more parameter-efficient training approach [12]. Additionally, the local connections in FCNs enable them to handle inputs of varying dimensions. To recover fine-grained spatial information that might be lost during the down-sampling process, FCNs employ skip connections. The network consists of a down-sampling path for feature retrieval and an up-sampling path for context localization and interpretation [13]. An advanced attention mechanism, comprising an activation layer, a feature input layer, a convolution layer, and a full connection layer, has been integrated into this FCN model, enhancing its performance and capabilities.
A Deep Neural Network (DNN), often known as Deep Nets, represents a sophisticated type of neural network characterized by a stacked structure comprising multiple layers, including at least one hidden layer situated between the input and output layers. DNNs are extensively employed to handle unstructured and unlabeled data, making them widely regarded as the industry standard for tackling diverse computer vision tasks.
VGG (Visual Geometry Group) refers to a deep CNN architecture renowned for its numerous layers [14]. The term "deep" in this context signifies the substantial number of layers, such as VGG-16 (16 weight layers) or VGG-19 (19 weight layers). The VGG architecture serves as a foundational framework for advanced object recognition models and has achieved remarkable performance on various datasets and tasks, extending beyond ImageNet [15]. VGGNet, a deep neural network, has consistently surpassed benchmark performances, solidifying its significance in the field.
The Inception v3 model for image recognition has established itself as one of the most widely employed architectures in the field, demonstrating its popularity among researchers and practitioners. It has been shown to achieve an accuracy exceeding 78.1% on the ImageNet dataset [16]. This model is a culmination of diverse ideas proposed by multiple researchers, primarily drawing inspiration from the influential paper titled "Rethinking the Inception Architecture for Computer Vision". Its design incorporates a combination of symmetric and asymmetric building blocks, including convolutions, concatenations, average pooling, max pooling, dropouts, and fully connected layers [17]. Throughout the model, batch normalization is extensively utilized to normalize activation inputs. Furthermore, the loss computation in this model employs SoftMax [14]. On the other hand, ResNet-50 stands out as a deep CNN with a 50-layer structure that excels in accurately classifying images into 1000 different object classes. Its performance can be attributed to its pre-training on a vast collection of over a million images from the ImageNet database, enabling it to acquire detailed feature representations for a wide array of visual data. Notably, the network takes input images of size 224 by 224 [18].
A. Dataset Description:
The dataset used for model training is FER2013, sourced from the Kaggle Facial Expression Recognition Challenge. This dataset comprises grayscale face images with dimensions of 48x48 pixels. The images are classified into seven distinct emotion categories, namely anger, surprise, disgust, happiness, fear, sadness, and neutrality. The dataset contains a total of 26,217 images, with the majority being happy, sad, and neutral emotions [19]. The FER+ annotations offer new labels that have been assigned by ten crowd-sourced taggers, providing higher-quality ground truth for still image emotions than the original FER labels. Using these annotations, researchers can calculate the emotion probability distribution for each face, allowing for the generation of statistical distributions or multi-label outputs [20].
B. Model Description
A convolutional neural network (CNN) with multiple layers is used to analyze the distinctive features present in the input image. The architecture of this network consists of an input layer, followed by many convolutional layers, pooling layers, ReLU activation layers, fully connected layers, and an output layer. These layers are arranged in a linear stack.
Facial expression identification using Convolutional Neural Networks (CNN) is a widely used approach in computer vision. The primary aim of the CNN model is to accurately recognize the emotional state conveyed by a person's face [21]. To achieve this goal, the FER dataset, containing grayscale facial images labeled with one of the seven emotions (happiness, disgust, anger, fear, sadness, surprise, and neutral), is commonly used [22]. The CNN model for facial expression recognition comprises convolutional, pooling, and fully connected layers.
The model takes a grayscale image of size 48x48 pixels as its input, with the first layer being a convolutional layer that extracts features from the input image. The number of filters in the convolutional layer can be adjusted depending on the complexity of the problem [23]. To
increase the model's translation invariance and reduce the size of the feature maps, a pooling layer is added after the convolutional layer [24], [25].
To extract more complicated characteristics, more convolutional and pooling layers are then added.
The initial layer of the model is the input layer, which has a fixed size. Prior to feeding the image into the model, a preprocessing step is performed, involving face detection using OpenCV. Haar cascades, combined with AdaBoost, are employed to swiftly identify and resize the face. The resulting face image is then converted to grayscale and resized to 48 by 48 pixels. This preprocessing stage reduces the image dimensions from RGB format (3, 48, 48) to grayscale format (1, 48, 48), making it easier to pass as a NumPy array to the input layer.
Convolutional Layers
The Convolution2D layer incorporates a group of distinct kernels, which are initialized with randomly generated weights, the initialization being one of its hyperparameters [10]. Each feature detector has a receptive field size of (3, 3) and scans across the original image to generate a corresponding feature map. These feature maps are generated in collections. In our specific neural network, each convolutional layer produces 256 feature maps [2].
Between the convolutional layers, the ReLU (Rectified Linear Unit) activation function was applied. The dimensionality of each feature map is reduced using MaxPooling, a popular pooling method.
MaxPooling considers (2, 2) windows and retains only the maximum pixel value inside each window of the feature map. This helps in retaining important information while reducing the dimensionality. After pooling, the pixel values form a new image whose size is reduced by a factor of four.
In the deep layers of the network, the convolutional and pooling layers extract meaningful features from the input image. These features are then utilized by the dense layer to classify the image into distinct categories. These layers consist of trainable weights that transform the extracted features. The training process involves forward propagation of the training data and backward propagation of the errors to adjust the model. The proposed model comprises two fully connected layers connected in sequence. It demonstrates the ability to generalize properly to new images and repeatedly adjusts its parameters until errors are minimized. To prevent overfitting, a dropout of 20% was applied, reducing the model's sensitivity to noise during training while retaining a suitable level of complexity within the structure [26].
The output layer of our implemented model is designed as a deep
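The convolution, ReLU, and (2, 2) max-pooling steps described in this section can be illustrated on a toy input. The following is a minimal pure-Python sketch and not the authors' implementation: the 6x6 input, the single vertical-edge kernel, and all function names are assumptions chosen for illustration, whereas the model described here takes 48x48 inputs and produces 256 feature maps per convolutional layer.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image (stride 1, no padding) to build one feature map.

    As in most deep learning libraries, this is cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map: Matrix) -> Matrix:
    """Replace negative activations with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map: Matrix) -> Matrix:
    """Keep only the maximum of each non-overlapping 2x2 window."""
    h = len(feature_map) // 2 * 2
    w = len(feature_map[0]) // 2 * 2
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


# Hypothetical 6x6 grayscale patch and a 3x3 vertical-edge kernel (illustrative values).
image = [[float((r * 6 + c) % 7) for c in range(6)] for r in range(6)]
kernel = [[1.0, 0.0, -1.0]] * 3

feature_map = relu(conv2d_valid(image, kernel))  # 6x6 input -> 4x4 feature map
pooled = max_pool_2x2(feature_map)               # 4x4 -> 2x2, four times fewer values
```

Each pooled value summarizes four activations, which is the factor-of-four reduction mentioned above. A real network would learn many kernels per layer rather than use one fixed kernel.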
an application with a more fluid interface. The application includes several screens, which include ones for clicking images of users' facial expressions and presenting the results. The application receives input in the form of
Flutter was used to create the application.
representation that uses a horizontal axis to show the false alarm probability and a vertical axis to represent the hit probability. The curve was generated based on the results obtained using different stimulus conditions and judgment criteria.
score of the machine learning classifier. Subsequently, the false positive rate (FPR) and true positive rate (TPR) were computed as shown in the
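The false positive rate and true positive rate mentioned above are typically obtained by sweeping a decision threshold over the classifier scores. The sketch below shows one common way to do this. The scores, labels, and the function name `roc_points` are hypothetical and are not taken from the paper's experiments, and the code assumes both classes are present.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs: the (0, 0) origin plus one point per distinct threshold.

    A sample is predicted positive when its score is >= the threshold.
    Assumes labels contain at least one 1 and at least one 0.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical classifier scores and ground-truth labels (1 = positive class).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
curve = roc_points(scores, labels)
```

Plotting the first coordinate (false alarm probability) on the horizontal axis against the second (hit probability) on the vertical axis gives the curve described in the text.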
Loss Function:
In machine learning, a loss function, also referred to as a cost function, is used to assign a non-negative real number to the values of a random event or its associated random variables. This number represents the level of "risk" or "loss" associated with the random event.
The classifier's performance is often assessed using accuracy, which is computed as the number of correctly classified samples divided by the total number of samples in the test dataset. The accuracy metric measures the proportion of correct predictions relative to the entire sample size.
Precision is determined by dividing the number of true positive predictions by the total number of positive predictions, as shown in equation (7).
Recall = TP / (TP + FN) (9)
F1 is the harmonic mean of the precision and recall rates, which combines the two measures into a single score, as shown in equation (10).
F1 = 2TP / (2TP + FP + FN) (10)
Weight Calculation
Different expressions are assigned different weights according to their accuracy rates, so that landmarks with higher recognition rates can achieve greater accuracy. For the same expression, Fig. 2 (a) and (b) show the calculation of the proportion of the recognition rates of different landmarks, where m = Gabor, LBP, UG, MC.
Fig. 2 (a) Proportion of landmark recognition rates
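The accuracy, precision, recall, and F1 measures discussed in this section can all be computed from confusion-matrix counts. The following is a minimal sketch with made-up counts. The function name and the example numbers are assumptions for illustration, and for the seven-class emotion task the counts would be taken per class in a one-vs-rest manner.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Same form as equation (10): the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for one emotion class (not results from this paper).
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts, precision is 40/50 and recall is 40/60, and the F1 value lies between the two, closer to the smaller one, as expected of a harmonic mean.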
Fig. 3. Expressions using multiple subjects
"Journal of Affective Disorders," "Journal of Medical Internet Research," and "PLOS ONE" are all prominent
III. Conclusion
The findings from these investigations underscore the promising potential of employing facial expression analysis as a valuable instrument for the identification and continuous evaluation of mental health disorders, including Major Depressive Disorder (MDD),
anxiety disorders, and fluctuations in mental well-being. The proposed model combines facial expression analysis with machine learning methods to yield meaningful insights and enrich the domain of mental health evaluation. Notably, the attained expression rates demonstrate higher accuracy and performance than previous work, with an accuracy rate of 82.5%.
IV. Future Scope
Facial expression detection systems have the potential to change mental health care substantially as the technology advances, and their potential impact on mental health outcomes is significant. Advancing real-time emotion recognition requires attention to two areas: fine-tuning the Convolutional Neural Network (CNN) architecture through careful adjustment of parameters, learning rates, dropout rates, and stride sizes; and adapting datasets to emulate real-time scenarios, including challenging conditions such as low lighting and noisy backgrounds. An important consideration is aligning the distribution of the training datasets with the characteristics of real-time subjects. This alignment is essential for the reliability of the system. Furthermore, effort must be directed towards improving system resilience in uncontrolled settings. Better calibration of the CNN architecture has the potential to improve system performance. The advantages of facial expression detection include objective, unobtrusive tracking, facilitation of diagnosis, ongoing progress monitoring, and the tailoring of personalized treatment strategies for mental health practitioners.
Acknowledgements
The authors would like to acknowledge and express a deep sense of gratitude to Graphic Era Hill University for its support, assistance in extending the infrastructure and resources, and constant encouragement to carry out this work.
References
1. X. Tan, J. Wu, X. Ma, S. Kang, et al., "Convolutional Neural Networks for Classification of T2DM Cognitive Impairment Based on Whole Brain Structural Features," Frontiers in Neuroscience, vol. 16, 2022.
2. B. Abdellaoui, A. Moumen, Y. E. B. El Idrissi, and A. Remaida, "The emotional state through visual expression, auditory expression and physiological representation," SHS Web of Conferences, 3rd International Conference on Quantitative and Qualitative Methods for Social Sciences (QQR'21), vol. 119, 2021.
3. Y. Y. Ghadi, A. A. Rafique, T. al Shloul, S. A. Alsuhibany, A. Jalal, and J. Park, "Robust Object Categorization and Scene Classification over Remote Sensing Images via Features Fusion and Fully Convolutional Network," Remote Sensing, vol. 14, no. 4, 2022.
4. P. Bobade and M. Vani, "Stress Detection with Machine Learning and Deep Learning using Multimodal Physiological Data," 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2020, pp. 51-57, doi: 10.1109/ICIRCA48905.2020.9183244.
5. U. Kose, O. Deperlioglu, J. Alzubi, and B. Patrut, "Deep Learning for Medical Decision Support Systems," Springer Science and Business Media LLC, 2021.
6. T. Gorasiya, A. Gore, D. Ingale, and M. Trivedi, "Music Recommendation based on Facial Expression using Deep Learning," in Proceedings of the 2022 7th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 2022.
7. Devaiah K N and Anita H.B, "Classification of Architectural Designs using Deep Learning," International Journal of Engineering and Advanced Technology (IJEAT), vol. 9, Issue-3, 2020.
8. K. Sahib, A. Melouah, F. Touré, and A. Slim, "W-net and inception residual network for skin lesion segmentation and classification," Applied Intelligence, vol. 51, no. 9, pp. 1-19, Sep. 2021.
9. S. Gilda, H. Zafar, C. Soni, and K. Waghurdekar, "Smart music player integrating facial emotion recognition and music mood recommendation," in Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, pp. 154-158, 2017.
10. R. Nijhawan, N. Sule, M. Verma, B. Sharma, and I. Bansal, "Analysis of Coloboma Defected Eyes using Automated Pre-trained CNN Models," in Proceedings of the 2022 3rd International Conference on Computation, Automation and Knowledge Management (ICCAKM), Pune, India, pp. 1-5, 2022.
11. A. B. Desai, D. R. Gangodkar, B. Pant, and K. Pant, "Comparative Analysis using Transfer Learning Models VGG16, Resnet 50 and Xception to Predict Pneumonia," in Proceedings of the 2022 2nd International Conference on Innovative Sustainable Computational Technologies (CISCT), Dehradun, India, pp. 1-6, 2022, doi: 10.1109/CISCT55310.2022.10046507.
12. A. Demir, F. Yilmaz, and O. Kose, "Early detection of skin cancer using deep learning architectures: Resnet-101 and Inception-v3," in Proceedings of the 2019 Medical Technologies Congress (TIPTEKNO), Istanbul, Turkey, pp. 1-4, 2019.
13. N. Kausar, A. Hameed, M. Sattar, R. Ashraf, A. S. Imran, M. Z. ul Abidin, and A. Ali, "Multiclass Skin Cancer Classification Using Ensemble of Fine-Tuned Deep Learning Models," Applied Sciences, vol. 11, no. 17, 2021.
14. Y. T. Jo, S. W. Joo, S. H. Shon, H. Kim, Y. Kim, and J. Lee, "Diagnosing schizophrenia with network analysis and a machine learning method," International Journal of Methods in Psychiatric Research, vol. 29, no. 1, 2020.
15. J. Zhang, X. Yang, W. Li, S. Zhang, and Y. Jia, "Automatic detection of moisture damages in asphalt pavements from GPR data with deep CNN and IRS method," Automation in Construction, vol. 113, pp. 103119, 2020.
16. "Artificial Neural Networks and Machine Learning," ICANN 2016, Springer Nature, 2016.
17. S. Srinivasagopalan, J. Barry, V. Gurupur, and S. Thankachan, "A deep learning approach for diagnosing schizophrenic patients," Journal of Experimental & Theoretical Artificial Intelligence, vol. 31, no. 6, pp. 803-816, 2019.
18. V. Atliha, "Improving image captioning methods using machine learning approaches," M.S. thesis, Vilnius Gediminas Technical University, 2023.
19. M. Niu, Z. Zhao, J. Tao, Y. Li, and B. W. Schuller, "Selective Element and Two Orders Vectorization Networks for Automatic Depression Severity Diagnosis via Facial Changes," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 8065-8077, Nov. 2022, doi: 10.1109/TCSVT.2022.3182658.
20. S. Kang and S. K. Kim, "Game Outlier Behavior Detection System Based on Dynamic," in CMES-Computer Modeling in Engineering & Sciences, 2021.
21. R. C. Borges, "Audio-based coldstart in music recommendation systems," M.S. thesis, Universidade de Sao Paulo, Agencia USP de Gestao da Informacao Academica (AGUIA), 2022.
22. R. Szeliski, "Computer Vision: Algorithms and Applications," Springer Science and Business Media LLC, 2017.
23. Y. Yang, "A Recursive Least Squares Training Approach for Convolutional Neural Networks," M.S. thesis, Colorado State University, 2022.
24. A. E. Tate, R. C. McCabe, H. Larsson, S. Lundström, P. Lichtenstein, and R. Kuja-Halkola, "Predicting mental health problems in adolescence using machine learning techniques," PLoS One, vol. 15, no. 4, Article ID e0230389, 2020.
25. Jetli Chung and Jason Teo, "Mental Health Prediction Using Machine Learning: Taxonomy, Applications, and Challenges," 2022.