
Microprocessors and Microsystems 95 (2022) 104681


Detection of stress, anxiety and depression (SAD) in video surveillance using ResNet-101
Astha Singh *, Divya Kumar
Department of Computer Science, Motilal Nehru National Institute of Technology Allahabad, India

A R T I C L E  I N F O

Keywords:
Stress
Anxiety
Depression
Kanade-Lucas-Tomasi
ResNet 101
Facial feature extraction
Kalman filter
Video based analysis

A B S T R A C T

Emotional disruptions are associated with the psychological state of a person and come out in the form of non-verbal signals. The usage of medical resources for the identification of emotional activities is a complex and expensive task. Computer vision techniques equipped with artificial intelligence are capable of bringing automatic and fast identification of emotional variations of the human mind. Emotional variations may contain overlapping stages in which multiple non-separable emotional symptoms are more difficult to classify. The objective is to draw up an investigation of such non-verbal body signals and correlate them with the psychological state of the person. Artificial intelligence techniques explore the identification of psychological states using pixel intensity information from datasets of facial expressions. The proposed study explores the classification of emotional symptoms into stress, anxiety and depression from facial expressions in a real-time video surveillance dataset. The second objective of the proposed study is to maintain classification accuracy under the variations of real-time noise that may distort feature information. The study exhibits the use of the Kalman filter for the localization of intensity-based features and the use of the bilateral filter, contrast enhancement and adaptive filter algorithms for the removal of noise. Finally, the ResNet 101 architecture has been used to classify symptoms of stress, anxiety and depression. The robustness of the proposed classification algorithm has been compared with other algorithms, such as PCA, the gradient boosting algorithm, KNN, decision tree, naïve Bayes, and SVM. It has been observed that ResNet 101 outperformed the other models with a notable 98.4% accuracy.

1. Introduction

Recognition of human non-verbal behavior has been studied widely in various scenarios. The challenging study includes investigating negative non-verbal patterns and justifying their reasonable cause. Stress, anxiety, and depression are severe psychological issues arising from events in a person's life. These mental issues are very complex to investigate and treat. Medical science has evaluated various psycho tests such as the Beck depression inventory [1], the Hamilton rating scale [2], the Raskin depression rating [3], the Barthel index score [4], etc. All these tests require the patient's cooperation in conducting their non-verbal signals. In such a case, getting the patient's involvement is complex. Generally, stressed or depressed people show unwillingness to participate in any interrogatory psycho test. Therefore, the disease is mostly ignored and converts into high-order depression. The number of suicidal cases is increasing due to delays in recognizing psychological disorders. A psychological disorder is complex, and its investigation takes much time. Among psychological disorders, anxiety is the early stage, generated during day-to-day adverse events. This creates negative non-verbal signals [5] in a person's facial expression. These temporary signals cause a volatile psychological disorder from which a person can recover soon. The next psychological disorder stage is stress, which is long-term anxiety caused by various negative thoughts. A series of anxiety episodes over multiple days can convert into stress that stays long-term. The subsequent and most extreme psychological disorder is depression, which lasts long and is a non-volatile

Abbreviations: AI, artificial intelligence; KNN, K-nearest neighbors algorithm; PCA, principal component analysis; CNN, convolutional neural network; SVM, support vector machine; VGG, visual geometry group; SAD, stress anxiety depression; ResNet, residual neural network; LPQ-TOP, local phase quantization from three orthogonal planes; AVEC, audio/visual emotion challenge; RNN, recurrent neural network; HDRS, Hamilton depression rating scale; LR, linear regression; DASS-42, depression anxiety and stress scale-42; ROC, receiver operating characteristic; KLT, Kanade–Lucas–Tomasi; IEMOCAP, interactive emotional dyadic motion capture; FFT, fast Fourier transform.
* Corresponding author.
E-mail addresses: [email protected] (A. Singh), [email protected] (D. Kumar).

https://doi.org/10.1016/j.micpro.2022.104681
Received 29 April 2022; Received in revised form 8 August 2022; Accepted 6 September 2022
Available online 15 September 2022
0141-9331/© 2022 Elsevier B.V. All rights reserved.

Fig. 1. General architecture of the depression detection scheme.

psychological disturbance. This stage can occur when multiple stress factors combine and a person starts overthinking. These three stages of mental disorder are challenging to recognize accurately. These stages of cognitive disorder produce symptoms in which a negative facial expression and body gesture/posture can easily be noticed. Earlier studies investigated facial expression and body movement using machine learning/deep learning to recognize mental disorders. Despite these earlier studies, it is still challenging to accurately identify and distinguish between stress, anxiety, and depression. An acute psychological disorder is a mental disorder that remains for a short period. A chronic psychological disorder [6] is the type in which stress remains for an extended period; it occurs due to a series of continuous minor stresses and failures, and it makes a person think more deeply for an extended period.

The current study is based on feature analysis from images and visuals in which non-verbal signs and expressions are classified using machine learning/deep learning techniques. The proposed method detects stress, anxiety, and depression from recorded video surveillance in which a person's activities are recorded discreetly, without the patient's acknowledgment. This system relies on simpler user input and expressions. The proposed model performs feature extraction from the captured video clips and explores the non-verbal signs and movements. The proposed system uses a deep neural network model for feature training and testing.

This study aims to leverage contact-free video cameras for disorder detection. The study will find the facial signs and expressions that can provide insights into the identification of stress, anxiety, and depression, symptoms that are usually linked with fluctuations in physiological disorders and physical activities. The study will focus on extracting facial signs such as mouth activity, head motion, heart rate, blink rate, spatial gaze distribution, pupil dilation, and eye movements from different facial regions, or on using the Facial Action Coding System [7] and extracted Action Units from the face frames for stress detection. The proposed study leverages and integrates user action cues to enhance video-based stress detection.

A further objective is to demodulate audio signals from a video clip if the video data is inadequate to distinguish between stress, depression, and anxiety. The demodulated audio will be separately analyzed at the feature level to determine whether a person has stress, depression, or anxiety. In earlier times, computer vision-based applied stress analysis was widely adopted as it is fast and less expensive than any psychological test in medical terms. Depression detection started with emotion classification into anger, sadness, happiness, etc., from facial expressions. The feature analysis was measured under various pre-processing and machine learning techniques to classify a person's emotional subject. Wen et al. [8] proposed depression detection using a sequence of facial images divided into 60 facial regions from which LPQ-TOP features are extracted. The components were trained using a support vector regression (SVR) model to obtain depression recognition accurately. The cognitive symptoms of an individual are those that can be recognized easily. They include changing behavior, shouting, irregular body movement, facial expression, improper talking, etc. These symptoms are easily captured from the outside. The features of these symptoms may overlap as they are very closely related to psychological activities. Stress, depression, and anxiety features resemble each other as their psychological activities are correlated. Therefore, the severity level must be analyzed to distinguish between stress, depression, and anxiety. The level of severity can be measured through various tools.

The outline of the chapter scheme is as follows: Section 2 discusses the related work in this field. Section 3 contains the methodology used in the proposed system. Section 4 shows the results achieved by the proposed system. The conclusion of the entire study is presented in Section 5. The last section is reserved for the references taken from various studies.


Table 1
Distribution of categories of facial feature types with stress or depression.

Head: head nod; color of skin; facial PPG
Eyes: eye blink; movement of eyebrows; eyelid variation
Mouth: movement of mouth; teeth depression; swallow rate
Gaze: direction of gaze; sharp gaze; low gaze
Pupil: variation of the size of the pupil; pupil movement; other variation

2. Related work

Cohn et al. [1] introduced an automatic assessment of depression using feature analysis from facial images. The study achieved 79% accuracy in distinguishing between depressed and non-depressed classes. The study was carried out on the Pittsburgh static image dataset, in which an active appearance model and a support vector machine were applied to classify depressed and non-depressed people. The experiment was not robust, as it neither justifies the stress level nor can it distinguish between short-term and long-term disorders. Alghowinem et al. [2] performed stress detection on the BlackDog dataset [6], in which a collection of facial images with various expressions is stored. This study was applied over 128 images, from which statistical features such as eye movement, eyelid changes, etc., were studied under an SVM classifier. The study achieved about 88.3% accuracy in recognizing stress over the BlackDog dataset [6]. The model was further tested over the AVEC dataset, where images were analyzed based on visual geometry group (VGG) features, and obtained an 87.4% F1 score.

Wingenbach et al. [3] implemented three levels of emotional expression detection on a video-based dataset called ADFES-BIV (Amsterdam Dynamic Facial Expression Set, Bath Intensity Variations). The emotional expressions were encoded from low to high stress intensity, in which six emotions, such as anger, embarrassment, etc., were studied at three different intensity levels of stress. This study achieved about 69% accuracy in recognizing the intensity level of stress on the ADFES-BIV dataset. These three intensity levels are stratified into low-order stress, intermediate-order stress, and high-order stress based on emotional expression from facial images. Sonmez et al. [4] took up the challenge of proposing a classification model on the ADFES-BIV dataset, in which the classification was performed based on a sparse representation of features, considering local information on the facial data. The study was able to achieve 80% accuracy in recognizing those three intensity levels of stress.

Deep learning-based convolutional neural networks (CNNs) have frequently been applied to extract features from video-based facial datasets for depression analysis. Al Jazaery et al. [9] employed a recurrent neural network (RNN) to represent features of video-based input in a deep neural network for better, high-quality depression detection. The RNN is used to obtain the temporal information encoded in the feature-space sequence. Zhou et al. [10] introduced a neural network for a visual dataset to learn depression-related, rich features from facial expressions. The methodology identifies salient regions of the image and transforms them into histogram plots to study the variation of pixels and relate it to changing expressions. Jan et al. [11] proposed a histogram technique to represent the feature space in a histogram and draw an analysis of the variation of features from input facial images. This technique used the partial least squares (PLS) method and the linear regression (LR) model to conduct a depression detection experiment. The method was applied over the AVEC 2014 [11] video-based dataset. Giannakakis et al. [12] introduced a methodology for facial cues to classify emotional stages by observing a person's eye, mouth, and head movement. This work implemented a model that distinguishes between short-term and

Fig. 2. Various feature extraction on facial expression.


long-term anxiety disorders. It also performs the detection of depression separately. The methodology was robust but failed to maintain a better accuracy rate on a different dataset. A study conducted by Cohn et al. [13] shows the extraction of mid-level facial features, i.e., facial action units (FAUs) [14], for depression analysis. The method shows excellent recognition over the AVEC 2016 video-based dataset. Fig. 1 shows the general architecture of the depression detection scheme.

The general architecture shown in Fig. 1 is prevalent in the earlier studies, where mostly machine learning and deep learning algorithms are applied to obtain a classification from input facial images. Fig. 1 shows the basic flow of the methodology for recognizing emotions from facial activity, including swallow rate, eye movement, etc. Table 1 offers various facial features based on which emotional states can be classified. The action units are stored in terms of feature information. Each action unit contains patient actions such as expression, eye movement, body gesture, etc. These action units are analyzed and mapped with psychological events to conduct a classification of emotion categories. Emotion processing involves various facial features and the noted expressions that connect with stress/depression. The different variations of facial parts have been described in Table 1. The movements of the head, eyes, mouth, gaze, and pupil may reveal various measurements of thoughts and are connected to emotional variations.

Fig. 2 explains the extraction and correlation of emotional features with the psychological states of a person. The features contain facial geometry, texture, local binary information, Haar points, etc. These features are encoded into various facial action units, and the correlation will be established with emotional states. Table 2 contains the description and methodology used in multiple recent research works.

Table 2
Stress/depression recognition using various scaling methodologies and the related models.

[15] Description: Support system for a student's life under stress. Methodology: Psychological assessment.
[16] Description: Examines exposure to stressors among student-employees. Methodology: Stress level checked under various parameters like time pressure, social mistreatment, friendship problems, academic alienation, etc.
[17] Description: Study of stress exposure in fresh students, especially females enrolled in marriage under family pressure. Methodology: Sample-based survey on stress level in graduate female students.
[18] Description: Explored the impact of financial, emotional, and academic pressure on graduate student psychology that leads to stress problems. Methodology: Psychology student stress questionnaire.
[19] Description: Explored the organizational role in building stress in women. Methodology: Organizational role stress scale.
[20] Description: Facial expression tracked using features like smooth energy, MFCC, mean, standard deviation, and geometry of the face. Methodology: Classification algorithms like KNN, SVM and GMM applied to classify FAUs; 92% accuracy obtained.
[20] Description: GSR and ECG accelerometer data sources used to extract spectral and time-domain features. Methodology: Decision tree, naïve Bayes and SVM algorithms applied; 92.4% recognition accuracy obtained.
[21] Description: Questionnaire and skin conductance data sources used to extract features like mean, standard deviation and mobility radius. Methodology: SVM applied with radial basis function and linear kernels for the classification; the study also tested PCA and KNN, and obtained 75% recognition accuracy.
[22] Description: Stress detection performed using speech signal processing; BVP, ST and GSR data sources used, from which features like mean IBI, BV amplitude, GSR mean value, etc. are extracted. Methodology: Naïve Bayes, decision tree and SVM used for the experiment; 90.10% recognition rate for stress obtained; the model is based entirely on speech recognition, not facial expression.
[23] Description: The objective of the study is to identify stress/depression in student life; the study utilized 322 students of different institutes. Methodology: Perceived stress scale, modified ways of coping scale and COPE scale applied to find the level of stress in students.
[24] Description: The author defines seven types of emotional classes: neutral, surprise, joy, sad, disgust, anger and fear; 121 different feature points on the face are generalized using the modeling tool. Methodology: The study used Microsoft Kinect for 3-dimensional face modeling.
[25] Description: The study performs RCNN-based classification to avoid poorly graded features from the facial extract; the system proposes real-time recognition. Methodology: The facial portion is detected using region proposal networks to obtain high-quality features; the study achieved 94% accuracy using an active shape model and AdaBoost classifier.

Stress levels in patients can be measured using various sets of standard scales proposed and tested through multiple studies. The Holmes and Rahe Stress Scale [12] bases the sources of our mental stress on our life events and contains a list of forty-three life events from which a relative score is calculated. This scale has low accuracy, leading to poor overall performance. The Depression and Anxiety Scale (DASS-42) has 42 questions to calculate individual stress, anxiety, and depression scores. A shorter version of this scale with 21 questions has also been constructed and verified [11]. The Hamilton Anxiety Scale [11] consists of 14 items and can measure psychic and somatic anxiety. There are various scales for measuring depression levels in patients. The Hamilton Depression Rating Scale [8], the Montgomery-Åsberg Depression Rating Scale (Hamilton, Schutte & Malouff, 2001), the Raskin Depression Rating Scale [21], the Beck Depression Inventory [3], the Geriatric Depression Scale (GDS) [14], the Zung Self-Rating Depression Scale [25] and the Patient Health Questionnaire (PHQ) [20] are all different scales and questionnaires that provide a score rating, giving a relative measure of the depression level among patients taking the test. Gabriel Tsechpenakis et al. in 2005 [21] researched deception detection, where it is required to detect and track the regions of interest in the examined video, i.e., the head and hands, using a skin color-based method. Then, the movement descriptors used in the recognition are derived from the extracted blob features, i.e., positions and orientations. Finally, the HMM-based (Hidden Markov Model) approach is used to detect and recognize two possible behavioral states, agitation and over-control, which indicate possible deception. The HMM method is helpful in gesture, gait, and sign language recognition.

The Don Ardell stress test [15] is a separate robust stress test to find specific stress levels in a person's life. It offers a balanced assessment of varied stress sources and underlines the importance of including all aspects of life in understanding stress. The Ardell Wellness Stress Test analyzes a person's physical, mental, emotional, spiritual, and social aspects to outline a balanced assessment. The test contains a six-point scale, including a neutral point at which no negative and no positive emotions reside in life. The Ardell Wellness Stress Test is an effective way to analyze stress in people. It has seven basic score points, which reflect the person's mood. To implement this test in a population, a set of questionnaires is prepared and asked of a student who acts as a stress source. The responses of the student are recorded in terms of 7 fundamental scores, i.e., +3, +2, +1, 0, −1, −2, −3. The sum of all recorded scores from each questionnaire is calculated, and based on the total score, the final stress level of a person is decided. These techniques are based on a questionnaire approach in which the patient needs to answer an interrogator's questions. The decision on the stress level of the patient is taken based on their answers. This technique is obsolete and quite fragile as it requires the physical intervention of the patient. The patient may not co-operate or may provide false information, which leads to decreased stress recognition accuracy. Earlier techniques were applied mostly on static images taken from standard datasets such as ADFES-BIV [16], AVEC [17], DementiaBank [18], the Reach Out Triage shared task [19], etc. These datasets are purely symmetric and do not contain any dynamic variations. Some limitations are also present in the existing emotional


Fig. 3. Flow chart of the proposed scheme.

recognition models. Hardware deployment for the identification of emotional activities is one of the unavoidable challenges found in various studies. It involves many variations and complexities depending on the type of body signal. Another challenge in the current work is mapping the correlation between body signals and psychological events. In emotion identification, the challenging task is to design a model to measure the severity level of emotional symptoms. The symptoms of one class of emotion resemble those of another, and so it becomes challenging for a model to distinguish among the emotion categories. Earlier studies found that an AI model remains convenient compared to medical investigation for detecting emotional activities, but the challenge is that the model must be able to process all body signals; in various existing works, AI models are found to be effective only for specific body signals. Another challenge is to design a cost- and time-efficient model for emotion category identification. The model must be robust so that it can handle any error. Some challenges exist in artificial intelligence techniques, as they may fail to configure the dataset if it contains impurities. Model overfitting and underfitting are issues observed in AI-based practices. Working with low-frequency user input data may be insufficient for emotion recognition analysis, as low-frequency data may contain insufficient features for making a correlation with emotional features.

The proposed technique uses a standard surveillance video dataset [26], which contains surveillance streaming video data of about 300 people. The objective of the proposed model is to ensure the collection of natural facial expressions and features without involving the person in any special interrogation. Unlike previous questionnaire-based techniques, in which a person needs to answer or respond to a set of particular questions in front of a video recorder, the proposed method of video recording from a surveillance system captures a person's natural experiences during different levels of stress. The model does not require the person's interaction in real time, which avoids capturing manipulated facial expressions. Fig. 3 shows the flow diagram of the proposed system. The model takes real-time surveillance video input, then performs pre-processing of the frames to improve feature quality. The proposed method applies the Kanade-Lucas-Tomasi (KLT) algorithm [27] for facial portion extraction, in which the algorithm plots a rectangular box around the face based on the local features. The KLT algorithm identifies the regional binary pattern on the face to avoid the detection of any noise or unwanted details. The algorithm performs local optimization on video frames to extract facial portions in a rectangular block. Here, the pre-processing task improves the visual quality of the video frames so that it becomes easier for the KLT algorithm to plot the trackable features on the first frame and trace each feature in the subsequent frames with the help of displacement. The displacement of a particular feature is defined as the displacement that minimizes the sum of differences. The KLT algorithm produces as output the extracted facial portion from the video frames. The selected facial images then undergo robust feature extraction performed by the Kalman filter algorithm, which is used to track the movements and variations in facial expression. The prediction of stress levels primarily depends on the interpretation of expressions, based on which depression, anxiety, and stress will be classified. The Kalman filter [28] tracks the geometry of eye movement, nose activity, head pose, and the movement of the mouth. All these movements are relatable to various levels of stress. Facial movement and expression are translated into local binary features, which are fed into the ResNet 101 model [11]. The ResNet model performs the classification of features into stress, anxiety, and depression based on feature range, type, value, geometry, and orientation. Other classification algorithms include PCA, the gradient boosting algorithm, KNN, decision tree, naïve Bayes and SVM. These algorithms are also tested on the common feature space to perform a comparative study with the proposed ResNet 101 model.

3. Methodology

As discussed earlier, the proposed model uses a surveillance video dataset in which video frames of 300 people are recorded and analyzed individually. The proposed model aims to distinguish between stress, anxiety, and depression correctly. Fig. 3 shows a flow chart of the proposed scheme.
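The KLT displacement step described above, where a feature's motion is "the displacement that minimizes the sum of differences", can be illustrated with a single Lucas-Kanade least-squares solve on a synthetic frame pair. This is a minimal NumPy sketch for illustration only, not the paper's implementation:

```python
import numpy as np

def lucas_kanade_displacement(frame1, frame2):
    """One Lucas-Kanade step: solve for the (dx, dy) displacement that
    minimizes the linearized sum of squared differences between frames."""
    Iy, Ix = np.gradient(frame1)              # spatial intensity gradients
    It = frame2 - frame1                      # temporal difference
    # Normal equations of the least-squares objective min ||Ix*dx + Iy*dy + It||^2
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    dx, dy = np.linalg.solve(A, b)
    return dx, dy

# Synthetic check: a smooth facial-feature-like blob shifted one pixel right
yy, xx = np.mgrid[0:64, 0:64]
blob = np.exp(-((xx - 32.0) ** 2 + (yy - 32.0) ** 2) / (2 * 8.0 ** 2))
moved = np.roll(blob, 1, axis=1)
dx, dy = lucas_kanade_displacement(blob, moved)   # dx close to 1, dy close to 0
```

In a full pipeline a tracker of this kind is applied per feature point and per frame pair; OpenCV's pyramidal version (`cv2.calcOpticalFlowPyrLK`) handles larger displacements than a single linearized step can.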


Fig. 4. Sample of video frames of the IEMOCAP dataset.

Fig. 3 shows the overall flow diagram of the proposed system. The model first takes the input surveillance video and performs pre-processing. Afterwards, the model extracts the facial portion from the pre-processed video frame. The Kalman filter performs the feature extraction. Finally, the classification of features takes place using ResNet 101 and machine learning algorithms.

3.1. Dataset generation

IEMOCAP [26] (Interactive emotional dyadic motion capture) is a standard video dataset recorded at the SAIL lab. It contains 12 h of surveillance video recordings and captures various expressions, visuals, speech, movements, etc. The dataset also contains multiple dyadic sessions, which are based on detailed interrogation. The frames of the video dataset are labeled with various emotions such as sadness, anger, anxiety, stress, neutral, etc. The proposed methodology uses surveillance videos of about 300 people containing different emotions from the IEMOCAP dataset [26]. A sample of the video dataset is shown in Fig. 4.

The dataset is divided into training and testing sub-parts in a 70 to 30 ratio. The 70% of the data used for training is first resized into equal dimensions. Each video frame has a size of 338 × 320 pixels. These frames are loaded, and the subject's facial portion is extracted with a dimension of 144 × 138 pixels. A few samples of extracted facial images of 5 subjects are shown in Fig. 4; each image is extracted from an individual video frame.

The study conducted for emotional recognition through video surveillance has significant relevance. A video dataset may increase the complexity of feature analysis, but it is found to be very effective for recognizing patients' emotional activities. The video dataset contains the patient's body movement and other activities that record the person's behavior in various situations. It is very easy to find emotional categories through a video dataset, as all the relevant features are available. This may not be effectively possible using other datasets such as audio and image. An audio dataset contains the patient's audio, which the patient can fabricate, and it includes features that may overlap with other emotional activities. The accuracy of emotion recognition based on audio is low, and there may be higher false positive and false negative rates due to feature overlap. An image dataset is also not as effective as a video dataset: it contains static images, and its analysis is limited. Images of patients may not give as rich information about patients' behavior as a video dataset does. In a video dataset, video input from the users can be analyzed frame by frame, which contains various patient information such as expression, body movement, etc. Each frame can be analyzed separately in correlation with the adjacent frames. The algorithm that researchers most often recommend for extracting features from video frames is the Kalman filter; another strong candidate is a neural network, which has a high capability to analyze video features. The Kalman filter enables the capture of feature information from a dynamic input dataset and, in the case of video datasets, can analyze detailed information frame by frame.

3.2. Pre-processing

The input video frames may contain various noises such as low contrast, asymmetric color variation, blurring effects, etc. These noises must be removed before feature extraction to enhance the feature space that helps in classification. Otherwise, the model may fail to read features effectively, and the correlation of features with anxiety, stress, and depression degrades. In data pre-processing, data interpretation and filtering are performed to remove any noise from the data. The data signals may carry unwanted signals and noise, so the pre-processing step is applied to extract the region of interest from the data. Data pre-processing is used to refine the feature vectors of the body signals so that stress, depression, and anxiety symptoms can be recognized accurately. The pre-processing task makes the features more visible to the model. It enhances the feature vector by applying various filters to the input dataset. Beyond a certain threshold of pre-processing, however, information loss can occur. To avoid information loss, the filtration process must be kept under a specific threshold value that gives sufficient filtration intensity. Hence, the proposed model performs pre-processing of the video frames first to enhance the overall feature quality. The proposed model applies the following filters to enhance the feature space.

a) Bilateral filter

This filter is used to obtain a non-linear combination of nearby pixels

A. Singh and D. Kumar Microprocessors and Microsystems 95 (2022) 104681

Fig. 5. Sample of pre-processing task.

Fig. 6. Flow chart of Object Detection from Video Input.

in order to smooth the images in the video frames while keeping the edges preserved. It collects the weighted average of local information that contains intensity information. The weighted average of a sample pixel (W(S)), calculated over the intensity (I), is given as:

W(S) = [ ∑_{s′∈S} I(s′) S(S − s′) T(I(S) − I(s′)) ] / [ ∑_{s′∈S} S(S − s′) T(I(S) − I(s′)) ]

Here S(S − s′) and T(I(S) − I(s′)) are the spatial and tonal weights of the pixel s′. A Gaussian function is applied to the pixel position and to its intensity to smooth its properties:

S(x) = (1 / (σ_S √(2π))) e^(−x² / (2σ_S²))

T(x) = (1 / (σ_I √(2π))) e^(−x² / (2σ_I²))

b) Adaptive filter

In the adaptive filter, the noise removal process depends on the number of associated pixels. It works both as a high-pass filter and as a low-pass filter, based on the number of associated pixels. It adapts the filter level automatically within a range from 0 to the maximum. The filter level is obtained as follows:

f(I) = max(0, f(I − 1) − f_dec),   if q(I − 1) < q_f
f(I) = min(f_max, f(I − 1) + f_inc),   if q(I − 1) > q_f

f_dec and f_inc are the decrement and increment units by which the adaptive filter changes the level of filtration. q(I − 1) is the quantization parameter, which is set so that the error estimate does not surpass a given threshold. This makes the model flexible and adaptive in carrying out pixel enhancement.

c) Contrast enhancement algorithm

First, convert the frame into the fuzzy domain:

F(x_ab) = (1 − cos(π x_ab / 255)) / 2

F(x_ab) is the membership value of the grey-scale frame pixel x_ab. Then apply the contrast enhancement (intensification) equation on the membership value, which lies in [0, 1]:

F(x_ab) = x_ab² / x_threshold,   0 ≤ x_ab ≤ x_threshold
F(x_ab) = 1 − (1 − x_ab)² / (1 − x_threshold),   x_threshold ≤ x_ab ≤ 1

Fig. 5 shows a sample input image from a video frame, the pre-processed image (after applying all the proposed pre-processing algorithms), and the histogram representation of the pre-processed image.
The proposed model applies preprocessing techniques including the bilateral filter, the adaptive filter, and the contrast enhancement algorithm. These preprocessing methods work best to enhance the feature vectors of the action units of the input dataset; they are robust and consume little time to improve the feature vector. Other preprocessing algorithms also exist: pixel brightness transformations, brightness corrections, geometric transformations, image filtering and segmentation, Fourier transformation, image restoration, Laplacian filtering, directional filtering, etc. These preprocessing algorithms are complex and may trigger the loss of information from the image, which may adversely affect the recognition rate. They are also time-consuming and show their best performance only on specific datasets. Therefore, the proposed model utilizes the bilateral, adaptive, and contrast enhancement algorithms as preprocessing steps to enhance the feature quality in the video dataset. The proposed preprocessing technique does not cause any information loss.


Table 3
Sample of histogram obtained from video frames.

Fig. 7. Types of classification Algorithm.
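Two of the pre-processing stages described above can be sketched in NumPy. The sketch below is illustrative only: the fuzzy-domain contrast intensification follows the equations in the contrast enhancement subsection, while the bilateral filter is a plain double-loop implementation; the parameter values (radius, σ_S, σ_I, x_threshold) are assumed defaults, not the paper's settings.

```python
import numpy as np

def fuzzy_contrast_enhance(frame, x_threshold=0.5):
    """Fuzzy-domain contrast intensification of an 8-bit grey frame."""
    # Map grey levels into the fuzzy domain: F(x) = (1 - cos(pi*x/255)) / 2
    x = (1.0 - np.cos(np.pi * frame.astype(np.float64) / 255.0)) / 2.0
    # Piecewise intensification around the membership threshold
    low = x ** 2 / x_threshold
    high = 1.0 - (1.0 - x) ** 2 / (1.0 - x_threshold)
    out = np.where(x <= x_threshold, low, high)
    return np.clip(out * 255.0, 0, 255).astype(np.uint8)

def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_i=25.0):
    """Edge-preserving smoothing: weighted average with spatial and tonal Gaussians."""
    img = img.astype(np.float64)
    pad = np.pad(img, radius, mode='edge')
    h, w = img.shape
    num = np.zeros_like(img)
    den = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = pad[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            tonal = np.exp(-((shifted - img) ** 2) / (2.0 * sigma_i ** 2))
            wgt = spatial * tonal
            num += wgt * shifted
            den += wgt
    return num / den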

3.3. Object detection from video input using the Kanade-Lucas-Tomasi (KLT) algorithm

The KLT algorithm is used to extract the facial portion from a pre-processed video input frame by plotting features on the facial portion and transforming the extraction into a rectangular block. The steps of the KLT algorithm are shown below:
Step 1: Take the input video containing the sequence of captured frames.
Step 2: Plot the region of interest on the facial portion to detect the face components.
Step 3: The localization is made in a rectangular block for one frame image. The same localization is made in the other frames based on plotting


Fig. 8. Basic flow diagram of the ResNet101 model.

Fig. 9. Sample images of people in stress, depression and anxiety.

Table 4
Score calculation of Anxiety.
Image 1 Image 5 Image 8 Image 16 Image 20 Image 32 Image 36 Score Anxiety

3 1 6 2 0 0 0 4 Extremely severe
4 2 4 4 6 5 8 6 Moderate severe
1 0 6 5 0 7 9 3 Normal
3 0 0 4 3 6 0 0 Mild

Table 5
Score calculation of Depression.
Image 3 Image 7 Image 18 Image 29 Image 29 Image 34 Image 42 Score Depression

3 1 6 2 0 0 4 4 Extremely severe
4 2 3 4 6 3 4 6 Moderate severe
1 0 6 5 3 7 9 3 Normal
3 0 0 4 3 6 0 0 Mild

Table 6
Score calculation of Stress.
Image 3 Image 7 Image 18 Image 29 Image 29 Image 34 Image 42 Score Stress

3 1 6 2 0 0 4 4 Extremely severe
4 2 3 4 6 3 4 6 Moderate severe
1 0 6 5 3 7 9 3 Normal
3 0 0 4 3 6 0 0 Mild
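The displacement estimation that underlies KLT point tracking (Section 3.3) can be illustrated with a single Lucas-Kanade least-squares step. This is a simplified sketch of the idea, not the full pyramidal, iterative tracker; the function and the synthetic Gaussian blob are my own illustrative assumptions.

```python
import numpy as np

def lucas_kanade_step(prev, curr):
    """One Lucas-Kanade iteration: estimate the global (dx, dy) shift
    between two frames from image gradients (small-motion assumption)."""
    Ix = np.gradient(prev, axis=1)   # horizontal image gradient
    Iy = np.gradient(prev, axis=0)   # vertical image gradient
    It = curr - prev                 # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d  # estimated (dx, dy)

# Synthetic check: a smooth Gaussian blob shifted half a pixel to the right
ys, xs = np.mgrid[0:41, 0:41].astype(float)
blob = lambda cx, cy: np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * 5.0 ** 2))
dx, dy = lucas_kanade_step(blob(20.0, 20.0), blob(20.5, 20.0))
```

For sub-pixel motions of a smooth pattern, the single least-squares step already recovers the shift closely; the real tracker repeats this per feature window and per pyramid level.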

the tracking points.
Step 4: The tracker is used to estimate the scale, rotation and translation between the previous and the new points.
Feature analysis from each video frame is based on the study of the frequency of every pixel's information. The frequency analysis can be seen in the histogram representation, which contains the pixel information. The fast Fourier transform (FFT) algorithm has been applied in various existing models to fetch frequency components, each of which contains the pixel information of an image. The proposed model also analyzes the frequency variation of the region of interest of the input video dataset. This analysis helps to map or correlate the features with the emotion category. Recognition of stress, anxiety, and depression has been performed based on features containing pixel information in frequency components.
Fig. 6 shows the flow diagram of the functioning of the KLT algorithm, in which feature points are first located in each video frame. The feature displacement has been calculated to track the facial portion of the subject. Fig. 4 shows one of the video frames of a subject from which


the KLT algorithm has extracted the facial portion.

Table 7
Confusion matrices obtained by different ML methods on Anxiety, Depression and Stress.
Method Name | Anxiety | Depression | Stress
PCA [22 0 0 0 0] [22 0 0 0 0] [24 0 0 0 0]
[0 39 18 0 0] [0 39 18 0 0] [0 29 17 0 0]
[0 16 1 0 0] [0 16 60 0 0] [0 16 59 0 0]
[0 5 0 20 0] [0 5 0 1 0] [0 6 0 1 0]
[0 0 0 8 1] [0 0 0 8 1] [0 0 0 8 1]
Gradient boosting algorithm [39 0 0 0 0] [31 0 0 0 0] [29 0 0 0 0]
[7 21 10 0 0] [7 21 3 0 0] [0 20 3 0 0]
[0 16 35 6 0] [0 6 27 2 0] [0 6 26 5 0]
[0 0 0 1 0] [0 0 0 10 0] [0 6 0 7 0]
[0 0 8 0 25] [0 0 8 0 17] [0 0 0 0 13]
ResNet 101 algorithm [38 0 0 2 0] [20 0 0 0 0] [24 0 0 2 0]
[0 30 3 0 0] [0 21 0 0 0] [0 16 0 0 0]
[1 6 12 0 0] [0 6 11 2 0] [6 0 2 0 0]
[0 0 0 4 3] [2 2 0 5 1] [0 2 0 12 0]
[0 0 0 0 12] [0 0 0 0 10] [0 0 0 0 8]
KNN [29 0 0 0 4] [18 0 0 0 4] [20 3 8 0 0]
[0 17 0 0 0] [0 15 0 0 0] [0 13 0 0 0]
[0 20 10 0 0] [21 0 7 0 6] [22 0 20 0 0]
[5 0 0 15 0] [5 0 0 31 0] [0 0 0 26 0]
[0 0 0 0 24] [0 0 0 0 16] [0 0 0 0 19]
Decision Tree [20 0 8 0 4] [18 0 0 0 4] [28 0 2 0 6]
[0 19 0 0 0] [0 15 0 0 0] [4 15 0 0 0]
[3 0 13 6 0] [21 0 7 0 6] [0 0 22 0 0]
[0 0 0 20 0] [5 0 0 31 0] [0 2 0 12 0]
[0 0 0 0 2] [0 0 0 0 16] [0 0 0 0 2]
Naïve Bayes [12 0 2 0 2] [18 0 2 2 0] [20 0 2 2 0]
[3 15 0 0 0] [3 17 0 0 0] [3 17 0 0 0]
[0 5 19 0 0] [0 5 21 0 0] [0 4 21 0 0]
[0 0 0 0 2] [0 5 1 12 0] [2 5 0 17 0]
[0 0 0 0 2] [0 0 0 0 2] [0 0 0 0 5]
Support Vector Machine [23 0 2 0 0] [15 0 2 3 0] [19 0 3 3 0]
[3 8 0 0 0] [5 2 0 0 0] [4 7 0 3 0]
[1 0 20 6 0] [3 0 15 0 0] [0 0 2 1 0]
[0 5 0 12 0] [5 0 0 18 0] [4 0 0 20 0]
[0 0 0 0 2] [0 0 0 0 4] [0 0 0 0 8]

3.4. Feature space

Head pose: The movement of the head correlates with emotions such as happiness, confidence, fear, and stress. The proposed model performs head movement tracking using a Kalman filter in which the X-Y-Z coordinates of the various head poses are evaluated. The X coordinate shows movement at the horizontal level, the Y coordinate reflects the head pose in the vertical direction, and the Z coordinate contains the head pose in the depth direction, covering the pitch, yaw, and roll movements. These head poses are encoded into axes by the Kalman filter. The actions are normalized in the model to suppress variation caused by head pose. The head movement (M) can be calculated using the equation below:

M = (1/N) ∑_{j=1}^{N} (1/6) ∑_{L=1}^{6} ||X_L − X_L^ref||

Here, N is the number of frames in the video input, L indexes the six head points tracked by the model in the X and Y coordinates, and X_L^ref is the reference position of the L-th tracked point. The speed (v) of head movement is computed as:

v = (1/N) ∑_{j=1}^{N} (1/6) ∑_{L=1}^{6} ||X_L(t) − X_L(t − 1)||

Here, t indexes the frame at which the head movement is recorded.
Eye gaze: The psychological emotions are also reflected by the eye gaze, which has been tracked using the Kalman filter. The Kalman filter is used here to define the angles at which eye movement happens. The model encodes the movement onto the X and Y axes and computes the variation of the eye gaze. The model normalizes each person's gaze by taking the difference from the median values calculated over the entire video sequence:

E = (1/2) ∑_{j=1}^{M} |X_j Y_{j+1} − Y_j X_{j+1}|

Here, M is the number of equal segments applied over the X and Y coordinates by the Kalman filter in order to track eye movement.
Facial Action Units (AUs): The facial expression is encoded in terms of various facial muscle movements termed facial action units. These action units are movements/actions of localized facial muscles. The Kalman filter is used to track and localize the movement of the facial muscles and converts these movements into action units. Each action unit defines a cue of the emotional state of a person. The proposed model identifies 15 action units, AU1 through AU15. Action units contain a variety of action intensities in the form of changing coordinates. These action units contain valuable features for determining

Fig. 10. Basic GUI (graphical user interface) of working model.


Table 8
Statistical measures of different classification methods.
Classifier Mental illness Accuracy Error Rate Precision Recall F1 ROC Area

PCA Anxiety 0.638 0.37 0.531 0.638 0.649 0.849
Depression 0.724 0.2744 0.550 0.724 0.730 0.850
Stress 0.708 0.292 0.534 0.708 0.716 0.876
Gradient boosting algorithm Anxiety 0.720 0.28 0.785 0.720 0.714 0.949
Depression 0.803 0.21 0.826 0.803 0.807 0.930
Stress 0.826 0.244 0.822 0.826 0.825 0.916
ResNet 101 algorithms Anxiety 0.865 0.135 0.816 0.865 0.869 0.992
Depression 0.838 0.139 0.806 0.838 0.850 0.980
Stress 0.861 0.267 0.785 0.861 0.881 0.996
KNN Anxiety 0.733 0.267 0.785 0.733 0.72 0.939
Depression 0.707 0.293 0.7777 0.707 0.731 0.970
Stress 0.707 0.293 0.777 0.707 0.731 0.926
Decision Tree Anxiety 0.779 0.221 0.843 0.779 0.779 0.959
Depression 0.867 0.133 0.914 0.8657 0.865 0.960
Stress 0.865 0.151 0.885 0.849 0.836 0.936
Naïve bayes Anxiety 0.769 0.231 0.808 0.769 0.760 0.899
Depression 0.795 0.205 0.829 0.795 0.791 0.940
Stress 0.816 0.184 0.846 0.846 0.814 0.956
Support Vector Machine Anxiety 0.793 0.207 0.819 0.793 0.792 0.969
Depression 0.750 0.259 0.730 0.750 0.752 0.970
Stress 0.757 0.243 0.752 0.757 0.758 0.946
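The measures in Table 8 follow from per-class confusion matrices such as those in Table 7. A sketch of the computation, assuming rows hold the true classes and columns the predictions (the function name and layout convention are mine):

```python
import numpy as np

def metrics_from_confusion(cm):
    """Per-class precision/recall/F1 and overall accuracy from a confusion
    matrix whose entry cm[i, j] counts true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # column sum minus the diagonal
    fn = cm.sum(axis=1) - tp          # row sum minus the diagonal
    accuracy = tp.sum() / cm.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For the two-class matrix [[8, 2], [1, 9]], for example, this yields an overall accuracy of 0.85 and a class-0 recall of 0.8.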

stress, anxiety, and depression. The power of movement is tracked using local feature points plotted over the facial expression by the Kalman filter. Each action unit is normalized by taking the difference from the 1st quartile over the entire video frame. The facial action units are stored by means of finding correlations between similar actions. The other action units, which involve body heating, skin conductance, vibrations, small gestures, etc., are not considered in this experiment. These action units are not easily visible in the video dataset, or if they are visible, they contain much less information. The feature information in these action units may not be analyzable through a video dataset, as video frames are not very sensitive to such subtle movements; various body sensors can be used to measure action units containing such sensitive features. Such action units may also have overlapping features, e.g., skin conductance can be due to various emotional factors and so may not be classifiable into a specific emotion category, which is why they are not considered in the proposed scheme. The correlation (R) on the features (a) is computed by the following equation:

R(a) = (1/N) ∑_{j=1}^{N} a_j

The features of the facial signs are the action units for which the correlation has been found. The facial sign (s) is computed using the binomial distribution as given below:

s(r, n, p) = (n choose r) p^r (1 − p)^(n − r)

Here, n is the number of facial signs reported by the algorithm, r is the number of agreements, and p is the prior probability.

Fig. 11. Accuracy of the proposed result in comparison with the different classifiers.

Histograms: All the facial features are tracked, then transformed

Fig. 12. ROC area results of the proposed and several existing methods on the IEMOCAP dataset.


into a histogram representation, which shows the frequency distribution of the entire feature space for a person. The histogram representation contains the change in frequency based on expression changes. Histogram computation has been done for eye movement, head movement, mouth movement, and the expression action units. The intensities of the features are already encoded in the facial action units, which are transformed into histogram representations ranging from −90 to 90 on the X and Y axes. The histogram is computed using three equally spaced bins, which contain the features of stress, anxiety, and depression. Table 3 shows the histogram formation of some of the video frames used as a dataset in the proposed scheme.

Table 9
Comparison of the proposed model (ResNet101) with other recent works.
Refs. | Dataset | Analysis | Model | Result (Average)
Carneiro de Melo [30] | AVEC 2013 | Depression | ResNet-50 | Error rate is 7.97
Wang et al. [29] | DAIC-WOZ dataset | Depression and Anxiety | SS-LSTM-MIL | F1 score is 0.783
Haque et al. [26] | DAIC-WOZ dataset | Depression | C-CNN-AVL | F1 score is 0.769
Uddin [30] | AVEC 2014 | Depression and Anxiety | Bi-LSTM | Error rate is 0.74
Vázquez-Romero et al. [31] | Oz (DAIC-WOZ) database | Depression | Ensemble-50 1D CNN | Accuracy is 72%
de Melo et al. [32] | AVEC 2013, AVEC 2014 | Depression | ResNet-50 | Error rate is 8.23
Proposed Scheme | IEMOCAP [26] | Depression, anxiety and stress | ResNet-101 | Accuracy is 99.4% and F1 score is 0.875

3.4.1. Feature space extraction from images using Kalman filter
Tracking the region of interest from the sequence of facial portions is a most sensitive task, since it requires displacement mathematics for feature extraction. The Kalman filter has been used in the proposed system to track the movement and speed of the subject's parts, such as the head, mouth, eyes, etc. It first locates the region of interest on the extracted facial subject and then performs tracking in a sequence of frames where the subject's movement is effectively traced. Tracing the movements of the region of interest also depends on the frame rate and the search region. Tracking is the localization of features across the frame sequence, which is well performed by the Kalman filter. The Kalman filter takes account of the subject representation and its association for tracking purposes. The Kalman filter is initialized by estimating an a priori parameter for the tracking of the features. This parameter gives the patterns' location so that the Kalman filter can update the predicted state. The Kalman filter then predicts the patterns in the next frame based on the difference of movement information from the previous frame. The Kalman filter performs a time update reflecting the current state's forward movement with respect to time. It also estimates the error covariance to compute the a priori parameter for the next frame. The model also updates the measurement in feedback, which includes the improved posterior values (Algorithm 1).

Algorithm 1. Algorithm for feature extraction using the Kalman filter.
Step 1: Input the extracted facial components.
Step 2: Time update:
Subject state (X_k): X_k = (x_k, y_k, x′_k, y′_k)^T
x_k and y_k are the series of x- and y-axis locations of the extracted facial subject in the frames of the video input; x′_k and y′_k are the corresponding x- and y-axis speeds.
X_k = M X_{k−1} + N w_k
M and N are the system parameters in matrix form; w_k is the Kalman weight (process noise) used in tracking the features.
Subject error covariance (P_k): P_k = M P_{k−1} M^T + Q
Step 3: Measurement update:
Calculate the Kalman gain: G_k = P_k t^T / (t P_k t^T + R)
Here, t is the parameter of the multi-measurement system.
Update the estimation using the measurement model, with observation vector Z_k:
X_k = X_k + G_k (Z_k − t X_k), where Z_k = t X_k + V_k
Update the error covariance: P_k = (1 − G_k t) P_k
P_k is the a posteriori estimate.
Step 4: Feature estimate for each pixel:
X_t = X_{t−1} + w_t (L_t − X_{t−1})
X_t is the feature estimate for the pixel at time t, w_t is the Kalman weight for time t, and L_t is the luminance intensity value of the pixel.

3.5. Feature classification algorithms

The proposed system applies the ResNet 101 model for the classification of the extracted feature space into the stress, depression, and anxiety classes. The proposed method also uses other classification algorithms on the same feature space to compare the results and prove the efficiency of ResNet 101. These classifiers are PCA, the gradient boosting algorithm, a dimensional reduction algorithm, KNN, decision tree, Naïve Bayes and SVM. Fig. 7 shows the types of classification algorithms used in this paper.

A) PCA (Principal component analysis)

Principal component analysis is a dimensionality reduction approach in which the dimension of the feature space is reduced into various components. Each component is individually processed for classification by the PCA algorithm. The PCA algorithm is described below.
Suppose a_1, a_2, a_3, …, a_n are the feature vectors.
Step 1: Compute the mean:
ā = (1/N) ∑_{i=1}^{N} a_i
Step 2: Subtract the mean:
∅_i = a_i − ā
Step 3: For the feature matrix Y = [∅_1, ∅_2, …, ∅_n], compute the covariance matrix:
D = (1/N) ∑_{n=1}^{N} ∅_n ∅_n^T = (1/N) Y Y^T
Step 4: Compute the eigenvalues of D: μ_1 > μ_2 > … > μ_n
μ_1, μ_2, …, μ_n are the orthogonal eigen-directions of the input image matrix, and a − ā is a linear combination of the eigenvectors:
a − ā = b_1 μ_1 + b_2 μ_2 + … + b_n μ_n = ∑_{i=1}^{n} b_i μ_i
b_i = (a − ā) · μ_i / ||μ_i||²
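The PCA steps above can be condensed into a few NumPy calls. This is an illustrative sketch, not the paper's implementation; `pca_project` and its argument names are mine, with `H` the number of retained components:

```python
import numpy as np

def pca_project(A, H):
    """Project N feature vectors (rows of A) onto the H leading
    eigenvectors of their covariance matrix."""
    mean = A.mean(axis=0)                 # Step 1: mean feature vector
    Phi = A - mean                        # Step 2: subtract the mean
    D = Phi.T @ Phi / A.shape[0]          # Step 3: covariance matrix
    vals, vecs = np.linalg.eigh(D)        # Step 4: eigen-decomposition
    order = np.argsort(vals)[::-1]        # eigenvalues in decreasing order
    U = vecs[:, order[:H]]                # top-H eigenvectors
    B = Phi @ U                           # projection coefficients b_i
    return B, U, mean
```

With H equal to the full dimension, `B @ U.T + mean` reconstructs A exactly; choosing a smaller H performs the feature reduction.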


Fig. 13. Comparison of the proposed model (ResNet101) with other existing works.

Step 5: Feature reduction:
a − ā ≈ ∑_{k=1}^{H} b_k μ_k, where H ≪ n
Here, H is the number of leading eigenvalues retained. Dimensionality reduction is performed by:
[b_1, b_2, …, b_k]^T = μ^T (a − ā)

B) Gradient boosting algorithm

The gradient boosting algorithm is similar to the AdaBoost algorithm. In gradient boosting, the initial weights are assigned with low decision-making capability. The weights are increased to convert a weak classifier into a strong classifier, so the prediction capability of the model increases gradually with the training information. The gradient boosting algorithm defines an error/loss function caused by the difference between the actual predicted value f(X_i) and the target value Y_i, and aims to reduce this difference using weight updates. The loss function is given as:
L(X_i) = |f(X_i) − Y_i|

C) KNN (K-nearest neighbor)

The KNN algorithm finds the nearest neighbors using a similarity index given by the Euclidean distance between the feature spaces; the nearest features will belong to the same class. The distance used by the KNN algorithm is shown below:
d(x, y) = √( ∑_{i=1}^{N} (x_i − y_i)² )
Here, x and y are two feature vectors.

D) Decision tree

A decision tree algorithm is used for classification using the feature space, which is translated into a tree structure in which each node carries the information gain of a feature. The information gain of each node is calculated to construct the decision tree; the leaf node contains the class, i.e., stress, anxiety, or depression.
Algorithm:
Step 1: Entropy of each feature attribute:
E(X) = ∑_{i=1}^{n} −P_i log₂ P_i
where P_i and P_i′ are the probabilities of occurrence and non-occurrence of the features in the dataset X.
Step 2: Information gain (G):
G(X) = E(X) − E(X)′
Here, E(X)′ is the summation of the entropies of the subsets of the actual dataset.
SplitInformation(X) = − ∑_{i=1}^{n} (|X_i| / |X|) log₂ (|X_i| / |X|)
Gain ratio = Gain / Split information
Step 3: After calculating the information gain of each feature, select the feature with the maximum information gain as the root node. In this way, the other internal nodes of the tree are decided.

E) Naïve Bayes

The naïve Bayes classifier uses Bayes' theorem to find the probability that a feature space belongs to a class, which could be anxiety, stress, or depression. Bayes' theorem is defined as:
P(X|B) = P(B|X) P(X) / P(B)
Here, P(X|B) is the probability of X given that event B is already true; X and B are two events. The naïve Bayes classifier compares the probabilities of belongingness of the various feature spaces to a class, and the feature with the maximum probability is assigned to the corresponding class.

F) SVM (Support Vector Machine)

SVM is used in the proposed system to classify the extracted feature space into classes like stress, anxiety, and depression. The SVM plots all the training features into the three-dimensional X-Y-Z plane and classifies the feature points into segregated classes using a decision boundary (hyper-plane). The equation of the linear decision boundary is given as:
Y = A·B + C
Here, Y is the predictor, A is the slope that decides the inclination of the best fit of the decision boundary, B is the training feature for which the predictor will be calculated, and C is the intercept. SVM also draws a non-linear decision boundary for large and complex datasets in which the target value/class depends on more than one feature space. The equation for the non-linear decision boundary is given as:
∑_{i=1}^{N} α_i Y_i K(s_i, z) + b = 0
Here, α_i is the learning rate, which is a constant value, s_i is a support vector of the SVM, and z is the training pattern.

G) Proposed classification method using ResNet 101

ResNet 101 is a neural network that contains various perceptrons and forms a learning base for the incoming feature space. The network includes an input layer at which the features are fed to the network. Next comes the hidden layer, which performs black-box processing of the features, in which the model trains itself. Then, the output layer resides where the


output of the classification of the features is generated, which is further fed back to the input layer for any error resolution. Fig. 8 shows the basic flow diagram of the ResNet 101 model: the Y predictor variable is generated as output for the X input feature using the F perceptron function and the b bias input.

4. Experimental results

The proposed scheme classified the obtained features into anxiety, stress, and depression using various classification models. The confusion matrices of the respective models, including the proposed one, are described below.
Fig. 9 shows a few samples of images of people who have stress, depression, and anxiety. The proposed model's objective is to classify each individual's psychological state. The psychological condition of an individual depends on facial features, which rely on the facial action units and movements.
Tables 4–6 are the score calculations of anxiety, depression, and stress, respectively. These score points are calculated on randomly extracted images taken from the dataset. Each category is further divided into extremely severe, moderately severe, normal, and mild. These categories are based on score points derived from the feature space of the images.
Fig. 10 shows the GUI of the working model. The figure is divided into two sections: the upper section shows the accuracy rate, and the lower section shows the error rate. The other details of the experiment are mentioned on the right side of the graph.
The confusion matrix of each algorithm yields the necessary information, such as the accuracy rate, error rate, precision, recall, and F1 score. These parameters are evaluated as defined below:

Accuracy rate = (sum of diagonals of the confusion matrix) / (total number of instances classified)
Error rate = 1 − Accuracy rate
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Kappa = (Total Accuracy − Random Accuracy) / (1 − Random Accuracy)
F1 Score = (2 × Precision × Recall) / (Precision + Recall)

Here, TP (true positive) is the diagonal entry of the matrix for a class, FN (false negative) is the sum of the remaining entries of the corresponding row for the class, FP (false positive) is the sum of the remaining entries of the corresponding column for the class, and TN (true negative) is the sum of all the other rows and columns.

Table 8 contains the accuracy-related parameters of the various classification algorithms: accuracy, error rate, precision, recall, F1 score, and ROC area. The proposed model has applied algorithms such as principal component analysis, gradient boosting, Naïve Bayes, K-nearest neighbor, decision tree, support vector machine, and ResNet-101; the performance measures of all the applied algorithms are presented in Table 8 with the help of the confusion matrix obtained for each classification technique. The table shows that the average accuracy of ResNet101 is the highest compared to the other classification algorithms, and the algorithm that best suits the task in terms of robustness and accuracy is found to be ResNet-101. It is evident from Table 8 that the ResNet 101 network is efficient for classifying the emotional status of the various subjects chosen from the surveillance video dataset. ResNet 101 performs best in the neural network category and can distinguish each class at a high dimension. The performances of Naïve Bayes and KNN are almost similar. The SVM algorithm also performs well, as its non-linear kernel efficiently distinguishes the classes in high dimensions. The performance of the decision tree is also good, but the computation of the information gain for each decision node is time-consuming. The PCA classifier shows poor accuracy on the given feature space, but it consumes less computation time.
Fig. 11 shows the comparison of the various classifiers in a graphical view. The graph contains three colored bars: the green bar shows the accuracy for anxiety, the blue bar indicates the accuracy for stress, and the yellow bar shows the accuracy for depression.
Fig. 12 shows the ROC curve comparison of the various classification models on the IEMOCAP dataset. The ROC plot shows three curves, for stress, depression, and anxiety. It is concluded from the figure that the ResNet 101 model achieves better accuracy on the IEMOCAP dataset than the other classification models. Table 9 shows the comparison of the proposed model (ResNet101) with other recent works.
Fig. 13 shows the comparison graph of Table 9. It shows that stress-recognition-related works have been performed on various datasets in recent years. The earlier results show the detection of depression on the AVEC, DAIC-WOZ, and similar datasets. Most works were accomplished using neural network models such as ResNet-50, SS-LSTM-MIL, CNN, Bi-LSTM, etc. The proposed work applied the ResNet-101 model to recognize depression, stress, and anxiety from video input, and it is found to be more acceptable in terms of accuracy and precision as compared to the other techniques. The stress analysis is significant as it improves the efficiency of real-time applications and reduces medical expenses. The stress analysis is automated in the proposed model, which uses artificial intelligence and its related algorithms; these algorithms are found to be robust for stress analysis. The significance of the proposed model lies in its robustness and in applications that reduce space and time complexity.

5. Conclusion and future scope

The proposed scheme is successfully able to distinguish the features into the anxiety, stress, and depression classes. The proposed algorithm efficiently applied pre-processing techniques to enhance the feature quality of the video frames: a bilateral filter, an adaptive filter, and a contrast enhancement algorithm are applied to the surveillance video frames to enhance the quality of the frames and the feature statistics. These filters and techniques improve the feature quality for non-symmetric feature spaces. The facial portion of the video frames is extracted using the Kanade-Lucas-Tomasi algorithm, which applies rectangular blocks around the facial subjects. Then a Kalman filter is used for feature extraction; it applies a tracking system to capture the movement of the eyes, head, mouth, and facial action units to draw out the feature points. The classification task has been carried out by various algorithms such as PCA, gradient boosting, KNN, decision tree, Naïve Bayes, and SVM. The proposed classification algorithm is ResNet 101, which uses a neural network to classify the features into stress, anxiety, and depression. The proposed algorithm shows an average accuracy of approximately 98.4% on the IEMOCAP [26] dataset, which is found to be higher than that of the other classification algorithms. Hence, the proposed scheme concludes that the ResNet 101 model is very efficient on the given feature space and robust against error. In the future, the proposed system can be extended to evaluate other psychological states, such as happiness, anger, and excitement, that correlate with facial expressions. Other robust neural network algorithms can also be applied to increase the robustness of the model, and the recognition of various further emotions can be added while maintaining a good accuracy rate. Attacks can also be applied to the video frames to check the robustness of the recognition model. In the future, the face recognition task can be performed using


thermal images from video input to minimize the noise effect. In real time, detecting stress, anxiety, and depression features can significantly reduce suicidal and criminal cases. The study is also relevant to increasing business productivity by applying the model to employees to capture their moods. Automatic detection of stress, anxiety, and depression using a machine learning model is relevant to saving lives and reducing medical expenses. The study also helps improve business strategy by identifying the customer's mood. The proposed research is relevant to assisting medical investigation in recognizing patients' psychological activities so that crimes and suicidal cases can be prevented.
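The statistical measures reported in Table 8 (accuracy, per-class precision, recall, and F1-score) can all be derived directly from a confusion matrix. As a minimal sketch of that derivation, assuming a hypothetical 3x3 confusion matrix for the stress, anxiety, and depression classes (the counts below are illustrative only, not the results of Table 8):

```python
import numpy as np

def metrics_from_confusion(cm, labels):
    """Derive per-class precision, recall, F1 and overall accuracy
    from a square confusion matrix (rows = true, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                       # true positives per class
    precision = tp / cm.sum(axis=0)        # column sums = predicted counts
    recall = tp / cm.sum(axis=1)           # row sums = actual counts
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    report = {"accuracy": accuracy}
    for lab, p, r, f in zip(labels, precision, recall, f1):
        report[lab] = {"precision": p, "recall": r, "f1": f}
    return report

# Hypothetical counts for illustration only (not the paper's data)
cm = [[50, 3, 2],    # true: stress
      [4, 45, 6],    # true: anxiety
      [1, 5, 49]]    # true: depression
report = metrics_from_confusion(cm, ["stress", "anxiety", "depression"])
print(f"accuracy = {report['accuracy']:.3f}")  # → accuracy = 0.873
```

The same per-class figures are what standard toolkits (e.g. scikit-learn's `classification_report`) compute internally, so this sketch can be cross-checked against such a library when reproducing the evaluation.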
Ethical approval

This paper has not been submitted to or published in any other venue. It does not contain any studies with human participants or animals performed by any of the authors. The submitted work is original and has not been published elsewhere in any form or language.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Divya Kumar is an Assistant Professor in the Department of Computer Science and Engineering at Motilal Nehru National Institute of Technology Allahabad, India. He received his PhD from Motilal Nehru National Institute of Technology Allahabad, India. His research interests are software engineering, soft computing, evolutionary optimization, reliability engineering, and machine learning.

Astha Singh is a PhD candidate in the Department of Computer Science at Motilal Nehru National Institute of Technology Allahabad, India. She received her M.Tech in Computer Science from the Centre for Advanced Studies, Lucknow, India. Her research interests include machine learning and natural language processing.
