
Applied Soft Computing 134 (2023) 109997


Fine-grained image analysis for facial expression recognition using deep convolutional neural networks with bilinear pooling

Sanoar Hossain^a, Saiyed Umer^a,*, Ranjeet Kr. Rout^b, M. Tanveer^c

^a Department of Computer Science and Engineering, Aliah University, Kolkata, India
^b Department of Computer Science and Engineering, National Institute of Technology, Srinagar, J&K, India
^c Department of Mathematics, Indian Institute of Technology Indore, Indore, M.P., India

Article history: Received 20 March 2022; Received in revised form 17 December 2022; Accepted 2 January 2023; Available online 10 January 2023.

Keywords: Fine-grained facial expression; Convolutional neural networks; Bi-linear pooling; Fine tuning; Matrix normalization; Transfer learning

Abstract

Facial expressions reflect people's feelings, emotions, and motives, attracting researchers to develop fully automatic facial expression recognition systems. Despite the advances of deep learning frameworks for automatic facial expression recognition, model complexity, limited training samples, and subtle micro facial muscle movements keep facial emotion recognition challenging. This research proposes a deep learning framework using fine-grained facial action unit detection to identify facial activity, behavior, and mood and to recognize a person's emotions based on these individual patterns. The proposed facial expression recognition system involves pre-processing, feature representation and normalization, hyper-parameter tuning, and classification. Here, two different convolutional neural network models are introduced for feature learning and representation, followed by classification. Various advanced feature representation methods, such as image augmentation, matrix normalization, fine-tuning, and transfer learning, are applied to improve the performance of the proposed work. The proposed work's performance and efficiency are evaluated under different approaches. The proposed work has been tested on the standard Static Facial Expressions in the Wild (SFEW 1.0 and SFEW 2.0) and Indian Movie Face (IMFDB) benchmark databases. The performances of the proposed system on these databases are 48.15%, 80.34%, and 64.17%, respectively. The quantitative analysis of these results against standard existing state-of-the-art methods shows that the proposed model outperforms the other competing methods.

© 2023 Elsevier B.V. All rights reserved.

1. Introduction

Facial expressions are arrangements of facial muscle movements, or a sort of deformation, that can convey [1] tiny or micro information about the emotional state [2], physical state [3], and intention [4] of a person. They are a crucial non-verbal method of communication [2] for human beings. According to different researchers [5,6], verbal components convey thirty percent of human communication, while nonverbal components convey seventy percent. Among various interpersonal communication systems, facial expressions are the major components of nonverbal information communication channels [7]. In the last few decades, research on human facial expression recognition (FER) for emotion classification has attracted enormous attention across diversified fields, i.e., perceptual and affective computing, cognitive sciences, and computer animations [6]. Facial expression recognition and emotion classification are challenging problems in computer vision [8]. As humans are emotional beings, it is no exaggeration that emotions are an integral part of life that affect decision-making and mental and physical health [9].

An emotion detection system is essential in the field of social media to detect the emotions of users [9]. Automatic facial expression recognition systems (FERS) [10] have recently attracted attention in industry and academia for their comprehensive range of applications in e-Healthcare [11], Emotion-AI [12], Social-IoT [13], Cognitive-AI [14], online learning engagement [15], criminal suspect identification, and children's literature. Compared with other soft biometric systems, a facial expression recognition system can recognize non-cooperative subjects in a non-intrusive manner. The FERS can be applied to border control, surveillance security systems, digital entertainment, forensics investigation, etc. Facial expression recognition works across heterogeneous fields and has achieved great success, from suspicious-activity detection in surveillance cameras [16] to approaching human-level performance. Hence, there are some cues concerning FERS, which are as follows:

∗ Corresponding author.
E-mail addresses: [email protected] (S. Hossain), [email protected] (S. Umer), [email protected] (R.K. Rout), [email protected] (M. Tanveer).

https://doi.org/10.1016/j.asoc.2023.109997
1568-4946/© 2023 Elsevier B.V. All rights reserved.

• Psychology of Facial Expression: Facial expressions [17] are a universal way for people to convey their mood or mental state through some basic facial expressions (e.g., anger, happiness, disgust, fear, sadness, neutral, and surprise). The changes present in a facial expression are identified and determined by minor deformations in wrinkles or bulges [18], or significant deformations in the regions of interest (ROI), i.e., mouth, eyes, nose, forehead, cheeks, and eyebrows [19]. Many observers measure facial expressions by looking at behaviors like mood, attitudes, personality, and emotion. Ekman and Friesen [20] described two conceptual approaches to measuring facial behavior: in an orthogonal dimensional space, and using eight discrete emotions with valence and arousal levels [21]. In discrete emotion annotation, where human agreement is usually high, predicting valence and arousal levels in the continuous space is complex.

• Facial Expression vs. Emotion: Emotion has been described as a strong feeling deriving from one's circumstances, relationships with others, and mood or state of facial expression [22]. Facial expression is a nonverbal communication method associated with facial activity, behavior, and appearance caused by facial muscle movements beneath the skin of the face. Facial expression is almost always used to figure out every aspect of emotional states in research, including psycho-physiology [23] and emotion disorders [24]. Emotions are biologically based psychological states carried out by neuro-physiological changes, generally associated with feelings, thoughts, behavioral responses, and a degree of pleasure or displeasure [25]. Facial emotions are typically classified into seven basic classes [26], sometimes extended with contempt as an eighth: (i) Anger (combination of lowered brow, upper lid raiser, lip tightener, and lid tightener); (ii) Disgust (combination of lip corner depressor, nose wrinkler, and lower lip depressor); (iii) Fear (combination of inner/outer brow raiser, upper lip raiser, brow lowerer, lip tightener, jaw drop, and lip stretcher); (iv) Happy (cheek raiser and lip corner puller); (v) Neutral (normal appearance); (vi) Sad (lip corner depressor, inner brow raiser, and brow lowerer); (vii) Surprise (inner or outer brow raiser, upper lip raiser, and jaw drop); and (viii) Contempt (facial muscle, nose, and mouth contraction) [27]. Here, the proposed facial expression recognition system (FERS) considers these expressions for predicting emotions. Fig. 1 shows examples of people's facial expressions on the human face.

• Facial Action Coding System: The facial action coding system (FACS) measures various asymmetric facial actions that appear on each side of the face [28]. FACS identifies micro-deformations or discrete facial movements, extracts geometrical features, and produces temporal profiles of each facial movement [29]. FACS encodes the micro-individual signals of facial muscles from slightly different instances of displacement in facial appearance. The FACS task is to infer from these final facial appearances in order to systematically categorize the physical expressions of emotions. Explicitly, FACS distinguishes between facial actions and inferences about what they mean. Here, the observer's FACS task is to predict actual labels by making inferences about facial behavior, such as attitude, mood, emotion, traits, and personality. FACS is used to code the intensity of each facial action on a five-point scale [28], generating seven basic universal emotions: anger, contempt, disgust, happiness, fear, sadness, and surprise.

• Facial Action Unit Detection System: Action units (AUs) [28] are facial movements attributed to the contraction of different muscles [30]. Facial action unit detection detects action units from an input face, such as lip tightening and eye, eyebrow, mouth, nose, forehead, and cheek raising. AUs are the minor visually discriminable facial muscle movements with some additional qualifications [31]. The primary deformation AUs are defined by tiny facial muscle movements along with their characteristics and types [32]. AU detection recognizes facial expressions by analyzing cues about the movement of specific atomic muscles in the local facial area. The values of AUs are calculated by detecting facial feature points, and these values are used to classify and recognize emotion classes.

• FER approaches: Geometric-based and appearance-based [33] facial expression recognition systems are the two most widely used approaches, operating on static images and on dynamic image or video data. Facial expression recognition techniques can be broadly classified into statistical methods using handcrafted features and very deep or ultra-deep learning methods with deep feature representation schemes. The statistical process involves two types: (i) local feature extraction and (ii) local-to-global feature extraction, while deep learning approaches use hybrid feature representation techniques.

Fig. 1. An example of seven different facial expressions and emotions [34].

Contributions:

• Recently, very deep learning methods using convolutional neural networks (CNNs) have not provided the desired results on some complex and challenging standard data sets. Our proposed research gives better results on challenging data sets because it reduces residual noise during face pre-processing, whereas most CNN architectures are not concerned with noise artifacts.

• Our model reduces the high computational cost and overcomes the problem of using too many parameters. The proposed model reduces parameters by a weight decay rate ∂ factor, representing a matrix as a product of low-rank factors. Here, parameter reduction mitigates the over-fitting problem and improves run-time efficiency by using fewer operations.

• Traditional CNNs and earlier deep CNNs give better results for standard image sizes but not for arbitrary image sizes. The poor performance of these earlier classical and deep CNN models for facial expression recognition reflects the complexity of the databases, which is efficiently tackled by the proposed hybrid bilinear model (CNNBilinear) approach.

• We have developed a deep learning model for a fine-grained expression recognition system for symmetric and non-symmetric faces that represent different types of expressions due to variations in motion or micro muscular movements in AUs. The proposed method uses a feature map function that uniquely maps the emotion type to the corresponding facial expression. This one-to-one or injective-mapping-based CNN model has been employed with an off-the-shelf multi-linear solver. It is advantageous for achieving large-scale and highly tuned solvers to learn bi-linear classifiers with thousands of features and images.
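The parameter-reduction idea in the second contribution, representing a weight matrix as a product of low-rank factors, can be sketched in a few lines; the layer dimensions and the rank below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper):
# a dense layer mapping 512 features to 512 outputs, factored at rank 32.
d_in, d_out, rank = 512, 512, 32

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))          # full-rank weight matrix

# Truncated SVD gives the best rank-r approximation W ~= U @ V.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :rank] * s[:rank]             # (d_out, rank)
V = Vt[:rank, :]                            # (rank, d_in)

full_params = d_out * d_in                  # 262144
low_rank_params = rank * (d_out + d_in)     # 32768, an 8x reduction
print(full_params, low_rank_params)

# Applying the factored layer costs two thin matrix products
x = rng.normal(size=d_in)
y = U @ (V @ x)                             # instead of W @ x
```

Besides cutting storage, the factored form replaces one large matrix-vector product with two much thinner ones, which is the source of the run-time savings the contribution mentions.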

The rest of this paper is organized as follows: Section 2 briefly describes the literature related to the research work. Section 3 gives a detailed description of the framework, covering the face pre-processing techniques applied to an input image, and introduces the proposed CNN architecture using bi-linear pooling for better feature representation and classification on various sparse facial databases. Section 4 describes the databases and demonstrates the experimental results, solution, and analysis. Section 5 concludes the paper.

2. Related work

Over the last few decades, several shared-representation in-the-lab and in-the-wild facial expression databases have been collected to develop robust Facial Expression Recognition Systems (FERS) using affective computing [21]. From the input face images, emotional features that express the characteristics are extracted. Attributes associated with facial image annotation are Expressions (Anger, Sadness, Happiness, Fear, Surprise, and Disgust), Illumination (High, Medium, Bad), Pose (Frontal, Right, Left, Up, and Down), Occlusion (Glasses, Ornaments, Beard, Hair, and Hand), Age (Old, Young, Child, and Middle), and Gender (Male and Female). These characteristics are critical for developing and assessing an automatic FERS [35]. It is also important to mention the diversified disturbing factors associated with these attributes and characteristics, including pose variation, identity, illumination, etc. Most state-of-the-art benchmark in-the-wild and in-the-lab databases provide labels of facial expression, pose variation, and identity but lack label information on other disturbing factors. Some previously proposed methods have coped with some of these factors while ignoring the heavy entanglement between facial expressions and disturbing factors. The face is the most crucial component of the human body for our daily mutual interaction [36]. A novel deep learning framework for facial expression recognition using distinctive and discriminant features has been proposed in [7].

Image enhancement methods are mainly used in the pre-processing steps before higher-level vision tasks, such as object recognition and classification [37,38]. Image enhancement is applied to improve the visual quality of the input image and to overcome or reduce the complexity and challenges in facial images, i.e., noise, blur, compression artifacts, or poor contrast. Gaussian smoothing, histogram equalization, bilateral filtering, and weighted least squares are well-known image enhancement techniques that use a 3 × 3 Sobel filter. These techniques are expensive and time-consuming for high-resolution images. To overcome these drawbacks, convolutional neural networks [39] and bilinear pooling [40,41] have been proposed [39] that successfully simulate large-scale image enhancement by training on massive input samples and generating output image pairs at run time. These methods eliminate the nodes and wavelets with the lowest discriminant power. Moreno et al. [42] proposed a 3-D information model for facial expression recognition using the Gabor function. The Gabor response functions used three Gabor parameters (frequency (ν), Gaussian envelope width (σ), and orientation (η)) for feature extraction and representation. The authors uniformly sampled the orientation η for each crucial facial feature and computed the ν and σ values in every η slice.

He et al. [43] proposed a multiple impact feedback recognition model, inspired by a progressive enhancement method, for facial expression identification on multiview facial expressions with occlusion (FMEO), using discrete wavelet decomposition and a set of collateral Support Vector Machine classifiers for large-scale in-lab and in-the-wild databases. Ghazouani [44] proposed a genetic programming framework for facial expression recognition using three functional steps, i.e., texture- and geometric-based feature selection, a feature-level fusion technique, and a binary classifier, for spontaneous and posed FER databases. Yi et al. [45] proposed a facial expression recognition framework for sequences of intercepted videos, conducting five experiments using feature-block texture variation and the feature-point movement trend of facial patches. Barman et al. [46] proposed combined and individual frameworks for facial expression recognition using texture, distance signature, and stability indices from a feature set, showing better performance for the FER system. Some recent work on facial expression recognition (FER) systems is summarized in Table 1.

Choosing appropriate neural network architectures and CNN models for feature representation is the most challenging part of building an automatic facial expression recognition system (FERS). Most real-world in-lab facial expression databases collected from the internet suffer from data sparsity, residual noise, and low intra-class and high inter-class variance, leading to poor recognition performance. In comparison, facial images collected from videos suffer significant variations in illumination, age, resolution, and blur. If facial features are ambiguous, visual information around the face is highly important for recognizing the expression. As emotion labels do not provide much information about inter-class dissimilarities, it is essential to learn fine-grained expression representations for inter-class comparison, as shown in Fig. 2. Here, Fig. 2(i) shows only happy-class images, where the right two images are more similar to each other than to the first; Fig. 2(ii) shows two different categories, where the right two images are visually identical to the left one, which belongs to the angry class; Fig. 2(iii) shows three different categories, where the left two images show visually more similar expressions than the third. However, face recognition is a generic object detection task. Our primary objective is to locate face regions, known as the classification region, covering an extensive range of scales. Faces have their own unique structural configuration corresponding to each attribute and action unit (AU), i.e., the distribution of different parts of the face and characteristics (e.g., skin color). These facial attributes and features have significant visual variations, such as pose variation, occlusion, and illumination changes, imposing substantial challenges for automatic real-time facial expression recognition. Most facial expression databases have been collected from videos, the internet, and the real world in-the-wild, and have been processed and synthesized in the lab under different collection conditions and subjective annotation systems. Necessarily, this causes data inconsistency and bias. However, modern adversarial mechanisms for image synthesis have overcome these issues [52].

3. Proposed method

In this paper, we have proposed a robust facial expression recognition system (FERS). The tree-structured part model detects the facial region from an input image in the pre-processing step. Then advanced feature learning techniques, image pre-processing, enhancement, bilinear feature representation, matrix normalization, and deep CNN models are employed. Further, image augmentation, transfer learning, score-level fusion, and fine-tuning strategies are utilized to overcome over-fitting problems and improve recognition performance. The functional flow diagram of the proposed method is shown in Fig. 3. Here, the proposed FERS has been introduced using two novel methodologies: (i) a basic convolutional neural network (CNNBasic) and (ii) a bilinear CNN [53] architecture (CNNBilinear). The bilinear CNN architecture establishes a dependency among large-scale spatial or spatiotemporal learning factors for the

Table 1
Demonstration of some recent works on facial expression recognition systems.

Sun et al. [47]
  Feature representation and classification: extracts discriminative global and local features from action units via global-based and region-based modules; a discriminative loss function maximizes inter-class variations and minimizes intra-class distances.
  Remarks: deep feature label fusion methods with an improved conditional generative adversarial network.

Kamal et al. [48]
  Feature representation and classification: feature extraction by embedding 2D-LDA and 2D-PCA; Support Vector Machine and K-Nearest Neighbor classifiers.
  Remarks: a human–computer interaction method as a medium for children who cannot emote verbally.

Yang et al. [49]
  Feature representation and classification: vision transformer, multi-layer perceptron, and cross-attention mechanism for feature representation; soft-max classifier for classification tasks.
  Remarks: face parsing and vision transformer along with a cross-attention based method.

Yolcu et al. [50]
  Feature representation and classification: cascaded CNNs, where the first CNN segments crucial facial components and the second performs classification using a softmax classifier.
  Remarks: facial expression recognition to distinguish between different neurological conditions, such as children's autism spectrum disorders.

Yan et al. [51]
  Feature representation and classification: Multi-Feature Fusing Local Directional Ternary Pattern along with Principal Components for feature extraction; Support Vector Machine as classifier.
  Remarks: Multi-Feature Fusing Local Directional Ternary Pattern.

Fig. 2. Intra-Inter class variations of fine-grained facial expressions due to ages, genders, poses, lighting, and illuminations.

Fig. 3. Proposed system’s working flow diagram.

visual recognition tasks. This bilinear CNN model uses a matrix-rank restriction as a separable filter interpreter that better represents the facial images as matrices and tensors containing rich hidden textural information.

The facial expressions are mainly defined by action unit (AU) detection and FACS measurement. We propose extending our CNN model-based image enhancement training to incorporate high-level goals to solve many facial expression classification problems under occlusions and pose variations. Our contributing method adaptively enhances features on a per-input-image basis via bilinear pooling, enabling the CNN to selectively enhance essential features from the region of interest, which significantly improves image classification. The proposed facial expression recognition system (FERS) copes with the following tasks: (a) pre-processing: region-of-interest extraction from the normalized image I; (b) the visual quality of the normalized

images I are enhanced by a bilinear model for better represen- 3.2. Feature representation for emotion recognition
tation of features; (c) feature representation by informative and
effective facial feature synthesis; (d) finally, selection of suitable Visual feature extraction provides a semantic and robust rep-
and appropriate classifier for final facial expressions recognition. resentation to recognize different objects. Feature representation
We have employed training and testing data distribution match- is essential in facial expression identification and emotion classi-
ing techniques to obtain the best performance recognition. Here, fication tasks. Raw and natural features are not feasible to directly
pre-processing steps are different from training and testing. The classify the facial expressions. This work employs robust and
tree-structured part model has been applied to detect the face powerful local-to-global feature representation schemes as fea-
region from an input image I . This detection of the face region is ture synthesis and representation. Some valuable methods enrich
based on the selection of top-down corner points on I that help the discriminating information during features extraction from
to predict a rectangular box region of the facial region F , which is facial regions, which are as follows:
further normalized to F200×200×3 image, fed into the CNN models.
Features synthesized from a particular AU represent a specific • Enhancement method: Image enhancement is adjusting
feature. The receptive field (RF) measures the feature extracted digital images to make the results more suitable for feature
from a fixed-size region of the input face. These receptive fields learning and analysis. The enhancement or noise-removal
are used in CNN activation layers to measure the association techniques are applied to facial regions [27]. The bilinear
of an output feature of the first hidden layer to the input re- pooling and matrix normalization techniques enhance the
gion or patch using a particular feature map function. The first denoised images, eliminate the residual noise, and make
block of CNN combines receptive fields (filters), sub-sampling feature extraction more optimal and redundant.
layers, and weight replication to cope with invariance, i.e., pose, • Fine Grained Feature Adaptation: Adapting local features
different image scales, and lighting conditions. Recently, deep for fine-grained facial expression recognition tasks, such as
learning [7] and ultra-deep learning [54] techniques have revolu- identifying facial emotions like anger, fear, disgust, happi-
tionized the FERS by providing a more profound representation of ness, surprise, and sadness, is quite challenging [53]. Here,
facial images and improved the results for the fine-grained facial the local features are extracted from a particular action unit
expressions recognition. Here the success behind these tech- or block by looking at some specific pattern of facial muscle
niques includes complicated network architecture that generates movement and their characteristics, behaviors, and activi-
millions of parameters using an ensemble of feature learning al- ties. These patterns and activities have changed due to micro
gorithms [55] with novel training techniques by utilizing bilinear or tiny variation or actions on facial muscles, eyes, nose, lips,
pooling [56], bilinear CNN [40], and L2-Regularization [57] meth- and forehead. The factors responsible for causing these mi-
ods. The proposed method uses different hyper-parameter learn- cro information changes are viewpoint changes pose varia-
ing techniques such as data augmentation, matrix normalization, tions, and location of the object within the input facial image
transfer learning, and fine-tuning layers. Here, the pre-trained F . Here, the proposed CNN architectures have orderless bi-
bilinear network models extract meaningful, orderless, invariant linear pooling strategies followed by domain-specific trans-
non-redundant, and optimal texture features from the input facial fer learning and fine-tuning parameters to convey more gen-
image I . Finally, a suitable softmax classifier is adapted to do eralized and discriminative contextual information for fine-
the classification task. Here, the features are bilinear, and the grained facial expression recognition. These proposed net-
linear classifier represents the output as a product of two low- works have significantly outperformed the standard bench-
rank matrices. The employed facial expression datasets consist of mark fine-grained facial databases and compete with en-
seven different expression classes (Fig. 1). Due to limited samples couraging performance over the existing state-of-art meth-
in the datasets containing tiny images, the over-fitting problem ods.
generates uncertain results, i.e., the deep learning model fails to • Bi-Linear Feature: The current techniques of facial image
produce the desirable results. To overcome these issues, we have manipulation with synthesized texture attributes to repre-
performed some image augmentation techniques [55]. sent facial features are performed on top of the two CNN
activations layers [57]. These orderless facial features are
3.1. Face pre-processing obtained by calculating the outer product of two feature
vectors extracted from a hybrid approach of the two CNN
The proposed deep learning model recognizes facial expressions for real-time applications in a diversified field of applications. Here, the employed images are labeled with the Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral expressions. The images are either frontal or profile. The facial region is detected from these images using the tree-structured part model [58], applied on input image I to obtain the facial region image F. The steps for face detection are shown in Fig. 4. The detected facial landmarks cover several discriminative regions, i.e., receptive fields (Action Units). These regions build intra-domain mappings, or feature functions, that correlate within holistic local areas, and intra-domain feature functions across different parts. Here, the task is to categorize each facial expression shown in Fig. 1 into one of the seven categories {Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise}. The advantages of the face pre-processing task are to reduce the proposed model's processing time and to extract the more effective and crucial features from the input facial images [59]. Some image enhancement techniques have been employed to improve the quality of blurred and low-resolution images and thereby improve performance.

The synthesized features obtained from the hidden layers of the deep convolutional neural network are known as bilinear features [60]. Bilinear features are usually represented by the Gram matrix and are suitable for texture synthesis and for modeling orderless texture representations. Given an input image F, let l_i be the ith layer index, where i = 1, . . . , N; we obtain a set of features F_{l_i} = {ψ_j}, indexed at location j, by computing the activations of the CNN at layer l_i. The bilinear feature at layer l_i is denoted by β_{l_i}. Here, the bilinear feature β_{l_i} of F_{l_i} is obtained by computing the outer vector product of the two CNN networks' feature vectors ψ_j and aggregating them location-wise by the following statistical distribution:

    β_{l_i}(I) = (1/N) Σ_{j=1}^{N} ψ_j ψ_j^τ

The synthesized bilinear features captured from the inner convolutional layers of the unified parametric CNN models enhance the facial image F visually, significantly better than other enhancement techniques such as wavelet coefficients and first- and second-order gradients. Since the bilinear features form a high-dimensional feature vector, their major drawback is the memory overhead of storing them.
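The location-wise aggregation above can be sketched in a few lines of NumPy (a minimal illustration under assumed shapes, not the authors' code; the 6 × 6 × 512 feature map is an arbitrary example):

```python
import numpy as np

def bilinear_feature(feature_map):
    """Aggregate CNN activations into a bilinear (Gram) feature.

    feature_map: array of shape (H, W, D), the activations psi_j at each
    spatial location j of one convolutional layer.
    Returns the (D, D) matrix beta = (1/N) * sum_j psi_j psi_j^T.
    """
    h, w, d = feature_map.shape
    psi = feature_map.reshape(h * w, d)   # N = H*W location vectors
    return psi.T @ psi / (h * w)          # outer products, averaged over locations

# Example: a random 6x6 feature map with 512 channels.
rng = np.random.default_rng(0)
beta = bilinear_feature(rng.standard_normal((6, 6, 512)))
print(beta.shape)  # (512, 512) -- a symmetric Gram matrix
```

The resulting matrix is orderless: permuting the spatial locations leaves β unchanged, which is exactly why it behaves as a texture descriptor.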
S. Hossain, S. Umer, R.K. Rout et al. Applied Soft Computing 134 (2023) 109997
Fig. 4. Face detection and normalization in the image pre-processing task.
• Feature Map: Feature mapping is denoted by φ : R^w ↦ R^W and is defined by the output of a set of neurons in the convolution layers that share equal weights at different locations of an input image F. Here, the bilinear features are pooled and stacked across scales. Stacking allows a subsequent classifier to learn how to optimally combine image statistics over various scales, at the cost of increased bilinear texture descriptor dimensionality. The convolved bilinear feature maps are padded with zeros, while in the case of dense evaluation, padding for the same crop naturally comes from the adjacent parts of the image. We then perform global average pooling on the resulting feature map, which produces a θ-D texture image descriptor. It substantially increases the network's receptive field to capture a more informative context.
• Matrix Normalization: The performance of the proposed method is significantly improved by matrix square-root normalization. The combined representation of L2 normalization and the element-wise square root, known as matrix logarithm normalization, has improved the network model's computations. The matrix square root is effectively used to normalize covariance matrices and for classification tasks. For a given positive semi-definite matrix M, the square root is the matrix Z satisfying ZZ = M, computed from the singular value decomposition (SVD) M = U Σ U^T as Z = U Σ^{1/2} U^T. The square root and its gradient, computed via the SVD, are used for matrix normalization. Here, the gradient of the bilinear feature matrix is required and is obtained from Eq. (1); this computation is numerically more stable than matrix backpropagation through the SVD. The matrix normalization technique also lets us trade off memory: the square root and its gradient computation scale the memory overhead over a few iterations for the large-scale matrices representing bilinear features.

    M^{1/2} (∂L/∂M) + (∂L/∂M) M^{1/2} = ∂L/∂Z    (1)

• Second-order pooling (O2P): Second-order pooling (O2P), or bilinear pooling, is used to compute global image descriptors and for semantic image segmentation. For a given image F, our proposed CNN extracts, location-wise, a collection of features α_i for i = 1, . . . , N. Bilinear pooling extracts second-order statistics using the statistical distribution given in Eq. (2), where a small positive value ξ is added to the diagonal of the matrix χ:

    χ = (1/N) Σ_{i=1}^{N} (α_i α_i^τ) + ξ × I    (2)

where the matrix χ is of dimension (d1 × d1) and α_i represents a feature of dimension d1. Bilinear pooling aims to enhance the detection rate by providing spatial variance to the trained objects and by reducing the parameters learned by the proposed network model. However, bilinear pooling layers suffer from redundant features, over-fitting problems, and unstable second-order statistical information, and therefore require different regularizers. Bilinear pooling uses tensor sketching to aggregate and map lower-dimensional features and to approximate the bilinear features via the feature mapping of Eq. (2).
• Dropout: Dropout is a special regularization technique [61] that randomly drops features, across a set of layers, at a rate of 0.25. It is used to avoid the over-fitting problem. Making the model architecture larger makes the model more complex and slower, and combatting overfitting by combining the predictions of numerous large and complex neural network architectures at training and test time is difficult. This type of problem is addressed by the dropout regularization technique. The principle behind it is to randomly remove feature units and their connections from the neural network during training, which prevents units from co-adapting. Dropout effectively samples exponentially many different thinned networks during training. At test time, the effect of averaging the predictions of all these thinned networks is easily approximated by using a single unthinned network with scaled-down weights. Dropout thus successfully decreases overfitting and improves significantly over other regularization methods; employing dropout enhances the performance of the bilinear neural networks on FERS.
• L2-Regularization: Regularization helps the deep learning model learn the network parameters and reduces the model
Fig. 5. Demonstration of the proposed basic CNN architecture taking Fn×n×3 as input image.
Fig. 6. Demonstration of the proposed Bilinear CNN architecture taking Fn×n×3 as input image.
generalization error. Several regularization strategies place constraints and restrictions on parameter tuning. In this work, the proposed bilinear features are made suitable for predicting attributes by normalizing the features using L2-norm regularization and the signed square-root function.

Here, two different CNN architectures have been proposed by employing the above techniques, as follows:

• Basic CNN: For this network, an input image F_{n×n×r} is convolved with a set of filters, known as kernels, of size (t1 × t1). The mechanism of these hidden convolutional layers is known as the feature function or mapping. The feature maps are stacked with fixed-size kernels to provide multiple filters on the input. We have employed a (3 × 3) filter size with stride one and the Rectified Linear Unit (ReLU) activation function for each hidden convolution layer. The computational complexity of the CNNBasic model has been reduced by using (d1 × d1) pooling layers, which reduce the output size passed from one layer to the following hidden layers. A (2 × 2) max-pooling operation is performed to preserve the most significant features by selecting maximum elements [62]; the feature maps are thereby down-sampled to half their input size. The maps are flattened into one column to feed the pooled output from the stacked feature maps to the final layers. The final layers comprise two fully connected dense layers with N hidden nodes each; these two layers are regularized using the dropout technique. Finally, a Softmax layer follows the two fully connected layers, with the number of nodes equal to the number of expression classes. A detailed description of the CNNBasic model with input image size and the parameters generated at the convolution, max-pooling, batch normalization, activation, and dropout layers is given in Table 2. The architecture of CNNBasic is depicted in Fig. 5 for better understanding and clarity.
Table 2
Description of parameters in basic CNN with input image size and output shape.

Layers            Output Shape               Image Size       Parameters
Block-1
Conv2D(3x3)@16    (n, n, 16)                 (200, 200, 16)   ((3x3x3)+1)x16 = 448
Maxpool2D(2x2)    (n1, n1, 16), n1 = n/2     (100, 100, 16)   0
BatchNorm         (n1, n1, 16)               (100, 100, 16)   4x16 = 64
ActivationReLU    (n1, n1, 16)               (100, 100, 16)   0
Dropout           (n1, n1, 16)               (100, 100, 16)   0
Block-2
Conv2D(3x3)@32    (n1, n1, 32)               (100, 100, 32)   ((3x3x16)+1)x32 = 4640
Maxpool2D(2x2)    (n2, n2, 32), n2 = n1/2    (50, 50, 32)     0
BatchNorm         (n2, n2, 32)               (50, 50, 32)     4x32 = 128
ActivationReLU    (n2, n2, 32)               (50, 50, 32)     0
Dropout           (n2, n2, 32)               (50, 50, 32)     0
Block-3
Conv2D(3x3)@32    (n2, n2, 32)               (50, 50, 32)     ((3x3x32)+1)x32 = 9248
Maxpool2D(2x2)    (n3, n3, 32), n3 = n2/2    (25, 25, 32)     0
BatchNorm         (n3, n3, 32)               (25, 25, 32)     4x32 = 128
ActivationReLU    (n3, n3, 32)               (25, 25, 32)     0
Dropout           (n3, n3, 32)               (25, 25, 32)     0
Block-4
Conv2D(3x3)@64    (n3, n3, 64)               (25, 25, 64)     ((3x3x32)+1)x64 = 18496
Maxpool2D(2x2)    (n4, n4, 64), n4 = n3/2    (12, 12, 64)     0
BatchNorm         (n4, n4, 64)               (12, 12, 64)     4x64 = 256
ActivationReLU    (n4, n4, 64)               (12, 12, 64)     0
Dropout           (n4, n4, 64)               (12, 12, 64)     0
Block-5
Conv2D(3x3)@64    (n4, n4, 64)               (12, 12, 64)     ((3x3x64)+1)x64 = 36928
Maxpool2D(2x2)    (n5, n5, 64), n5 = n4/2    (6, 6, 64)       0
BatchNorm         (n5, n5, 64)               (6, 6, 64)       4x64 = 256
ActivationReLU    (n5, n5, 64)               (6, 6, 64)       0
Dropout           (n5, n5, 64)               (6, 6, 64)       0
Block-6
Conv2D(3x3)@96    (n5, n5, 96)               (6, 6, 96)       ((3x3x64)+1)x96 = 55392
Maxpool2D(2x2)    (n6, n6, 96), n6 = n5/2    (3, 3, 96)       0
BatchNorm         (n6, n6, 96)               (3, 3, 96)       4x96 = 384
ActivationReLU    (n6, n6, 96)               (3, 3, 96)       0
Dropout           (n6, n6, 96)               (3, 3, 96)       0
Block-7
Conv2D(3x3)@64    (n6, n6, 64)               (3, 3, 64)       ((3x3x96)+1)x64 = 55360
Maxpool2D(2x2)    (n7, n7, 64), n7 = n6/2    (1, 1, 64)       0
BatchNorm         (n7, n7, 64)               (1, 1, 64)       4x64 = 256
ActivationReLU    (n7, n7, 64)               (1, 1, 64)       0
Dropout           (n7, n7, 64)               (1, 1, 64)       0
Dense layers
Flatten           (1, n7 x n7 x 64)          (1, 64)          0
Dense             (1, 256)                   (1, 256)         (64+1)x256 = 16640
BatchNorm         (1, 256)                   (1, 256)         1024
ActivationReLU    (1, 256)                   (1, 256)         0
Dropout           (1, 256)                   (1, 256)         0
Dense             (1, 256)                   (1, 256)         (256+1)x256 = 65792
BatchNorm         (1, 256)                   (1, 256)         1024
ActivationReLU    (1, 256)                   (1, 256)         0
Dropout           (1, 256)                   (1, 256)         0
Dense             (1, 7)                     (1, 7)           (256+1)x7 = 1799
Total Parameters for The Input Image: 267975
Total Number of Trainable Parameters: 266215
Non-trainable params: 1760
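The parameter counts in Table 2 follow the standard rules params = (kh·kw·in_channels + 1)·out_channels for Conv2D and (n_in + 1)·n_out for Dense; a quick check of a few rows (plain Python, for illustration only):

```python
def conv2d_params(kh, kw, c_in, c_out):
    # (kernel height * kernel width * input channels + 1 bias) per filter
    return (kh * kw * c_in + 1) * c_out

def dense_params(n_in, n_out):
    # one weight per input plus one bias, per output unit
    return (n_in + 1) * n_out

# Rows from Table 2:
print(conv2d_params(3, 3, 3, 16))    # Block-1 Conv2D: 448
print(conv2d_params(3, 3, 16, 32))   # Block-2 Conv2D: 4640
print(conv2d_params(3, 3, 96, 64))   # Block-7 Conv2D: 55360
print(dense_params(64, 256))         # first Dense: 16640
print(dense_params(256, 7))          # output Dense: 1799
```

The BatchNorm rows contribute 4 parameters per channel (gamma, beta, moving mean, moving variance), of which only gamma and beta are trainable, which accounts for the non-trainable total.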
• Bilinear CNN Model: Recently, CNNs have been the most potent feature learners, via convolutional layers, and the most accurate general-purpose image classifiers, via flattened, fully connected layers and a softmax classifier. Here we present a faster, efficient, and robust bilinear CNN model (CNNBilinear) using gradient normalization techniques to improve the overall performance of the CNNBasic. The CNNBilinear requires a more extensive database than usual and a more extended pre-training schedule. CNNBilinear exploits locality information and does not have well-informed inductive biases. Its benefits include self-attention to process long-range interactions, parameter efficiency, and interactions between different regions, i.e., global dependencies, action units, and receptive fields in the input face image. The detailed description of CNNBilinear is given in Table 3, and the architectural diagram of CNNBilinear is shown in Fig. 6, respectively.
Here, the first hidden input layer has two convolutional layers. Each convolutional layer has 64 feature maps with filter size (3 × 3) and a rectified linear activation function (ReLU) with a stride of 1 pixel. ReLU introduces non-linearity so that the model classifies better, and it improves computational time. We have also tested our model with the hyperbolic tangent and sigmoid activation functions. Next, we define a pooling layer configured with a pool size of (2 × 2) and stride 2. The dropout regularization technique is used in the next layer, configured to randomly exclude 25% of the neurons in the layer to reduce overfitting. The second
hidden layer has the same configuration with 128 feature maps of dimension (3 × 3). The third hidden layer has four convolutional layers with 256 feature maps of dimension (3 × 3) and generates an (n2 × n2 × r1) feature vector, followed by max-pooling, activation, and dropout layers. The fourth and fifth hidden layers consist of four convolutional layers with 512 feature maps, followed by max-pooling, activation, and dropout layers. In this way, seventeen and twenty-one hidden layers are used in the CNNBilinear. We then use shape-extractor and shape-detector functions with Lambda layers for reshaping the feature vectors. Lambda layers accept arbitrary expressions and are convenient for simple stateless computation. A flatten layer converts the 3-D matrices into 1-D vectors, and the outer vector product multiplies the two reshaped detector and extractor vectors of dimension (d × d). After performing the vector outer product, we apply diverse matrix normalization techniques for better feature representation, which improves the recognition performance of our proposed CNNBilinear model. Finally, the output layer has seven neurons for the seven classes. A softmax activation function is employed to generate the output probabilities, i.e., class score predictions corresponding to the individual facial expression categories.

3.3. Factors affecting FERS performance

In the previous section, we described our proposed network configurations. This section details the additional advanced feature learning techniques and the hyperparameter tuning employed during training and evaluation for our proposed CNN model with the bilinear architecture.

• Augmentation: In this paper, advanced image augmentation techniques have been employed at training time to improve the model's ability to generalize and to reduce overfitting by artificially increasing the training samples through various transformation techniques [63]. The recognition performance is significantly improved by combining multiple image data augmentation techniques [64]. We have utilized a standard set of image data augmentation techniques for our proposed methodology, commonly used for image classification at multiple resolutions [8]. Each image sample of the training set is flipped randomly, horizontally, and vertically by applying image data generator techniques. We have also adopted transformation techniques like unsharp filtering, Gaussian blur, image scaling, Gaussian noise, bilateral filtering, image rotation and translation, image filling, contrast normalization, image zooming, and shearing [7].
• Softmax Classifier: The proposed model conducts initial hyperparameter selection and, after several trials and errors, obtains the score for each class using a combination of intra-class scores, employing the softmax classifier as the objective function to maximize the inter-class scores [8]. Most object detection deep learning models incorporate softmax classifiers for multiclass or key-point classification [7]. The classifier normalizes all features according to the number of positive feature classifications. Traditional softmax classifiers involve (i) a score calculator and (ii) a softmax loss. The conventional softmax classifier performs excellently for multiclass classification; it mitigates the vanishing gradient problems caused by conventional CNN classifiers under imbalanced data. The employed softmax classifier magnified the imbalanced data using image augmentation and suppressed the wrongly predicted low-shot categories. The softmax classifier generates a score for each class as output; the higher the calculated score, the more likely the test sample matches the corresponding class. The softmax scores range between 0 and 1; they are intuitive and have a probabilistic interpretation. The softmax classifier estimates the class probabilities against the true distributions and minimizes the cross-entropy. The advantages of softmax classifiers are: (i) the simplicity of the model structure makes it easiest to train and compute, and (ii) it is linearly separable for efficient classification problems. The disadvantages are that it does not support null rejection and that it may not perform well when training samples are few.
• Transfer-Learning: Transfer-learning techniques have been used to validate the effectiveness of our proposed model on standard benchmark data sets. Transfer learning is effective especially for texture synthesis, object detection, fine-grained texture recognition, visual question answering, and image classification tasks. The principal concept behind transfer learning is that low-level features learned by a model designed for one problem are reused, as a generic model, for newly related problems [21]. Transfer learning is an alternative training strategy for small-scale domain-specific databases. Here, it takes the parameters of the traditional CNNBasic transferred from a pre-trained model and fine-tunes them on the new database, which also reduces the training time cost compared with training a model from scratch.
• Fine Tuning: Fine-tuning allows the higher-order feature representations in the base model to be made more relevant for FERS tasks [21]. Half of the parameters are trained during experimentation using the pre-trained weights, and the remaining half are kept frozen. The proposed bilinear model for fine-grained facial action unit identification and facial expression recognition produces optimum performance both by fine-tuning all layers and by fine-tuning only the fully connected layers.
• Fusion: Fusion is done at the match score level to make the final recognition decision. Two types of fusion methods are available, i.e., feature-level fusion and score-level fusion [65]. Feature-level fusion converts two unimodal systems into one, i.e., it combines two feature vectors into a single feature vector. In score-level fusion, on the other hand, the match scores obtained by the classifiers are fused to make the final decision. Here, the proposed CNNBilinear model processes images of arbitrary size and generates output indexed by feature channel and image location. During experimentation on a test sample F for facial expression recognition, we obtain two score vectors, SBasic for the basic model and SBilinear for the bilinear model, where SBasic = (s_1^1, s_2^1, . . . , s_7^1) and SBilinear = (s_1^2, s_2^2, . . . , s_7^2). Here, each s_k^1 and s_k^2 represents the classification score obtained by the CNNBasic and CNNBilinear model, respectively, for the kth expression class. These classification scores have been fused using post-classification score-level fusion approaches [55] and [8]. We have employed two techniques for score-level fusion, i.e., the sum-rule and the product-rule, defined as follows:

    max_{i≠j, k={1,...,7}} { s_k^i × s_k^j }    (3)

    max_{i≠j, k={1,...,7}} { s_k^i + s_k^j }    (4)

4. Experimentation

In this section, we experiment with and analyze our proposed facial expression recognition system. To validate the effectiveness
Table 3
Description of parameters in CNNBilinear with input image size and output shape.

Layers                  Output Shape                 Image Size        Parameters
Block-1
conv1 Conv2D(3x3)@64    (n, n, 64)                   (200, 200, 64)    ((3x3x3)+1)x64 = 1792
conv2 Conv2D(3x3)@64    (n, n, 64)                   (200, 200, 64)    ((3x3x64)+1)x64 = 36928
Maxpool2D(2x2)          (n1, n1, 64), n1 = n/2       (100, 100, 64)    0
Dropout                 (n1, n1, 64)                 (100, 100, 64)    0
Block-2
conv1 Conv2D(3x3)@128   (n1, n1, 128)                (100, 100, 128)   ((3x3x64)+1)x128 = 73856
conv2 Conv2D(3x3)@128   (n1, n1, 128)                (100, 100, 128)   ((3x3x128)+1)x128 = 147584
Maxpool2D(2x2)          (n2, n2, 128), n2 = n1/2     (50, 50, 128)     0
Dropout                 (n2, n2, 128)                (50, 50, 128)     0
Block-3
conv1 Conv2D(3x3)@256   (n2, n2, 256)                (50, 50, 256)     ((3x3x128)+1)x256 = 295168
conv2 Conv2D(3x3)@256   (n2, n2, 256)                (50, 50, 256)     ((3x3x256)+1)x256 = 590080
conv3 Conv2D(3x3)@256   (n2, n2, 256)                (50, 50, 256)     ((3x3x256)+1)x256 = 590080
conv4 Conv2D(3x3)@256   (n2, n2, 256)                (50, 50, 256)     ((3x3x256)+1)x256 = 590080
Maxpool2D(2x2)          (n3, n3, 256), n3 = n2/2     (25, 25, 256)     0
Dropout                 (n3, n3, 256)                (25, 25, 256)     0
Block-4
conv1 Conv2D(3x3)@512   (n3, n3, 512)                (25, 25, 512)     ((3x3x256)+1)x512 = 1180160
conv2 Conv2D(3x3)@512   (n3, n3, 512)                (25, 25, 512)     ((3x3x512)+1)x512 = 2359808
conv3 Conv2D(3x3)@512   (n3, n3, 512)                (25, 25, 512)     ((3x3x512)+1)x512 = 2359808
conv4 Conv2D(3x3)@512   (n3, n3, 512)                (25, 25, 512)     ((3x3x512)+1)x512 = 2359808
Maxpool2D(2x2)          (n4, n4, 512), n4 = n3/2     (12, 12, 512)     0
Dropout                 (n4, n4, 512)                (12, 12, 512)     0
Block-5
conv1 Conv2D(3x3)@512   (n4, n4, 512)                (12, 12, 512)     ((3x3x512)+1)x512 = 2359808
conv2 Conv2D(3x3)@512   (n4, n4, 512)                (12, 12, 512)     ((3x3x512)+1)x512 = 2359808
conv3 Conv2D(3x3)@512   (n4, n4, 512)                (12, 12, 512)     ((3x3x512)+1)x512 = 2359808
conv4 Conv2D(3x3)@512   (n4, n4, 512)                (12, 12, 512)     ((3x3x512)+1)x512 = 2359808
MaxPooling2D(2x2)       (n5, n5, 512), n5 = n4/2     (6, 6, 512)       0
Dropout                 (n5, n5, 512)                (6, 6, 512)       0
Block-Bilinear Pooling and Normalization Layers
Reshape (Reshape)       (36, 512)                    (1, 512)          0
Reshape_1 (Reshape)     (36, 512)                    (1, 512)          0
Lambda (Lambda)         (512, 512)                   (1, 512)          0
Reshape_2 (Reshape)     (None, 262144)               (1, 512)          0
Lambda_1 (Lambda)       (None, 262144)               (1, 512)          0
Dense                   (None, 7)                    (1, 7)            (262144x7)+7 = 1835015
Activation              (None, 7)                    (1, 7)            0
Total Parameters for The Input Image: 21,859,399
Total Number of Trainable Parameters: 1,835,015
Non-trainable params: 20,024,384
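The bilinear pooling block of Table 3, in which two (36, 512) reshaped feature maps are combined by an outer product into a 262144-D descriptor, can be sketched as follows (a NumPy illustration of the tensor shapes only, not the authors' Keras Lambda code):

```python
import numpy as np

def bilinear_head(extractor, detector):
    """Combine two (36, 512) location-wise feature maps into one
    512x512 bilinear matrix, then flatten to a 262144-D descriptor,
    mirroring the Reshape/Lambda block of Table 3."""
    n = extractor.shape[0]               # 36 spatial locations (6x6 grid)
    pooled = extractor.T @ detector / n  # (512, 512) outer-product average
    return pooled.ravel()                # 512*512 = 262144 features

rng = np.random.default_rng(2)
ext = rng.standard_normal((36, 512))
det = rng.standard_normal((36, 512))
descriptor = bilinear_head(ext, det)
print(descriptor.shape)   # (262144,)
# The final Dense layer then maps 262144 -> 7 classes:
print(262144 * 7 + 7)     # 1835015 trainable parameters, as in Table 3
```

Because only this Dense layer is trainable here, the huge convolutional backbone accounts for the 20,024,384 non-trainable parameters.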
and performance of the proposed FERS method, we experimented on three of the most challenging and widely used benchmark facial expression databases. Each database has fixed training and testing sets provided by the database-originating institutes and universities. The optimum performance of the proposed system is reported for each individual database.

4.1. Employed databases

Here, we have employed the Static Facial Expressions in the Wild 1.0 (SFEW 1.0), Static Facial Expressions in the Wild 2.0 (SFEW 2.0), and Indian Movie Face Database (IMFDB) benchmark databases. These databases are publicly available for facial expression recognition purposes. The datasets are most challenging because the images are not aligned. Some images are not correctly labeled, as can be observed from the images shown in Figs. 7–9. Moreover, some samples do not contain faces, since the images have low resolution and blurring and are impeded by sun-glasses and beards. At the same time, the IMFDB database contains small facial texture images. The intra-class variations shown in Fig. 2 are more significant than the inter-class variations due to diverse factors like pose variations, lighting conditions, illuminations, and expressions. It also suffers from low-resolution face images. Detailed descriptions of the employed databases are given below:

• SFEW 1.0: Our first employed database is Static Facial Expressions in-the-Wild 1.0 (SFEW 1.0) [66], which was built from the AFEW video database by selecting keyframes based on facial point clustering. The SFEW 1.0 database contains seven hundred (700) images split into a train set (346 images) and a test set (354 images). The database has seven facial expression classes, where each image is labeled with one of the seven expression classes, i.e., afraid, sadness, disgust, happiness, neutral, anger, and surprise. Fig. 7 shows some image samples from the SFEW 1.0 database.
• SFEW 2.0: The second employed data set is SFEW 2.0, collected from AFEW [10]. The SFEW 2.0 database contains 773 training, 383 validation, and 653 testing samples. The AFEW database is collected from videos, movies, and television serials. SFEW 2.0 has seven types of facial expressions; each image is labeled with one of the seven expression classes, i.e., afraid, anger, disgust, happiness, neutral, sadness, and surprise. The SFEW 2.0 database suffers from various poses, spontaneous expressions, and illuminations. This
Fig. 7. Some examples of image samples from SFEW 1.0 database.
Fig. 8. Some examples of image samples from SFEW 2.0 database.
in-the-wild database contains rich and extensively challenging facial expressions captured from unconstrained imaging environments. Fig. 8 shows some examples from the SFEW 2.0 database.
• IMFDB: The third employed database is the Indian Movie Face Database (IMFDB). The IMFDB database was created manually by selecting and cropping video frames extracted from Indian movies. The database contains 34512 face images corresponding to 100 Indian actors; many unconstrained images have been collected from more than 103 movies. Among the 34512 images, we have selected 8002 images for our proposed FERS: 4002 randomly selected images as training samples and 4000 as test samples. The IMFDB database has seven facial expressions; each image is labeled with one of the seven expression classes, i.e., afraid, disgust, anger, happiness, neutral, surprise, and sadness. The database exhibits a high degree of variability and enormous diversity in scale, expression, pose, age, illumination, occlusion, resolution, and makeup. Fig. 9 shows some facial image samples from the IMFDB database.

Table 4 summarizes these databases. The proposed FERS considers: (1) the annotations comprise the classification region, action units, receptive fields, and emotions; (2) the expressions in these databases are ordinary and generic in people, and the databases provide a web interface to scan them, including integrated search; (3) the other expressions used in affective computing research areas are composed of these basic facial expressions; (4) a sound recognition system for these expressions will benefit real-world applications such as e-healthcare frameworks, business organizations, the social internet-of-things (IoT), and emotion-AI.

4.2. Results and discussions

This section explains the step-by-step experimentation of the proposed facial expression recognition system (FERS). We describe how we enhanced our basic CNN model into the deep bilinear convolutional model and improved the efficiency and recognition performance of the system. We have analyzed our basic CNN and bilinear model architectures and incorporated advanced feature learning strategies such as augmentation, transfer learning, fine-tuning, hyperparameter selection, and matrix normalization techniques. The proposed FERS has been implemented with Python 3.7.9, Keras 2.4.3, Tensorflow 2.3.1, CUDA 11.2, and NVIDIA-SMI driver 460.79 on Windows 10 Pro 64-bit, with an Intel(R) Core(TM) i7-9700 CPU at 3.30 GHz (8 CPUs), an 8 GB NVIDIA GeForce RTX 2080 SUPER XLA GPU device, and 16 GB RAM. During experimentation, we considered three-channel color images; the databases are a mixture of RGB color images and gray-scale images. During image pre-processing (discussed in Section 3.1), we detected and cropped the required facial region F from each input image I. The cropped face image F was then normalized to a standard image F ∈ R^(200×200×3). Here, the proposed deep neural network model takes as input a mixture of color facial images of sizes (100 × 100 × 3) and (200 × 200 × 3).

4.2.1. Experiment using CNNBasic

In this work, we first experiment with our earlier proposed CNNBasic architecture, depicted in Fig. 5. During training of our proposed model, we fed input images F ∈ R^(100×100×3) and F ∈ R^(200×200×3) into this model. The training images are used to train the CNNBasic model, and the trained CNNBasic model's performance is measured using the test
Fig. 9. Some examples of image samples from IMFDB database [34].
Table 4
Description of employed facial expression databases under various conditions.

Database    Class   # of training images   # of testing images   Remark
SFEW 1.0    7       346                    354                   Unconstrained
SFEW 2.0    7       773                    653                   Unconstrained
IMFDB       6       4002                   4000                  Constrained/Unconstrained
Table 5
Performance (in Accuracy (%)) of the proposed FERS using CNNBasic architecture.

Database    Data augmentation   No data augmentation
SFEW 1.0    37.91               33.11
SFEW 2.0    41.79               34.21
IMFDB       38.93               32.11
samples. Here, the critical task of the CNNBasic model is to learn the parameters generated inside the network model's hidden layers. This depends on two factors, the 'batch size' and 'the number of epochs', which significantly impact the network model while learning the parameters from the training samples. Here, we have fixed a batch size of 16 and 2000 epochs as the trade-off between batches and epochs. Advanced image data augmentation techniques have been employed to avoid over-fitting and to improve the system's recognition performance. The recognition performance in terms of the accuracy metric is reported in Table 5, and it is observed that the performance of the proposed FERS with CNNBasic increases due to the data augmentation techniques. Still, these performances need improvement, so the factors affecting them are examined in the following subsections.

Effect of Transfer Learning: Here, we have followed two diverse approaches for the transfer learning method. (i) First, we train fresh CNN¹Basic and CNN²Basic models corresponding to each image size, i.e., training the models from scratch: the CNN¹Basic architecture is trained with image size 100 × 100 × 3 and its performance is obtained. Then, with the progressive image size 200 × 200 × 3, the CNN²Basic architecture is retrained such that only the upper layers of CNN²Basic are trained with F_{200×200×3} images, while the lower layers taken from CNN¹Basic remain untrained. So, in the transfer learning of CNNBasic, both (i) freezing some layers and (ii) whole-model training approaches have been employed, and the performance of the FERS is affected by this choice; these performances are shown in Fig. 10.

Fig. 10. Performance (in Accuracy (%)) of the proposed FERS using Basic CNN architecture due to Effect of Transfer Learning. Here approach 1 is freezing some layers, and approach 2 is the whole model trained approach, (a) results without data augmentation, and (b) results due to data augmentation.
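The 'freeze the lower layers, retrain the upper layers' idea behind approach 1 can be sketched with a toy two-layer network (a NumPy illustration of the mechanism only; the real experiment retrains the upper convolutional blocks of CNNBasic, and all shapes here are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Pre-trained lower layer (frozen) and upper layer (to be fine-tuned).
w_lower = rng.standard_normal((8, 4))   # "transferred" weights, kept fixed
w_upper = rng.standard_normal((4, 2))   # retrained on the new data

x = rng.standard_normal((16, 8))        # a batch of toy inputs
y = rng.standard_normal((16, 2))        # toy regression targets

w_lower_before = w_lower.copy()
lr = 0.01
for _ in range(50):
    h = np.maximum(x @ w_lower, 0.0)    # frozen feature extractor (ReLU)
    pred = h @ w_upper
    grad_upper = h.T @ (pred - y) / len(x)
    w_upper -= lr * grad_upper          # only the upper layer is updated

assert np.array_equal(w_lower, w_lower_before)  # lower layer untouched
```

In Keras the same effect is obtained by setting `trainable = False` on the transferred layers before compiling; gradients are then simply never applied to them.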
Table 6
Description of employed parameters for the CNNBilinear.

Parameter's name          Value
Image height              200
Image width               200
Number of classes         7
No. last-layer backbone   17 and 21
Optimizer                 Stochastic Gradient Descent
Learning rate             0.001
Learning rate decay       0.1
Weight decay rate         1e-08
Initializer               Glorot Normal
Logits activation         Softmax
Loss                      Categorical Crossentropy
Debug flag                True
4.2.2. Bilinear CNN architecture

Object detection with the traditional basic CNN framework must be proficient in handling significant intra-class and inter-class scale variation, residual noise, data sparsity, occlusion, and blur due to poor contrast or low resolutions. The proposed CNNBilinear model reduces the computational cost and average processing time. This CNNBilinear model replaces the several fully connected (FC) layers with second-order pooling or bilinear pooling and two linear layers, described in Eq. (2) in Section 3.2, for the shape extractor and shape detector. This model compares favorably to the traditional CNNBasic architecture in terms of speed. The scalability of the bilinear model for the proposed CNN is more robust, as the model requires only image labels and can easily be applied to various fine-grained facial expression databases. Hence, this CNNBilinear achieves encouraging state-of-the-art performance.

Our proposed FERS model using the bilinear architecture learned to focus on the nose and mouth region to make predictions for disgust; the eyes, eyebrows, and forehead folds for fear; only the mouth for happiness and sadness; and the nose and eyes for surprise and disgust. For neutral images, it focuses on all parts of the face except the nose, which makes sense, given that small changes in non-nose regions tend to correspond to emotional changes. Some advanced image augmentation techniques have been utilized to enhance the number of training samples. These enhanced training samples better represent the facial features extracted from the receptive field and attributes. This helps to learn the parameters of the CNN model and obtain better recognition performance. The data augmentation technique is vital in avoiding over-fitting problems and adopting the diversity of training samples. In this experimentation, training the CNN model to recognize people's facial emotions from an input face image amounts to learning the weights ω in Eq. (5) and biases ζ in Eq. (6) present in the network. A stochastic gradient descent optimizer has been used to learn the CNN model with the set of hyperparameters listed in Table 6.

ωij(ι + 1) = ωij(ι) + a⃗ ωij(ι) − λµ ωij(ι) − µ (∂Ll/∂ωij)        (5)

ζ(ι + 1) = ζ(ι) + a⃗ ζ(ι) − λµ ζ(ι) − µ (∂Ll/∂ζ)        (6)

where λ is the L2 weight decay, µ is the learning rate, and a⃗ is the momentum. These parameters are tuned and tested with the standard benchmark databases to achieve better results. Using this technique contributes to enhancing the effectiveness and efficiency of the model in detecting emotions on a face by extracting essential features for learning about images having a high degree of variability in their appearance. The use of the hyper-parameters listed in Table 6 makes the proposed deep CNN model robust. Here, the momentum and learning rate speed up the network during training. Next, the weight decay prevents the weights from growing too large and can be seen as a quadratic regularization. It enables the model to remain strong in classification tasks when exposed to a new test sample.

We have also applied an image data generator to perform data augmentation techniques. Then we built a batch of labels for each input facial image F. The effect of the data augmentation technique on the performance of our proposed system has been indicated in Table 7. Here, it is observed from the Table that the proposed FERS has obtained extremely encouraging performance for F200×200×3 images for the SFEW 1.0, SFEW 2.0, and IMFDB databases. It has also been observed that encouraging performance is achieved after applying image augmentation techniques.

Hence, from the performance reported in Table 7, it is noticed that the convolutional blocks used in the Bi-Linear CNN are ultra-deep convolutional networks (VGG16) trained on the massive ImageNet ILSVRC Visual Recognition dataset. For further experiments, the performance of FERS with other convolutional networks such as VGG19, ResNet-50, Inception-v3, and Basic-CNN (Fig. 5) has been demonstrated in Fig. 11 using the CNNBilinear architecture.

Fig. 11. Performance (in Accuracy (%)) of the proposed FERS using the Bi-linear CNN architecture due to different CNN architectures such as VGG16, VGG19, ResNet-50, Inception-v3, and Basic-CNN, where the input image size is 200 × 200 × 3.

Fig. 12. Performance (in Accuracy (%)) of the proposed FERS using (a) Bi-linear CNN (Approach 1), (b) Hyper-parameters (Approach 2), and (c) Transfer Learning (Approach 3) techniques.

From Tables 5 and 7, it can be concluded that our proposed deep CNN method using the bilinear architecture performs better than the CNNBasic methods for F ∈ R200×200×3 images. Several factors affect the performance of the FERS due to the Bilinear CNN model, which are as follows:

• Effect of hyperparameter tuning and estimation: The impact of hyperparameter tuning and estimation is two-fold.
Table 7
Performance (in Accuracy (%)) of the proposed FERS using the Bi-linear CNN architecture.

Database    Image size    Without data augmentation    With data augmentation
SFEW 1.0    100 × 100     35.33                        41.54
SFEW 1.0    200 × 200     38.47                        42.05
SFEW 2.0    100 × 100     53.25                        64.89
SFEW 2.0    200 × 200     63.10                        71.29
IMFDB       100 × 100     45.99                        48.73
IMFDB       200 × 200     48.99                        54.87
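The second-order pooling that CNNBilinear substitutes for the fully connected layers can be sketched in NumPy as a pooled outer product of the shape-detector and shape-extractor feature maps, followed by element-wise signed square-root and L2 normalization — a common bilinear-CNN recipe; the dimensions below are illustrative, not the paper's exact layer sizes:

```python
import numpy as np

def bilinear_pool(f_detector, f_extractor):
    """Sum of outer products of two conv feature maps (h, w, d) over all
    spatial locations, then signed square-root and L2 normalization."""
    h, w, d = f_detector.shape
    fa = f_detector.reshape(h * w, d)
    fb = f_extractor.reshape(h * w, d)
    z = (fa.T @ fb).ravel()                  # d x d bilinear feature, flattened
    z = np.sign(z) * np.sqrt(np.abs(z))      # element-wise signed square-root
    return z / (np.linalg.norm(z) + 1e-12)   # L2 normalization

rng = np.random.default_rng(0)
feat = bilinear_pool(rng.random((7, 7, 16)), rng.random((7, 7, 16)))
```

The pooled feature is a fixed-length d² vector regardless of the spatial resolution, which is what lets the two streams feed a small classifier head directly.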
First, the optimization is more biased than the basic CNN because the feature map denoted by φ is an identity mapping. Second, the use of feature normalization improves the regularization of the model. Here, better accuracy has been achieved for the proposed deep bilinear model on the wild databases SFEW 1.0, SFEW 2.0, and IMFDB by cross-validating the hyper-parameters listed in Table 6.

• Effect of transfer learning: Here, three different approaches have been employed for the transfer learning technique to validate its effectiveness on the standard facial expression databases using the two deep CNN models. An input image of size F ∈ R200×200×3 is processed through two deep CNN models and produces two output feature vectors of size m × n × d, one from the shape detector CNNBilinear and another from the shape extractor CNNBilinear. We have taken the pooled outer product of these two reshaped feature vectors of dimension d × d and proceed with the transfer learning steps as follows: (i) initialize the model with a pre-trained model learned on a large ImageNet database; (ii) train the entire model several times with 1000 or more epochs, truncated using feature functions at an intermediate layer at a particular resolution; and finally, (iii) perform end-to-end fine-tuning, i.e., the last batch norm with a higher resolution and the fully connected layer. Here, we have observed that the performance of CNNBilinear is more efficient for higher-resolution images. Fig. 12 demonstrates the effect of hyperparameter tuning and transfer learning techniques on the FERS for the SFEW 1.0, SFEW 2.0, and IMFDB databases.

• Impact of assembling diverse normalization techniques: In unconstrained facial expression recognition, accuracy will drop significantly due to pose variations. The critical solution is to combine different normalization techniques. Here, matrix normalization functions improve the bilinear convolutional feature representation power for training and evaluation. The better performance of the proposed model proves that the L2 normalization has a more substantial stabilization outcome than the individual normalizations, i.e., matrix square-root normalization and element-wise signed square root. This is due to the sparsity and extra low-rank regularizers, which successfully suppress the trivial elements in the bilinear features. Hence, the model is more generalized and discriminative.

• Effect of layer-wise fine-tuning of parameters: We successfully applied fine-tuning of the parameters for the whole network, the matrix square-roots, and the element-wise square-roots successively in the combinational layer. The Lyapunov technique described in Eq. (1) in Section 3.2 is used to compute gradients for this experimentation. Table 8 demonstrates the effect of assembling diverse normalization and layer-wise fine-tuning techniques on the performance of the proposed FERS for the SFEW 1.0, SFEW 2.0, and IMFDB databases.

Table 8
Impact of assembling diverse normalization techniques and transfer learning on bi-linear CNN architectures.

Database    AD normalization    Fine tuning with all layers    Fine tuning with only fully connected layers
SFEW 1.0    47.05               47.15                          47.56
SFEW 2.0    77.12               79.88                          81.05
IMFDB       59.45               61.29                          62.11

4.2.3. Experiments using fusion

For the proposed FERS model, score-level fusion techniques have been utilized to enhance the performance of the proposed system on the employed databases, as compared to feature-level fusion. Feature-level fusion has some limitations in the deep learning framework, whereas the score-level fusion technique produces an optimum result for occluded images. The score fusion techniques defined in Eqs. (3) and (4) have been applied to the classification scores obtained by the CNNBasic and CNNBilinear architectures, and the results are obtained for the final CNNFusion, reported in Table 9. From this table, it has been observed that the fusion of CNN²Bilinear and CNNBilinear using the Product-Rule based method obtains higher performance for SFEW 1.0, SFEW 2.0, and the IMFDB database, and these performances are considered here to be the performance of CNNFusion.

4.2.4. Comparisons

Here, the proposed system's performance is compared with existing state-of-the-art models on SFEW 1.0, SFEW 2.0, and IMFDB. These performance comparisons are made both qualitatively and quantitatively. For qualitative comparisons, the features are extracted from the corresponding competing methods, and the classification results are obtained under the same training-testing protocol used by the proposed system. The qualitative performance comparisons are demonstrated in Table 10, where the performances are reported in terms of accuracy (%). From this table, it has been observed that the proposed system outperforms the other competing methods under several constrained and unconstrained imaging environments.

For quantitative comparisons, some statistical theories and hypothesis tests are applied to the performances obtained by the proposed system (I) and the other competing methods (i.e., A, B, C, D, E, F, G, H) shown in Table 10. For such comparisons, empirical studies must examine the cause and effect between the proposed system and the other competing methods. For this purpose, a One-Tailed t-Test [72] analysis has been performed. The one-tailed test is a one-sided test using the t-statistic. In this hypothesis testing, two hypotheses are assumed: (i) H0 (Null hypothesis): µI ≤ µcompeting-method (the mean performance of the proposed system (I) is less than or equal to the mean performance of the competing method); (ii) H1 (Alternative hypothesis): µI > µcompeting-method (the mean performance of the proposed system (I) is greater than the mean performance of the competing method). The acceptance or rejection of these hypotheses is based on the calculation of the t-statistic and its derived p-value, which should be <0.05. The acceptance or rejection of the null hypothesis will establish the quantitative comparison between the two methods.

Here, to make a convenient quantitative comparison between the proposed method (I) and the other competing methods, the corresponding test dataset of each employed database has been partitioned into four parts. For example, the SFEW 1.0 test dataset
Table 9
Effect of score-level fusion approaches on the performance of the proposed FERS, where CNN¹Bilinear and CNN²Bilinear indicate the performance of the Bilinear CNN due to the 'Fine Tuning with All Layers' and 'Fine Tuning with Only Fully Connected Layers' techniques reported in Table 8, respectively.

Method                                                  SFEW 1.0    SFEW 2.0    IMFDB
CNNBasic                                                37.91       41.79       38.93
CNNBilinear                                             47.56       79.88       61.29
Sum-Rule (CNNBasic, CNNBilinear)                        41.19       62.29       45.39
Product-Rule (CNNBasic, CNNBilinear)                    43.02       63.15       48.37
Sum-Rule (CNN¹Bilinear, CNNBilinear)                    47.39       79.18       62.05
Product-Rule (CNN¹Bilinear, CNNBilinear)                48.02       80.12       63.75
Sum-Rule (CNN²Bilinear, CNNBilinear)                    47.79       79.18       62.05
CNNFusion = Product-Rule (CNN²Bilinear, CNNBilinear)    48.15       80.34       64.17
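The sum- and product-rule combinations behind Table 9 can be sketched as follows — a generic reading of score-level fusion of two classifiers' softmax outputs; the exact weighting in Eqs. (3) and (4) may differ:

```python
import numpy as np

def fuse_scores(s1, s2, rule="product"):
    """Score-level fusion of two classifiers' class-score matrices
    (shape: n_samples x n_classes), renormalized to sum to one."""
    fused = s1 * s2 if rule == "product" else s1 + s2
    return fused / fused.sum(axis=1, keepdims=True)

s1 = np.array([[0.6, 0.3, 0.1]])   # e.g. scores from one architecture
s2 = np.array([[0.5, 0.2, 0.3]])   # e.g. scores from the other architecture
pred = fuse_scores(s1, s2).argmax(axis=1)  # fused class decision
```

The product rule sharpens agreement between the two models (classes scored low by either model are strongly suppressed), which is one common explanation for product-rule fusion edging out sum-rule fusion in tables like Table 9.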
Table 10
Performance comparison of the proposed FERS for the SFEW 1.0, SFEW 2.0, and IMFDB databases, respectively.

Method (Name)               SFEW 1.0 Accuracy (%)    SFEW 2.0 Accuracy (%)    IMFDB Accuracy (%)
Vgg16 [67] (A)              24.78                    54.82                    34.78
ResNet50 [68] (B)           24.98                    53.81                    41.66
Inception-v3 [69] (C)       29.52                    49.91                    39.76
Liu et al. [70] (D)         26.14                    41.41                    43.18
SPDNet [71] (E)             29.07                    58.14                    37.65
Hossain et al. [8] (F)      36.79                    40.32                    45.73
CNNBasic (G)                39.72                    41.79                    46.93
CNNBilinear (H)             47.56                    79.88                    61.29
CNNFusion (Proposed) (I)    48.15                    80.34                    64.17
has 354 samples, which are randomly divided into four test sets: Test-1 (89), Test-2 (89), Test-3 (88), and Test-4 (88). Similarly, there are Test-1 (164), Test-2 (163), Test-3 (163), and Test-4 (163) sets for the SFEW 2.0 test set, and Test-1 (1000), Test-2 (1000), Test-3 (1000), and Test-4 (1000) test sets for the IMFDB database. The performances on these test sets are obtained by applying the proposed method and the other competing methods. The list of quantitative comparisons for the SFEW 1.0, SFEW 2.0, and IMFDB databases has been demonstrated in Table 11. From the comparisons done in Table 11, it has been observed that each comparison between the proposed system and another competing method rejects the null hypothesis and accepts the alternative hypothesis, which means the performance of the proposed method (I) is greater than the performance of that competing method. Only the comparison of method (I) and method (H) accepts the null hypothesis, and this is because the performance of the proposed method (I) depends on the method CNNBilinear (H).

Here, face detection is performed for each image I to get the facial region of interest F within O(n) time. The extracted facial images F have xi input training samples, fi input feature maps, and k internal hidden layers, with each internal hidden layer containing hi input neurons and oi+1 output neurons [8]. The total time complexity of our proposed deep CNN model during training is measured by O(n) + O(xi × fi × hi^k × oi+1 × j) ≃ O(n + (n × n × n × n^k) × m) ≃ O(n^k × m) for m epochs, where j represents the number of iterations at each internal layer. So, the time complexity of the proposed system during testing is O(n) + O(l) ≃ O(2n) ≃ O(n), where O(l) is the classification time for identifying the facial expression on the facial image F using the prediction model of the proposed system. The time complexity of the proposed system in seconds for each test sample concerning the employed databases has been demonstrated in Table 12.

5. Conclusion

A fine-grained facial expression recognition system has been introduced in this paper. This proposed work aims to recognize seven basic facial expressions on the human face. The facial expression recognition system has diversified fields of application in healthcare, social artificial intelligence, industries, government, and academics. There are several challenging constrained and unconstrained imaging environments in the current scenario, such as noise, illumination, blurriness, and motion blur. The proposed facial expression recognition system handles challenging issues in identifying the facial expression of the human facial region. Here, the proposed method has been implemented in four steps. The primary component involves detecting facial landmarks from an input image and extracting the necessary face region. Then, in the second component, some discriminant and distinguishing feature representation schemes, followed by classification, have been proposed using deep learning approaches. In the first representation scheme, a CNN architecture that comprises several convolutional blocks, batch normalization, max-pooling, and activation, followed by successive fully-connected layers, has been proposed. In the second feature representation scheme, a deep CNN architecture, followed by bilinear pooling, matrix normalization, and dense layers, has been proposed. Several techniques, such as image augmentation, matrix normalization, fine-tuning of parameters, and transfer learning, have increased the recognition system's performance in the third component. Finally, the post-classification scores obtained from the proposed architectures have been fused to predict the expression in the last component. Extensive experimentation has been performed with the challenging Static Facial Expressions in the Wild (SFEW 1.0 and SFEW 2.0) and Indian Movie Face (IMFDB) benchmark databases. The results are compared with existing state-of-the-art methods and show that the proposed methodology obtains superior results.
Table 11
Quantitative comparison of the proposed method with the other competing methods reported in Table 10, keeping H0: µI ≤ µcompeting-method and H1: µI > µcompeting-method, with µ (mean performance), σ (standard deviation of performance), and N = 4 (number of tests). H0: null hypothesis; H1: alternative hypothesis.

For the SFEW 1.0 database
Method   Test-1   Test-2   Test-3   Test-4   µ       σ        t-statistic, p-value      Remarks
I        52.56    47.67    43.89    49.17    48.32   3.1123   –                         –
A        29.24    23.33    25.27    21.05    24.72   3.0055   10.9093, 0.000017 < α     H0 is rejected, I > A
B        28.17    24.15    26.12    20.18    24.94   2.5203   11.6746, 0.000011 < α     H0 is rejected, I > B
C        30.72    28.16    30.89    28.05    29.45   1.3518   11.1208, 0.000015 < α     H0 is rejected, I > C
D        27.05    25.39    28.18    24.28    26.22   1.4985   12.7943, 0.000007 < α     H0 is rejected, I > D
E        29.15    26.56    27.35    26.90    27.49   0.9985   12.7490, 0.000315 < α     H0 is rejected, I > E
F        37.78    34.23    38.13    36.90    36.76   1.5278   06.6699, 0.000274 < α     H0 is rejected, I > F
G        41.17    40.32    37.34    39.43    39.56   1.4243   05.1173, 0.001092 < α     H0 is rejected, I > G
H        48.90    47.31    43.56    49.05    47.21   2.2121   00.5853, 0.289800 > α     H0 cannot be rejected, I depends on H

For the SFEW 2.0 database
Method   Test-1   Test-2   Test-3   Test-4   µ       σ        t-statistic, p-value      Remarks
I        79.91    81.14    79.88    81.05    80.50   0.6009   –                         –
A        56.28    52.71    55.67    54.89    54.88   1.3502   34.6714, 0.000002 < α     H0 is rejected, I > A
B        54.67    53.87    52.89    55.18    54.15   0.8656   50.0129, 0.000002 < α     H0 is rejected, I > B
C        51.89    50.23    48.85    48.56    49.88   1.3196   42.2352, 0.000005 < α     H0 is rejected, I > C
D        41.17    40.93    42.21    40.78    41.27   0.5588   95.6162, 0.000004 < α     H0 is rejected, I > D
E        59.45    57.61    58.45    59.16    58.66   0.7107   46.9332, 0.000003 < α     H0 is rejected, I > E
F        39.78    40.15    41.02    40.39    40.33   0.4512   106.9147, 0.000000 < α    H0 is rejected, I > F
G        40.12    41.79    41.31    41.59    41.20   0.6478   88.9556, 0.000000 < α     H0 is rejected, I > G
H        78.16    79.78    80.56    81.01    79.87   1.0848   01.0160, 0.174400 > α     H0 cannot be rejected, I depends on H

For the IMFDB database
Method   Test-1   Test-2   Test-3   Test-4   µ       σ        t-statistic, p-value      Remarks
I        65.33    64.12    62.37    65.16    64.24   1.1774   –                         –
A        35.81    35.18    34.67    33.90    34.89   0.6998   42.8571, 0.000005 < α     H0 is rejected, I > A
B        42.05    41.98    42.17    41.67    41.96   0.1847   37.3889, 0.000001 < α     H0 is rejected, I > B
C        38.61    39.45    39.77    40.15    39.49   0.2905   40.8177, 0.000007 < α     H0 is rejected, I > C
D        43.16    42.81    43.53    43.49    43.24   1.4985   22.0389, 0.000029 < α     H0 is rejected, I > D
E        38.15    37.67    37.91    37.45    37.79   0.2616   43.8600, 0.000004 < α     H0 is rejected, I > E
F        44.45    46.81    45.48    46.41    45.78   0.9106   24.8044, 0.000014 < α     H0 is rejected, I > F
G        46.01    45.59    46.17    45.31    45.77   0.3397   30.1446, 0.000004 < α     H0 is rejected, I > G
H        62.81    60.62    61.67    60.39    61.37   0.9600   03.7784, 0.004599 > α     H0 cannot be rejected, I depends on H
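The t-statistics in Table 11 can be reproduced from the four per-split accuracies. The sketch below assumes population standard deviations (ddof = 0), which matches the printed σ values:

```python
import math

def one_tailed_t(scores_i, scores_j):
    """Two-sample t-statistic over N equal-sized test splits, using
    population variances (ddof = 0), as the Table 11 sigmas suggest."""
    n = len(scores_i)
    mu_i = sum(scores_i) / n
    mu_j = sum(scores_j) / n
    var_i = sum((x - mu_i) ** 2 for x in scores_i) / n
    var_j = sum((x - mu_j) ** 2 for x in scores_j) / n
    return (mu_i - mu_j) / math.sqrt(var_i / n + var_j / n)

# Proposed method (I) vs. Vgg16 (A) on the four SFEW 1.0 test splits.
t = one_tailed_t([52.56, 47.67, 43.89, 49.17], [29.24, 23.33, 25.27, 21.05])
# t ≈ 10.91, matching the first comparison row of Table 11.
```

Since the one-tailed p-value for this t at these sample sizes falls well below 0.05, H0 is rejected and I > A, as the table reports.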
Table 12
Time complexity of the proposed system in seconds concerning each employed database.
Database Face detection Preprocessing Feature representation & Total
classification
SFEW 1.0 0.2019 0.0341 0.9312 1.1672
SFEW 2.0 0.2136 0.0561 0.9456 1.2153
IMFDB 0.2235 0.0721 0.8321 1.1277
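Per-stage timings like those in Table 12 can be collected with a simple wall-clock harness; the stage functions below are placeholders standing in for detection, preprocessing, and classification:

```python
import time

def timed(stage_fn, *args):
    """Wall-clock one pipeline stage in seconds (illustrative harness)."""
    start = time.perf_counter()
    result = stage_fn(*args)
    return result, time.perf_counter() - start

# Placeholder stages; a real pipeline would pass the actual functions.
_, t_detect = timed(lambda img: img, "face.png")
_, t_prep = timed(lambda img: img, "face.png")
total = t_detect + t_prep   # per-sample total, as in the last column of Table 12
```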
CRediT authorship contribution statement

Sanoar Hossain: Investigation, Formal analysis, Original draft, Methodology. Saiyed Umer: Original draft, Preparation, Methodology, Supervision. Ranjeet Kr. Rout: Validation, Supervision. M. Tanveer: Conceptualization, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

References

[1] A. Fung, D. McDuff, A scalable approach for facial action unit classifier training using noisy data for pre-training, 2019, arXiv preprint arXiv:1911.05946.
[2] P. Ekman, Differential communication of affect by head and body cues, J. Personal. Soc. Psychol. 2 (5) (1965) 726.
[3] T.K. Pitcairn, S. Clemie, J.M. Gray, B. Pentland, Non-verbal cues in the self-presentation of Parkinsonian patients, Br. J. Clin. Psychol. 29 (2) (1990) 177–184.
[4] A.J. Fridlund, Human Facial Expression: An Evolutionary View, Academic Press, 2014.
[5] A. Mehrabian, Communication without words, in: Communication Theory, Routledge, 2017, pp. 193–200.
[6] K. Kaulard, D.W. Cunningham, H.H. Bülthoff, C. Wallraven, The MPI facial expression database—a validated database of emotional and conversational facial expressions, PLoS One 7 (3) (2012) e32321.
[7] S. Umer, R.K. Rout, C. Pero, M. Nappi, Facial expression recognition with trade-offs between data augmentation and deep learning features, J. Ambient Intell. Humaniz. Comput. (2021) 1–15.
[8] S. Hossain, S. Umer, V. Asari, R.K. Rout, A unified framework of deep learning-based facial expression recognition system for diversified applications, Appl. Sci. 11 (19) (2021) 9174.
[9] M. Abdul-Mageed, L. Ungar, Emonet: Fine-grained emotion detection with gated recurrent neural networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 718–728.
[10] H. Zhou, D. Meng, Y. Zhang, X. Peng, J. Du, K. Wang, Y. Qiao, Exploring emotion features and fusion strategies for audio-video emotion recognition, in: 2019 International Conference on Multimodal Interaction, 2019, pp. 562–566.
[11] G. Muhammad, M. Alsulaiman, S.U. Amin, A. Ghoneim, M.F. Alhamid, A facial-expression monitoring system for improved healthcare in smart cities, IEEE Access 5 (2017) 10871–10881.
[12] J. Paschen, J. Kietzmann, T.C. Kietzmann, Artificial intelligence (AI) and its implications for market knowledge in B2B marketing, J. Bus. Ind. Mark. (2019).
[13] M.A. Jarwar, I. Chong, Exploiting IoT services by integrating emotion recognition in Web of Objects, in: 2017 International Conference on Information Networking, ICOIN, IEEE, 2017, pp. 54–56.
[14] E. Bagheri, P.G. Esteban, H.-L. Cao, A.D. Beir, D. Lefeber, B. Vanderborght, An autonomous cognitive empathy model responsive to users' facial emotion expressions, ACM Trans. Interact. Intell. Syst. (TIIS) 10 (3) (2020) 1–23.
[15] J. Shen, H. Yang, J. Li, Z. Cheng, Assessing learning engagement based on facial expression recognition in MOOC's scenario, Multimedia Syst. (2021) 1–10.
[16] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Web-scale training for face identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2746–2754.
[17] P. Ekman, Cross-cultural studies of facial expression, in: Darwin and Facial Expression: A Century of Research in Review, Vol. 169222, No. 1, 1973.
[18] Y.-I. Tian, T. Kanade, J.F. Cohn, Recognizing action units for facial expression analysis, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2) (2001) 97–115.
[19] B. Fasel, J. Luettin, Automatic facial expression analysis: a survey, Pattern Recognit. 36 (1) (2003) 259–275.
[20] P. Ekman, W.V. Freisen, S. Ancoli, Facial signs of emotional experience, J. Personal. Soc. Psychol. 39 (6) (1980) 1125.
[21] A. Mollahosseini, B. Hasani, M.H. Mahoor, Affectnet: A database for facial expression, valence, and arousal computing in the wild, IEEE Trans. Affect. Comput. 10 (1) (2017) 18–31.
[22] Oxford English Dictionary, The principal historical dictionary of the English language, 2016.
[23] R.W. Levenson, P. Ekman, W.V. Friesen, Voluntary facial action generates emotion-specific autonomic nervous system activity, Psychophysiology 27 (4) (1990) 363–384.
[24] S. Kaiser, Facial expressions as indicators of "functional" and "dysfunctional" emotional processes, in: The Human Face, Springer, 2003, pp. 235–253.
[25] J. Panksepp, Affective Neuroscience: The Foundations of Human and Animal Emotions, Oxford University Press, 2004.
[26] J.F. Cohn, Z. Ambadar, P. Ekman, Observer-based measurement of facial expression with the Facial Action Coding System, Handb. Emot. Elicitation Assess. 1 (3) (2007) 203–221.
[27] W.M. Alaluosi, Recognition of human facial expressions using DCT-DWT and artificial neural network, Iraqi J. Sci. (2021) 2090–2098.
[28] M. Doroszuk, Facial action coding system (FACS)–practical application, Emotional Expression and Communication Magazine (2) (2018) 93–103.
[29] J. Hamm, C.G. Kohler, R.C. Gur, R. Verma, Automated facial action coding system for dynamic analysis of facial expressions in neuropsychiatric disorders, J. Neurosci. Methods 200 (2) (2011) 237–256.
[30] J.M. Girard, J.F. Cohn, M.H. Mahoor, S.M. Mavadati, Z. Hammal, D.P. Rosenwald, Nonverbal social withdrawal in depression: Evidence from manual and automatic analyses, Image Vis. Comput. 32 (10) (2014) 641–647.
[31] P. Ekman, E.L. Rosenberg, What the Face Reveals: Basic and Applied Studies of Spontaneous Expression using the Facial Action Coding System (FACS), Oxford University Press, USA, 1997.
[32] T. Qian, F. Zhang, S.U. Khan, Facial expression recognition based on edge computing, in: 2019 15th International Conference on Mobile Ad-Hoc and Sensor Networks, MSN, IEEE, 2019, pp. 410–415.
[33] Y.-L. Tian, T. Kanade, J.F. Cohn, Facial expression analysis, in: Handbook of Face Recognition, Springer, 2005, pp. 247–275.
[34] S. Setty, M. Husain, P. Beham, J. Gudavalli, M. Kandasamy, R. Vaddi, V. Hemadri, J.C. Karure, R. Raju, V.K. Rajan, C.V. Jawahar, Indian movie face database: A benchmark for face recognition under wide variations, in: National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2013.
[35] H. Siqueira, S. Magg, S. Wermter, Efficient facial feature learning with wide ensemble-based convolutional neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 04, 2020, pp. 5800–5809.
[36] W. Zhao, R. Chellappa, P.J. Phillips, A. Rosenfeld, Face recognition: A literature survey, ACM Comput. Surv. 35 (4) (2003) 399–458.
[37] V. Sharma, L. Van Gool, Does v-nir based image enhancement come with better features? 2016, arXiv preprint arXiv:1608.06521.
[38] V. Sharma, J.Y. Hardeberg, S. George, RGB-NIR image enhancement by fusing bilateral and weighted least squares filters, J. Imaging Sci. Technol. 61 (4) (2017) 040409-1–040409-9.
[39] V. Sharma, A. Diba, D. Neven, M.S. Brown, L. Van Gool, R. Stiefelhagen, Classification-driven dynamic image enhancement, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4033–4041.
[40] T.-Y. Lin, S. Maji, Improved bilinear pooling with cnns, 2017, arXiv preprint arXiv:1707.06772.
[41] T.-Y. Lin, A. RoyChowdhury, S. Maji, Bilinear convolutional neural networks for fine-grained visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 40 (6) (2017) 1309–1322.
[42] P. Moreno, A. Bernardino, J. Santos-Victor, Gabor parameter selection for local feature detection, in: Iberian Conference on Pattern Recognition and Image Analysis, Springer, 2005, pp. 11–19.
[43] H. He, S. Chen, Identification of facial expression using a multiple impression feedback recognition model, Appl. Soft Comput. 113 (2021) 107930.
[44] H. Ghazouani, A genetic programming-based feature selection and fusion for facial expression recognition, Appl. Soft Comput. 103 (2021) 107173.
[45] J. Yi, A. Chen, Z. Cai, Y. Sima, M. Zhou, X. Wu, Facial expression recognition of intercepted video sequences based on feature point movement trend and feature block texture variation, Appl. Soft Comput. 82 (2019) 105540.
[46] A. Barman, P. Dutta, Facial expression recognition using distance and texture signature relevant features, Appl. Soft Comput. 77 (2019) 88–105.
[47] Z. Sun, H. Zhang, J. Bai, M. Liu, Z. Hu, A discriminatively deep fusion approach with improved conditional GAN (im-cGAN) for facial expression recognition, Pattern Recognit. (2022) 109157.
[48] S. Kamal, F. Sayeed, M. Rafeeq, Facial emotion recognition for human-computer interactions using hybrid feature extraction technique, in: 2016 International Conference on Data Mining and Advanced Computing, SAPIENCE, IEEE, 2016, pp. 180–184.
[49] B. Yang, J. Wu, K. Ikeda, G. Hattori, M. Sugano, Y. Iwasawa, Y. Matsuo, Face-mask-aware facial expression recognition based on face parsing and vision transformer, Pattern Recognit. Lett. (2022).
[50] G. Yolcu, I. Oztel, S. Kazan, C. Oz, K. Palaniappan, T.E. Lever, F. Bunyak, Deep learning-based facial expression recognition for monitoring neurological disorders, in: 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM, IEEE, 2017, pp. 1652–1657.
[51] L. Yan, Y. Shi, M. Wei, Y. Wu, Multi-feature fusing local directional ternary pattern for facial expressions signal recognition based on video communication system, Alex. Eng. J. 63 (2023) 307–320.
[52] Y. Xie, T. Chen, T. Pu, H. Wu, L. Lin, Adversarial graph representation adaptation for cross-domain facial expression recognition, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1255–1264.
[53] T.-Y. Lin, A. RoyChowdhury, S. Maji, Bilinear cnn models for fine-grained visual recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.
[54] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[55] S. Umer, B.C. Dhara, B. Chanda, Face recognition using fusion of feature learning techniques, Measurement 146 (2019) 43–54.
[56] C. Yu, X. Zhao, Q. Zheng, P. Zhang, X. You, Hierarchical bilinear pooling for fine-grained visual recognition, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 574–589.
[57] T.-Y. Lin, S. Maji, Visualizing and understanding deep texture representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2791–2799.
[58] X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 2879–2886.
[59] S. Barra, S. Hossain, C. Pero, S. Umer, A facial expression recognition approach for social IoT frameworks, Big Data Res. (2022) 100353.
[60] L.A. Gatys, A.S. Ecker, M. Bethge, Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks, in: Bernstein Conference 2015, 2015, pp. 219–219.
[61] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.
[62] D. Scherer, A. Müller, S. Behnke, Evaluation of pooling operations in convolutional architectures for object recognition, in: International Conference on Artificial Neural Networks, Springer, 2010, pp. 92–101.
[63] A. Hernández-García, P. König, Further advantages of data augmentation on convolutional neural networks, in: International Conference on Artificial Neural Networks, Springer, 2018, pp. 95–103.
[64] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[65] R.A. Rasool, Feature-level vs. Score-level fusion in the human identification system, in: Applied Computational Intelligence and Soft Computing, Vol. 2021, Hindawi, 2021.
[66] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark, in: 2011 IEEE International Conference on Computer Vision Workshops, ICCV Workshops, IEEE, 2011, pp. 2106–2112.

[67] K. Simonyan, A. Zisserman, Very deep convolutional networks for [70] M. Liu, S. Li, S. Shan, X. Chen, Au-aware deep networks for facial expression
large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556. recognition, in: 2013 10th IEEE International Conference and Workshops
[68] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking on Automatic Face and Gesture Recognition, FG, IEEE, 2013, pp. 1–6.
the inception architecture for computer vision, in: Proceedings of the [71] D. Acharya, Z. Huang, D. Pani Paudel, L. Van Gool, Covariance pooling for
IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. facial expression recognition, in: Proceedings of the IEEE Conference on
2818–2826. Computer Vision and Pattern Recognition Workshops, 2018, pp. 367–374.
[69] C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, inception- [72] J.D. Gibbons, S. Chakraborti, Comparisons of the Mann-Whitney, Student’st,
resnet and the impact of residual connections on learning, in: Thirty-First and alternate t tests for means of normal distributions, J. Exp. Educ. 59
AAAI Conference on Artificial Intelligence, 2017. (3) (1991) 258–267.